Instructions for using the code for protein structure refinement against SAXS data Using a FastSAXS module

 Alexander Grishaev


Sep 3, 2009

Program names are in bold, input file names in bold/italic

This mode of refining the structure against SAXS data achieves a fast calculation of the scattering data predicted from the structural model by approximating an exact average of molecular orientations with a discrete sum over the set of equi-distant directions. General procedures for data processing and preparation prior to their input to the structural refinement modul are listed below. In the FastSAXS mode the complex scattering amplitude and its approximate angular average are used instead of the Debye formula. Therefore, globbic correction factors are not required as the summation includes all heavy atoms. The setup is illustrated in the approximate_angular_averaging archive. The approach requires calculation of the form factors for the heavy atoms in the macromolecule and surface water corrections using the procedure described below. The required input parameters are: number of fitted data points, extrapolated experimental scattering intensity at zero angle (can be calculated either from a Guinier plot, or a P(r) analysis), force constant for the fit (usually in the 25-100 range), the grid type (Fibonacci or spiral), and the parameter for the number of vectors on the grid (number order for the Fibonacci grid or number of vectors for the spiral grid). For the Fibonacci grid, values of 10-12 should provide an accurate representation for the scattering data up to qmax=0.3-0.4A-1. For the spiral grid, values of ~100 produce a similar quality of the predicted scattering data. The last parameters are the chain name and residue range for the segments active in the scattering data calculation. Scripts and programs used for structure refinement against SAXS data are listed in the archive.


General comments on data collection, processing and analysis.

Perfect match between the sample and buffer is essential for an accurate interpretation of the subtracted scattering data. Using the buffer the protein powder was dissolved in is not acceptable as minute salt concentration differences will cause a non-negligible difference between the solvent and buffer baselines. Acceptable ways to match solvent with the sample are (i) an extensive dialysis or (ii) passing of multiple fractions of solvent though the sample with ultracentrifugation though a semi-permeable membrane that keeps the solute inside the centrifugation unit. The latter method is recommended in case of unstable samples. The mismatch between the sample and buffer can only be detected if precise wide-angle data (qmax at least ~2.2 A-1) are recorded. When such data are not available, the effects of mismatch can include inaccurate determination of Rg and I(0). In cases of over-subtracted data (that decays too rapidly with increasing q relative to the properly recorded data) the sample might appear aggregated from the (P) analysis.

Before interpreting SAXS data, you have to convince yourself that your data corresponds to an isolated-particle form factor, free from any inter-particle correlation effects or aggregation/oligomerization. Several methods can be used to determine whether this is the case:

The absence of concentration dependence of the scattering data at low q (as the concentration is decreased by 50%, 75%, etc). indicates the absence of structure factor. If there is concentration dependence, the data can be linearly extrapolated to zero concentration using 3-6 concentration points going as low as the signal allows – 0.2-0.5 mg/mL on a synchrotron or 2-3 mg/mL on a lab source. A possible alternative it to find a concentration (not necessarily the lowest one in the series) which gives data indistinguishable from the lower concentrations, and use it for further analysis. Salt can also be added (100-200 mM) to suppress the electrostatic interaction between the molecules. This may lead to increased aggregation as well, so one has to be careful. In 150 mM salt buffer, absence of structure factor is often reached at ~3-5 mg/mL for proteins and 0.5-1 mg/mL for DNA/RNA. Data at higher concentrations are likely (especially for RNA/DNA) to contain an effect of the structure factor.

A possibility of aggregation or oligomerization equilibrium in the sample has to be ruled out. Linearity of the Guinier plot (up to q*Rg = 1.3) or smooth decay to zero of the P(r) at dmax can be used to rule out aggregation. In the opposite case, when the tail of P(r) keeps extending as dmax is increased, non-specific aggregation is likely. If the amount of dimer is small (below 10%), the effects are more subtle. The P(r) may still go to zero nicely, but the dmax and Rg are going to be higher than expected from the monomeric form. These effects are easier to detect only when the amount of dimer is large enough to yield parameters that are clearly incompatible with the monomeric ones. In some cases the inter-particle interference and aggregation can mask each other, making Rg appear normal at a particular concentration (they produce opposing effects on the low-q data). One method is to record scattering data of a standard with accurately known concentration (could be lysozyme, you’ll have to do zero-concentration extrapolation for it and watch out for radiation damage at synchrotron intensities), accurately measure the concentration of your protein (could be UV-VIS at denaturating conditions), and use the fact that I(q=0) should be proportional to MW*concentration (mg/mL). If your experimental I(0) value is substantially higher that predicted, your sample may be (at least, partially) aggregated. This method is quite sensitive because I(0) is the most precise parameter from the scattering data, but it takes more time.

 At this point I am assuming that the data are free from both structure factor and aggregation, and are consistent with the modeled oligomerization state. I will also assume that the input data are not smeared by wavelength or geometry. It means that this software would be inapplicable to the neutron scattering data (polychromatic radiation). Any geometry smearing, such as due to the slit shape of the incident beam, need to be corrected for before structure refinement (for example using GNOM software). In summary, I am assuming that the input data comes from a monochromatic source in pinhole geometry with negligible detector distortions. The software is also inapplicable for very high q data (q> ~1), for which accurate data measurement and predictions become difficult.

fsglob_prot.f program is used to obtain the form factors of each heavy atom at every experimental q value in the fitted data file. For this the following is needed: a family of structures (typically the best structures before SAXS data fit, not the initial random coil/extended strand/etc models); atomic form factors in the 4-Gaussian parameterization (formfact.inp, included); solvent volumes displaced by each atom (Vj.inp, included, matches D. Svergun’s values used for CRYSOL); data (your own); file specifying the heavy atoms grouped with their attached hydrogens (choose either glob_pdb.inp if atom names are in the PDB convention or glob_cns.inp if they are in the Xplor/CNS convention, both are included), and file specifying residues (choose either res_PDB.inp if atom names are in the PDB convention or res_CNS.inp if they are in the Xplor/CNS convention, both are included. The Fortran fsglob_globf.f code has to be edited to include the following: data file name and number of data points. The files.inp file (included) has to be edited to specify how many PDB files are going to be read in (1st line), and file names of each file (the following lines, 4-letter format). The program code has then to be compiled and executed. The output will consist of the following: fsglob.out (average globbic form factors, the only file that will be needed for the subsequent work), fs2glob.out (rms of these values, for your information only), at_scat.out (atomic form factors, for your information only) and glob_dist.out (average and rms of the atomic distances within globs, for your information only, but check that rms values are small and average values match the expected bond lengths and inter-atomic distances otherwise there might be a problem in glob specification or structural geometry).

After this is done, the surface water layer correction will need to be computed. This is done by running CRYSOL 2.5 (no explicit hydrogens) in a particular way, described below. Basically, the current structure is fitted to the experimental data by CRYSOL twice: first, when forcing drho=0.0 and second, when letting drho float within its default range. All other adjustable parameters (average atomic radius and total excluded volume) are kept fixed at their default values. This is what I do after I start the program:

Read in the PDB file (option 0)
Order of harmonics: 50
Order of Fibonacci grid: 18
Maximum s value (Crysol uses s=4*p*sin(q)/l): whatever is higher than the experimental qmax
Number of points: press ENTER for default.
Fit the experimental curve: press ENTER for default and select your data file.
Angular units: q units should be A-1 (option 1).
Solvent electron density – press ENTER for default (you’ll have to reset this value everywhere, including fsglob_prot and isglob_prot, if the electron density of your solvent differs significantly from that of the pure H2O - that includes pure D2O) or if the temperature differs by more than ~10° from 20-25°C. You do not have to worry about this is you just have H2O with up to a 200 mM NaCl, for example.
Plot the data: press ENTER.
Once you see the plot, press ENTER.
Another set of parameters: type Y to over-ride the default.
Minimize again with new limits: type Y to over-ride the default
Minimum radius: type in whatever it shows as the average atomic radius (see in the program output above, should be close to 1.616 A for proteins).
Maximum atomic radius: repeat the same value.
Smax in the fitting range: press ENTER for default.
Minimum contrast in the shell: press ENTER for default.
Maximum contrast for the shell: here, you’ll have to run the program twice. Once you enter 0.0 when you are running the fit with no bound water, and the second time simply press ENTER for default when you allow dr to vary.
Minimum excluded volume: enter the value reported by CRYSOL as Displaced volume (see above)
Maximum excluded volume: re-enter the same value.
Plot the fit: press ENTER.
Once you see the plot, press ENTER.
Another set of parameters: press ENTER to terminate the program.  

The two *.fit files that CRYSOL will produce (while forcing dr=0.0 and while setting it free) will be used to extract the fitted curves produced by these runs (I use Excel for this). The fitted curve is the last column of the *.fit file. I normalize, in Excel, each of these two curves by the fitted values at q=0.0. Be careful here since, in addition to predicting the q=0.0 value (which you will need), CRYSOL will also predict intensities between q=0.0 and the qmin of your data file. So I normalize the curves first and then discard the lines corresponding to the spurious q values. The end result will be two curves that each start at 1.00 for q=0.0. and go to 10-2-10-4 at the high end of the typical q range. The dr=0.0 curve will need to be divided by the “floating dr” curve, point by point, and generate a two-column “bound water correction” file that has q as the first column and the ratio as the second column. If this is done properly, the correction will start at 1.00 at q=0.0 and will eventually reach values larger than 1.0 at the high end of the q-range and contain only the q values that are present in the experimental (sparsened) data file. This file will be needed as input to the Xplor/CNS structure refinement. This procedure will have to be repeated after every cycle of structure refinement. Keep an eye on the fitted drho value. As it converges, the surface water correction will converge as well. If the structure was over-expanded to begin with, you may see fitted dr starting at 0.0 and increasing to higher values after subsequent rounds of structure refinement. In this case, you will also see a decrease in the gyration radius of the macromolecule. The final bound layer density is often close to 1.1ro. When starting from an over-expanded model (and, thus from dr=0) one can end up with artificially low final values for dr. In these cases dr can be fixed at 1.1 ro from the beginning.

A. Grishaev, J. Ying, M. Canny, A. Pardi, & A. Bax “Solution structure of tRNAVal from joint refinement against dual alignment residual dipolar couplings and SAXS data”, J. Biomol. NMR (2008) 42, 99-109.

The script including the SAXS term is summarized below with the saxs-related parts highlighted in blue.

define( md.seed=823641; )
set seed=&md.seed end
parameter @dna-rna-allatom.param @axis.par end
structure @tRNA_val.mtf @tRNA_mod.mtf @axis1.mtf @axis2.mtf end
coor initialize end
coor @trnaval_newpf1_msa_3_1.pdb @tVt7.pdb
do (x=x-400.0) (segi TPHE)

! *********************************** SAXS data ***********************************
 ndat = 33        ! number of fitted data points
 kfor = 100.0   ! force constant
 I0   = 2.32      ! extrapolated zero angle intercept for the scattering intensity
 updf = 50       ! update frequency
 ngrd = 90       ! number of vectors on the equi-distant grid
 grid = spir      ! grid type - spiral
 segm = TVAL !  SEGID included in the saxs term
 nmin = 1         ! min RESID included in the saxs term
 nmax = 76      ! max RESID included in the saxs term
 @saxs_files_zeroC_075.dat ! control file for the data and correction terms


  flags exclude * include bond angl impr vdw ncs noe sani saxs end ! saxs data term has to be active
  igroup interaction (all) (all) weights * 1 end end


  evaluate ($filename="trnaval_newpf1_msa_4_1.pdb")
  set print=$filename end


print saxs end ! this will print saxs fitting info into the header of the pdb file


    write coordinates sele= (not segi TPHE ) output =$filename end


[ Home ] [ NIH ] [ NIDDK ] [ Disclaimer ] [ Copyright ]
last updated:  Nov 2009 / Webmaster