
Introduction
Conventional NMR protein structures are derived from large numbers of short-range inter-hydrogen distances extracted from peak heights in NOE spectra. Since quantitative interpretation of NOE data requires prior knowledge of both the protein structure and its overall dynamics, NOE distances are generally used in a qualitative way during a typical structure determination (i.e. sorted into classes of short, medium, and long distance). These short-range distances are commonly supplemented by J-coupling values, which arise from interactions taking place through chemical bonds. Since these interactions generally span only one, two, or three chemical bonds, they are also short-range in nature. J-couplings can be measured with great precision. However, they are interpreted on the basis of empirical calibration curves (Karplus curves) which relate J-coupling values to molecular torsion angles. So, these relationships are often approximate, and they are commonly ambiguous, which is to say that a given observed J-coupling might be consistent with two very different torsion angles.

If we consider the complexity of this short-range information, even for a modest protein fold like the 76-residue ubiquitin, the difficulty of conventional NMR structure determination becomes apparent. Without counting stereospecific interactions, or interactions within residues, the H atoms in ubiquitin comprise a network of more than 1400 interactions where nuclei are within 5 angstroms. It is this complicated network which must be characterized in some way in order to conduct a conventional NMR structure determination. So, in order to make NMR structure calculation simpler, we would like to find ways which do not require analysis of such a complex network of interactions. And, in order to make NMR structure calculation more precise, we would like to rely primarily on parameters which can be interpreted quantitatively.

Chemical Shifts
In the first stages of NMR structure determination, we commonly assign the chemical shifts of the backbone atoms. These chemical shifts are strongly correlated with residue type, as shown in the plot of C-alpha vs C-beta chemical shift values, colorized by amino acid type. At one extreme lie Ala residues, at the other, Ser and Thr. We can roughly compensate for residue-type differences by subtracting residue-specific random coil shift values, to generate secondary chemical shifts. An example is shown in this plot of C-alpha vs C-beta secondary chemical shift values colorized by residue type. As indicated in the figure, the secondary shift distribution is roughly similar for all residue types, and is not random. As a clue to the information contained in secondary chemical shifts, consider the plot of C-alpha vs C-beta secondary shift for ubiquitin residues, colorized by structural motif. As shown, secondary shifts from helical residues tend to have values which are different from secondary shifts of residues in beta-sheets. In other words, the backbone secondary chemical shifts contain information about the backbone structure.

TALOS: Prediction of Backbone Angles
In an attempt to exploit this secondary shift information quantitatively, we used a simple database mining approach, implemented in the TALOS system. In this system, we have a database of known high-resolution structures and their measured chemical shifts. Then, given secondary chemical shifts of a triplet of residues in an unknown protein, we can search the database for triplets which have similar secondary shifts. If we find several good matches in the database, we can assume that the backbone angles of the central residues in the database triplets will be good predictors for the phi and psi angles of the central residue in the unknown protein. In practice we assemble a list of the 10 best matches from the database, which currently contains ~180 proteins.
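The triplet search described above can be illustrated with a minimal sketch. It is not the TALOS implementation: the array layout, the function name best_triplet_matches, and the use of an unweighted Euclidean distance between secondary shifts are all assumptions made for illustration, and the demonstration data are random numbers rather than real chemical shifts.

```python
import numpy as np

def best_triplet_matches(query, db_shifts, db_phi_psi, n_best=10):
    """Return the n_best database triplets closest to a query triplet.

    query      : (3, n_nuclei) secondary shifts for three consecutive residues
    db_shifts  : (n_triplets, 3, n_nuclei) secondary shifts of database triplets
    db_phi_psi : (n_triplets, 2) phi, psi in degrees of each central residue
    The unweighted Euclidean distance is an illustrative assumption.
    """
    diff = db_shifts - query                       # broadcast over all triplets
    dist = np.sqrt((diff ** 2).sum(axis=(1, 2)))   # one distance per triplet
    order = np.argsort(dist)[:n_best]              # indices of closest matches
    return order, dist[order], db_phi_psi[order]

# Synthetic demonstration data (random numbers, not real chemical shifts).
rng = np.random.default_rng(0)
db_shifts = rng.normal(size=(500, 3, 5))
db_phi_psi = rng.uniform(-180.0, 180.0, size=(500, 2))
query = db_shifts[42] + rng.normal(scale=0.01, size=(3, 5))

order, dist, angles = best_triplet_matches(query, db_shifts, db_phi_psi)
print(order[0], angles.shape)
```

In this toy case the query is a slightly perturbed copy of database entry 42, so that entry is returned first. The phi and psi angles of the returned matches are what would be examined for consensus.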
Cross-validation was used to characterize this database mining approach by testing how well each given known protein could be analyzed based on the remaining proteins in the database. For about 70% of the residues, there is a clear consensus of phi and psi values in the 10 best database matches, and in these "good" cases, the average and standard deviations of the phi and psi angles from the database are used as quantitative predictors for the backbone angles in the target protein. In the remaining cases, there is no consensus on phi and psi angles from the 10 closest database matches; these "ambiguous" cases are not used for prediction purposes. In practice, this TALOS database mining approach provides phi and psi estimates to better than 15 degrees RMS, although for about 2% of the TALOS predictions, the predicted angles are substantially different from the angles found in the reference structures. So, we characterize TALOS as having a 2% error rate in phi/psi prediction.

Prediction of Backbone Chemical Shifts
Given the information in the TALOS database, it is also possible to estimate backbone chemical shifts for a proposed structure. The simplest approach uses the database information to create Ramachandran surfaces of secondary chemical shift distribution with respect to phi and psi. Chemical shifts for a specific phi/psi value can be found simply by extracting the secondary shift from the given phi/psi point in the surface, and then adding a suitable random coil value. This simple approach, based on phi/psi values for individual residues, predicts backbone shifts with accuracies of C-alpha: 1.12 ppm, C-beta: 1.20 ppm, C': 1.29 ppm, N: 3.10 ppm, HN: 0.67 ppm, and H-alpha: 0.36 ppm.

Dipolar Couplings
In an isotropically tumbling molecule, dipole-dipole interactions are averaged to zero. But, if the molecule is in the presence of an aligned medium such as a liquid crystal, the molecule will interact with the aligned medium, and will no longer tumble isotropically.
Then, dipole-dipole interactions will no longer be averaged to zero, resulting in a dipolar coupling. These dipolar couplings can generally be measured by the same methods used to find J-couplings. And, the mathematical form for dipolar couplings can be described exactly for a rigid molecule, as follows.
The residual dipolar splitting between spins A and B equals:

$$D^{AB} = D^{AB}_{\mathrm{max}} \langle P_2(\cos\theta) \rangle, \qquad P_2(x) = \tfrac{1}{2}(3x^2 - 1)$$

If the molecule is rigid, the orientation of the internuclear vector, $r_{AB}$, in an arbitrary molecular coordinate system can be described by the angles $\alpha_x$, $\alpha_y$, and $\alpha_z$ between the vector and the x, y, and z axes of the coordinate system. The angles $\beta_x$, $\beta_y$, and $\beta_z$ define the instantaneous orientations of each of these axes relative to the static magnetic field. With $\cos\theta$ being the scalar product between a unit vector in the internuclear direction and a unit vector parallel to $B_0$, $\langle P_2(\cos\theta) \rangle$ can be rewritten as:

$$\langle P_2(\cos\theta) \rangle = \tfrac{3}{2} \langle (\cos\beta_x \cos\alpha_x + \cos\beta_y \cos\alpha_y + \cos\beta_z \cos\alpha_z)^2 \rangle - \tfrac{1}{2}$$

With $C_i = \cos\beta_i$ and $c_i = \cos\alpha_i$, this can be rewritten as:

$$\langle P_2(\cos\theta) \rangle = \tfrac{3}{2} \left[ \langle C_x^2 \rangle c_x^2 + \langle C_y^2 \rangle c_y^2 + \langle C_z^2 \rangle c_z^2 + 2 \langle C_x C_y \rangle c_x c_y + 2 \langle C_x C_z \rangle c_x c_z + 2 \langle C_y C_z \rangle c_y c_z \right] - \tfrac{1}{2}$$

By writing $S_{ij} = \tfrac{3}{2} \langle C_i C_j \rangle - \tfrac{1}{2} \delta_{ij}$, where $\delta_{ij}$ is the Kronecker delta function, we obtain:

$$\langle P_2(\cos\theta) \rangle = \sum_{i,j \in \{x,y,z\}} S_{ij} \cos\alpha_i \cos\alpha_j$$

The 3x3 matrix S is commonly referred to as the Saupe matrix, the Saupe order matrix, or simply the order matrix. As $\langle C_x^2 \rangle + \langle C_y^2 \rangle + \langle C_z^2 \rangle = 1$, the matrix S is traceless, and with $\langle C_i C_j \rangle = \langle C_j C_i \rangle$, S is also symmetric, and therefore only contains five independent elements. If the structure of the molecule is known, then the $\cos\alpha_i$ direction cosine factors can be computed from the atomic coordinates of spins A and B. This is an important result, because it means that the five independent elements of the Saupe matrix can generally be solved by linear least squares methods, provided that dipolar couplings for at least five internuclear vectors are available.
However, if any pair of internuclear vectors is parallel, and for other special cases such as a set that includes three mutually orthogonal interactions, more measured couplings are required. For macromolecules, many more dipolar couplings are frequently measured, and S is overdetermined. Its elements are then commonly determined using singular value decomposition.
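If the Cartesian coordinates of spin A and spin B are $\{x_A, y_A, z_A\}$ and $\{x_B, y_B, z_B\}$, we can define the direction cosines in terms of the coordinates as:

$$\cos\alpha_x = \frac{x_B - x_A}{|r_{AB}|}, \quad \cos\alpha_y = \frac{y_B - y_A}{|r_{AB}|}, \quad \cos\alpha_z = \frac{z_B - z_A}{|r_{AB}|}, \qquad |r_{AB}| = \sqrt{(x_B - x_A)^2 + (y_B - y_A)^2 + (z_B - z_A)^2}$$

Then, the five coefficients $s_1 \ldots s_5$ can be determined by SVD using the following basis set:

[equation missing in source]

where $\sigma$ is the estimated uncertainty in the measured coupling. The measured dipolar couplings $D^{AB}$ are used to build a set of equations:

[equation missing in source]

Given SVD solutions for coefficients $s_1 \ldots s_5$, the elements of the order matrix S are:

[equation missing in source]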
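As a hedged illustration of the fitting procedure, the sketch below solves for the order matrix with NumPy. The basis shown is one common choice and is an assumption here, not necessarily the basis used by the author: it eliminates $S_{xx}$ through the traceless condition and takes the five coefficients to be $S_{yy}$, $S_{zz}$, $S_{xy}$, $S_{xz}$, and $S_{yz}$. The function names and the synthetic test data are likewise illustrative.

```python
import numpy as np

def direction_cosines(coords_a, coords_b):
    """Unit vectors (cos a_x, cos a_y, cos a_z) from spin A to spin B."""
    v = np.asarray(coords_b, dtype=float) - np.asarray(coords_a, dtype=float)
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def basis(cosines):
    """Assumed five-term basis, using S_xx = -S_yy - S_zz (traceless S)."""
    cx, cy, cz = cosines.T
    return np.column_stack([cy**2 - cx**2, cz**2 - cx**2,
                            2*cx*cy, 2*cx*cz, 2*cy*cz])

def fit_order_matrix(cosines, couplings, d_max, sigma=1.0):
    """Least-squares fit of the order matrix from measured couplings."""
    couplings = np.asarray(couplings, dtype=float)
    sigma = np.broadcast_to(np.asarray(sigma, dtype=float), couplings.shape)
    a = d_max * basis(cosines) / sigma[:, None]    # rows weighted by 1/sigma
    b = couplings / sigma
    s, *_ = np.linalg.lstsq(a, b, rcond=None)      # SVD-based solver
    syy, szz, sxy, sxz, syz = s
    return np.array([[-syy - szz, sxy, sxz],
                     [sxy,        syy, syz],
                     [sxz,        syz, szz]])

# Synthetic check: recover a known order matrix from noise-free couplings.
rng = np.random.default_rng(1)
s_true = 1e-3 * np.array([[-0.4,  0.1, 0.2],
                          [ 0.1, -0.6, 0.3],
                          [ 0.2,  0.3, 1.0]])
coords_a = rng.normal(size=(20, 3))
coords_b = coords_a + rng.normal(size=(20, 3))
cosines = direction_cosines(coords_a, coords_b)
d_max = 1.0                                        # arbitrary units for the demo
couplings = d_max * np.einsum("ni,ij,nj->n", cosines, s_true, cosines)
s_fit = fit_order_matrix(cosines, couplings, d_max)
```

With 20 randomly oriented vectors the system is overdetermined, and the noise-free fit returns the symmetric, traceless matrix that generated the couplings. Passing an array for sigma weights each coupling by its own uncertainty.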
The order matrix is real and symmetric, and it is therefore always possible to define a molecular axis system where S becomes diagonal. In a number of applications it can be advantageous to work in this principal axis frame, where:

$$D^{AB}(\alpha_x, \alpha_y, \alpha_z) = \tfrac{3}{2} D^{AB}_{\mathrm{max}} \left\{ \left[ \langle C_x^2 \rangle c_x^2 + \langle C_y^2 \rangle c_y^2 + \langle C_z^2 \rangle c_z^2 \right] - \tfrac{1}{3} \right\}$$

where $\langle C_i^2 \rangle$ corresponds to the probability of finding the ith axis parallel to the magnetic field. Only the relative differences in the $\langle C_i^2 \rangle$ values contribute to the residual dipolar coupling. So, writing $\langle C_i^2 \rangle = \tfrac{1}{3} + A_{ii}$, the coupling can be expressed in polar coordinates ($\theta = \alpha_z$; $c_z = \cos\theta$; $c_x = \sin\theta \cos\phi$; $c_y = \sin\theta \sin\phi$) to yield:

$$D^{AB}(\theta, \phi) = \tfrac{3}{2} D^{AB}_{\mathrm{max}} \left[ \cos^2\theta \, A_{zz} + \sin^2\theta \cos^2\phi \, A_{xx} + \sin^2\theta \sin^2\phi \, A_{yy} \right]$$

Defining $|A_{zz}| > |A_{yy}| > |A_{xx}|$, and using $A_{yy} + A_{xx} = -A_{zz}$; $2\sin^2\phi = 1 - \cos 2\phi$; and $2\cos^2\phi = 1 + \cos 2\phi$, this can be rewritten as:

$$D^{AB}(\theta, \phi) = \tfrac{3}{2} D^{AB}_{\mathrm{max}} \left[ P_2(\cos\theta) A_{zz} + \tfrac{1}{2} \sin^2\theta \cos 2\phi \, (A_{xx} - A_{yy}) \right]$$

This leads to an expression of dipolar couplings in terms of alignment tensor parameters. From a graphical point of view, the alignment tensor can be visualized in terms of a 3D ellipsoid whose orientation corresponds with the axes of alignment, with the dimensions of the ellipsoid along each axis being $A_{zz}$, $A_{yy}$, and $A_{xx}$. As noted above, by definition $A_{zz}$ is the longest axis, $A_{yy}$ the next longest, and $A_{xx}$ the shortest.
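The move to the principal axis frame can be sketched numerically as an eigenvalue problem. The example below is an illustration under stated assumptions, not the author's code: the function name principal_alignment and the test values are made up, and the factor of 2/3 follows from the definitions in the text, since in the principal frame $S_{ii} = \tfrac{3}{2} \langle C_i^2 \rangle - \tfrac{1}{2} = \tfrac{3}{2} A_{ii}$.

```python
import numpy as np

def principal_alignment(order_matrix):
    """Diagonalize a symmetric, traceless order matrix S.

    Returns (a_xx, a_yy, a_zz) sorted so that |a_zz| >= |a_yy| >= |a_xx|,
    and a 3x3 matrix whose columns are the matching x, y, z principal axes.
    Uses S_ii = (3/2) A_ii in the principal frame, which follows from
    S_ii = (3/2)<C_i^2> - 1/2 and <C_i^2> = 1/3 + A_ii.
    """
    eigvals, eigvecs = np.linalg.eigh(order_matrix)   # real symmetric input
    idx = np.argsort(np.abs(eigvals))                 # smallest |value| first
    a = (2.0 / 3.0) * eigvals[idx]
    return a, eigvecs[:, idx]

# Demonstration with an arbitrary rotated tensor (made-up values).
a_true = np.array([-0.3e-3, -0.7e-3, 1.0e-3])         # a_xx, a_yy, a_zz
q, _ = np.linalg.qr(np.random.default_rng(2).normal(size=(3, 3)))
s = q @ np.diag(1.5 * a_true) @ q.T                   # S in a rotated frame
a_fit, axes = principal_alignment(s)
```

Sorting by absolute value rather than by signed value matters because the tensor is traceless: the component with the largest magnitude can be either positive or negative.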
Defining an axial component of the alignment tensor, $A_a = \tfrac{3}{2} A_{zz}$, and a rhombic component, $A_r = (A_{xx} - A_{yy})$, results in:

$$D^{AB}(\theta, \phi) = D^{AB}_{\mathrm{max}} \left[ P_2(\cos\theta) A_a + \tfrac{3}{4} A_r \sin^2\theta \cos 2\phi \right]$$

Note that the maximum value for $\langle C_i^2 \rangle$ is one, i.e., the maximum for $A_{zz}$ equals 2/3, and the maximum value for $A_a$ becomes one when the z axis of the principal alignment tensor becomes fully aligned with the static field. The above expression is often rewritten as:

$$D^{AB}(\theta, \phi) = D^{AB}_a \left[ (3\cos^2\theta - 1) + \tfrac{3}{2} R \sin^2\theta \cos 2\phi \right]$$

where $D^{AB}_a = \tfrac{1}{2} D^{AB}_{\mathrm{max}} A_a$ is referred to as the magnitude of the dipolar coupling tensor, which describes how strongly aligned the molecular system is, and $R = A_r / A_a$ is the rhombicity, which is the departure of alignment from axial symmetry. When the molecular system has been rotated so that its coordinate axes correspond with the alignment tensor axes, then the dipolar coupling can be computed as follows (this is the form used for our fitting of dipolar couplings by nonlinear least squares, for cases where there are restraints on one or more tensor parameters):

$$D^{AB} = D^{AB}_{\mathrm{max}} \left[ D_{\mathrm{axial}} (3 Z^{AB} Z^{AB} - 1) + \tfrac{3}{2} D_{\mathrm{rhombic}} (X^{AB} X^{AB} - Y^{AB} Y^{AB}) \right]$$

where $X^{AB}$, $Y^{AB}$, and $Z^{AB}$ are the components of the unit internuclear vector in the alignment frame.

A critical aspect of the dipolar couplings is their dependence on $\cos^2\theta$, which in practice means that there are two continuous ranges of orientation for the internuclear coupling vector which are consistent with a given coupling value, and they are mirror images of each other. A simple way to reduce this ambiguity is to prepare two different types of aligned media, for example a neutral one and a charged one. The nature of the interaction between the target molecule and the alignment media will be different in the two cases, resulting in two different and independent alignment systems. This restricts the orientations to only those positions which are consistent with both alignment tensors simultaneously.
Then, only the intersecting orientations will be consistent with coupling values from both samples. However, there will still generally be cases where more than one conformation is consistent with the dipolar data. Within the context of a protein, residues are arranged in preferred orientations relative to each other, and in most cases, only one of these will be consistent with the collection of dipolar couplings. So, one way to reduce the impact of ambiguity is to only consider physically realistic protein conformations. To help resolve potential ambiguity still further, we can employ secondary structure information from chemical shifts.

The molecular alignment frame serves as a reference system which establishes the relative orientation of one internuclear coupling vector with respect to any other, regardless of how far apart these internuclear vectors may be. For example, in the case of HN-N couplings, dipolar couplings tell about the orientation of an HN-N bond vector relative to any other HN-N bond vector. This long-range orientational information is very different in nature from the short-range distance and torsion information traditionally used in NMR structure calculation, and as we will show, it is a powerful complement to short-range data.

Direct Applications of Dipolar Couplings
As already explained, dipolar coupling values are determined by orientation. So, dipolar couplings for two parallel bond vectors should be identical, given scaling factors to compensate for bond distance and magnetogyric ratios. This can form the basis for some useful analysis. For example, side-chain orientation can be estimated by testing how dipolar couplings in the side chain are correlated to those of parallel bonds in the backbone.
Similarly, relative stereochemistry for molecules with several stereochemical centers can be identified by testing agreement of observed vs calculated dipolar couplings for all possible stereoisomers, and finding the stereoisomer with the best agreement between observed and calculated dipolar couplings. Dipolar couplings can be extremely sensitive to small changes in molecular coordinates, and they can be used directly in conventional structure determination. For example, consider the initial structure of a 69-residue protein fragment which has a ~7 Hz RMSD between observed and calculated HN-N dipolar couplings. This structure can be refined by conventional simulated annealing so that the couplings match to better than 1 Hz RMSD, but the refined backbone structure differs by less than 0.3 angstroms RMSD from the initial structure. As such, dipolar couplings can reveal structural details which might be difficult to characterize with NOE distances alone, for example the curvature of an isolated helix, as in the case of micelle-bound alpha-synuclein.

Structure Determination from Dipolar Couplings as the Primary Data
As noted above, a critical aspect of the dipolar couplings is their dependence on $\cos^2\theta$, which in practice means that there are two continuous ranges of orientation for the internuclear coupling vector which are consistent with a given coupling value, and they are mirror images of each other. As also noted, a simple way to reduce this ambiguity is to introduce couplings measured with another alignment tensor, which restricts the orientations to only those positions which are consistent with both alignment tensors simultaneously. Then, only the intersecting orientations will be consistent with coupling values from both samples. This greatly reduces the ambiguity of dipolar couplings, but does not eliminate it. It can be noted that bond vectors in a protein are not oriented randomly or uniformly.
For example, HN-N bond vector orientation surfaces for the ubiquitin and DinI proteins show that the distributions are systematic, and also different for the two proteins. This argues that when analyzing data for a sequence of many residues simultaneously, it is not strictly necessary to consider every possible orientation of individual bond vectors when exploiting dipolar couplings for protein structure determination. Since protein residues are arranged in preferred orientations relative to each other, one way to reduce the impact of ambiguity is to consider only physically realistic protein conformations. And, to help resolve potential ambiguity still further, we can employ structure information from other NMR observables such as chemical shifts in combination with dipolar coupling data. This leads to the database mining approach called Molecular Fragment Replacement (MFR).

Molecular Fragment Replacement (MFR)
The central concept of MFR is straightforward: identify short fragments of known high-resolution protein structures whose simulated NMR parameters are a good match for the observed NMR parameters of the target protein. Then, use suitable methods to assemble these fragments into larger elements of protein structure. The NMR parameters can include any combination of chemical shifts, dipolar couplings, J-couplings, sequential NOEs, etc., as well as residue-type homology. For each individual parameter, such as chemical shift, a score is computed on the basis of the RMS differences between observed and predicted values. A linear combination of these individual scores is used to form an overall MFR score which is used to rank the fragments. In practice, MFR uses fragment sizes of 5-10 residues, currently drawn from a subset of roughly 850 structures in the PDB database, all with resolutions better than 2.4 angstroms.
This provides a collection of more than 180,000 fragments, which is large enough to ensure that all physically realistic short fragment structures are represented. This means that MFR database mining can still be applied to novel proteins which have no known homologous structure in the PDB. Our first proof-of-concept result is shown in the MFR search results for residues 1-16 of ubiquitin, where we tallied the three best matching fragments in terms of simulated versus measured chemical shifts and dipolar couplings. Dipolar couplings for HN-N, HN-C', N-C', and HA-CA measured in two alignment media were employed. Using this data in an MFR search, the three database fragments with the best MFR scores all match the backbone structure of ubiquitin to better than 1 angstrom RMSD.

In a typical MFR search, we find overlapping collections of fragments which are best matches (i.e. lowest scores) according to the MFR scoring procedure. So for example, using a 7-residue fragment size, an MFR search will identify the 10 database fragments which are the best match for residues 1-7 in the target protein, the 10 best matches for residues 2-8 in the target protein, etc. Such a collection of fragments can be visualized as a Ramachandran trajectory where each fragment is represented as a collection of vectors connecting that fragment's phi,psi backbone angles on a series of Ramachandran surfaces for the target residue sequence. The Ramachandran trajectory of MFR database fragments for ubiquitin is a good illustration of typical MFR results where chemical shifts are used in combination with dipolar couplings from two aligned media. In such results, there are two notable aspects. First, for the majority of residues where measured NMR parameters are available, MFR provides unambiguous indication of the phi,psi conformation. Second, even in regions where the MFR results show structural diversity, there are generally some fragments which are good representatives of the ideal structure.
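The scoring and overlapping-window search described for MFR can be sketched as follows. This is a simplified illustration, not the MFR implementation: the function names, the dictionary layout, and the weights are hypothetical, and the predicted values for each candidate fragment are assumed to be precomputed. In a real search, predicting dipolar couplings for a fragment would also involve fitting an alignment tensor to that fragment, which is omitted here.

```python
import numpy as np

def windows(n_residues, size=7):
    """(first, last) residue numbers of every overlapping window, 1-based."""
    return [(start, start + size - 1)
            for start in range(1, n_residues - size + 2)]

def rms(observed, predicted):
    """Root-mean-square difference between observed and predicted values."""
    diff = np.asarray(observed, dtype=float) - np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def mfr_score(observed, predicted, weights):
    """Weighted sum of per-parameter RMS differences (lower is better)."""
    return sum(w * rms(observed[name], predicted[name])
               for name, w in weights.items())

def rank_fragments(observed, candidates, weights, n_best=10):
    """Indices and scores of the n_best lowest-scoring candidate fragments."""
    scores = np.array([mfr_score(observed, c, weights) for c in candidates])
    order = np.argsort(scores)[:n_best]
    return [(int(i), float(scores[i])) for i in order]

# Toy data: candidate 2 reproduces the "observed" values exactly.
rng = np.random.default_rng(3)
observed = {"shifts": rng.normal(size=7), "couplings": rng.normal(size=14)}
candidates = [{k: v + rng.normal(scale=1.0, size=v.shape)
               for k, v in observed.items()}
              for _ in range(5)]
candidates[2] = {k: v.copy() for k, v in observed.items()}
weights = {"shifts": 1.0, "couplings": 0.5}        # hypothetical weights
ranking = rank_fragments(observed, candidates, weights, n_best=3)
```

For a 76-residue protein and a 7-residue fragment size, windows(76) yields 70 overlapping windows, starting with residues 1-7 and 2-8, and rank_fragments would be called once per window.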
These two observations argue that the MFR search results should be a powerful and effective precursor to structure determination, given suitable protocols for converting MFR results into complete structures. However, it must be noted that it is not easy to use phi,psi conformations directly to build an entire protein structure. For example, if the exact phi,psi backbone angles from the crystal structure of ubiquitin are superimposed onto an ideal planar protein backbone geometry, the new structure differs from the original by more than 4 angstroms RMS. Nevertheless, it is possible to use MFR phi,psi values directly to build and refine elements of structure from 10 to 50 amino acids long, without the use of any distance restraints. In the ideal case of ubiquitin, where 4 types of high-quality dipolar couplings are available in two media for almost all residues, the entire protein fold can be determined from dipolar couplings and chemical shifts alone, to a backbone RMS of better than 1 angstrom.

MFR Alignment Tensor Estimates without Prior Knowledge of Structure
As noted above, most of the fragments from an MFR search are good representatives of the ideal structure of the target. This means that the tensor magnitude and rhombicity estimates from the best MFR fragments should be good estimates for the alignment tensor parameters of the entire intact protein. These estimates can be used in later structure refinement steps. The predicted tensor parameters for the protein as a whole are calculated from a weighted average of the tensor parameters from the entire collection of MFR fragments. The weighting is performed according to the structural consensus over each given range of residues. This method can estimate tensor magnitudes and rhombicities to 0.5 Hz relative to an HN-N coupling.
In the case where two alignment media are available, the MFR results can likewise be used to compute the relative difference in orientation between the two tensors, which can also be used as a restraint for later structure refinement.

Dynamics Information from MFR
In the case of a flexible backbone structure, the local tensor magnitude is scaled down by the internal dynamics order parameter. So, a simple plot of MFR fragment tensor magnitude estimates versus fragment starting residue will reveal the location of flexible regions as places in the graph where the tensor magnitude drops. This same approach can also be used to identify cases where domains within a protein have different alignment tensors.

GammaS Crystallin: A Practical Application of MFR
The MFR application to GammaS crystallin serves as an example of practical, high-quality structure determination based primarily on orientational restraints, supplemented by relatively small numbers of easy-to-assign NOE distances. GammaS is a 177-residue protein with two similar domains, for which a homologous structure (GammaB) with 50% sequence identity is known. Dipolar coupling data in two media were measured. In one medium, measurements included 144 HN-N, 111 CA-CB, 150 CA-C', and 134 N-C' dipolar couplings. In the second medium, measurements included 147 HN-N, 135 CA-CB, 153 CA-C', and 139 N-C' dipolar couplings. Conformational exchange resulted in missing amide signals for one residue in the N-terminal domain, and nine residues in the C-terminal domain, so that most of the "missing" coupling data is associated with the C-terminal domain. Side-chain $\chi_1$ angles were estimated from $^{3}J_{NC\gamma}$ and $^{3}J_{C'C\gamma}$ couplings, and $\chi_2$ angles from $^{3}J_{C\gamma C\delta}$. A deuterated sample was used to obtain 179 amide-amide NOEs; however, none of these represented interdomain contacts. So, a GammaS sample with 13C labeling for methyl side chains only was used to obtain 70 methyl-methyl NOEs, and these included 6 interdomain distances.
Using this data, the MFR search was conducted in two stages followed by a fragment refinement step. In the first MFR search, dipolar couplings were fit to database fragments using SVD, which is the default approach, and computationally fast. In this case, the tensor magnitudes and rhombicities are allowed to assume any value. So, in some cases, fragments with a "non-ideal" shape can be made to match the measured dipolar couplings by using tensor parameters which are not truly representative of the target protein. To improve this situation, the results of the first MFR search are used to estimate a reasonable tensor magnitude and rhombicity for each of the two alignment media. Then, a second MFR search is performed, this time with the tensor parameters held fixed at the estimated values. This requires nonlinear least-squares fitting of the dipolar couplings, which is much slower. However, this second search with restrained tensor parameters leads to a collection of fragments with fewer ambiguities. Finally, the fragments identified by this second MFR search are subjected to conventional low-temperature simulated annealing refinement, to make small adjustments in the fragments which improve their overall agreement with the measured dipolar couplings.

In the case of GammaS, the resultant collection of refined fragments has unambiguous phi,psi conformations for 90% of the residues. Furthermore, these fragments have a high degree of structural consensus, such that 50% of the residues have better than 5 degree RMS phi,psi consensus, and 33% of the residues have better than 3 degree RMS phi,psi consensus. At this point, we have a collection of structural data which is more or less ideal for 90% of the residues, but we don't know in advance which residues are the "ideal" ones. We can use a simple modification of a traditional annealing scheme to employ this data.
First, the MFR results are converted into phi,psi restraints for all residues where there is a consensus phi,psi conformation in all refined fragments. Then, these restraints are used in a conventional simulated annealing protocol, along with NOE distances, and the original dipolar couplings themselves. In the high-temperature phase, the force constants for the MFR-derived torsion restraints are held high, so that the MFR results maintain the local structure at early stages of the structure calculation. During cooling, the force constant of the MFR torsions is reduced, while the force constant for the NOEs and individual dipolar couplings is increased. So, as the structure approaches its ideal fold, the MFR torsion restraints become less important, and the individual dipolar couplings become more important in maintaining the local structure. This allows the final structure a chance to overcome any incorrect conformations in the MFR torsion restraints. In this case, the final MFR-derived structure of GammaS agrees with its homolog GammaB to 0.63 angstroms RMS for the N-terminal domain backbone, and 1.09 angstroms for the C-terminal domain. In particular, the N-terminal agreement is among the best between any NMR structure and its homolog.

Summary
Dipolar couplings and MFR database mining form the basis for a new approach to structure determination based primarily on quantitative orientational restraints, supplemented by small numbers of NOE distances. The MFR approach can also be used to estimate tensor parameters without prior knowledge of the structure, and to probe the dynamics of the molecular system. In our applications so far, the MFR approach has been quicker than conventional NMR structure determination, and has yielded better quality structures. It has also provided structural information for systems where NOE data is not obtainable or is not revealing.