Effects of chemical shift assignments quality on CS-ROSETTA structure generation

De novo protein structure generation from incomplete chemical shift assignments

Yang Shen¹, Robert Vernon², David Baker² and Ad Bax¹

¹ Laboratory of Chemical Physics, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, MD 20892-0520, U.S.A.
² Department of Biochemistry and Howard Hughes Medical Institute, University of Washington, Seattle, WA 98195

Abstract

NMR chemical shifts provide important local structural information for proteins. Consistent structure generation from NMR chemical shift data has recently become feasible for proteins with sizes of up to 130 residues, and such structures are of a quality comparable to those obtained with the standard NMR protocol. This study investigates the influence of the completeness of chemical shift assignments on structures generated from chemical shifts. The Chemical-Shift-Rosetta (CS-Rosetta) protocol was used for de novo protein structure generation with various degrees of completeness of the chemical shift assignment, simulated by omission of entries in the experimental chemical shift data previously used for the initial demonstration of the CS-Rosetta approach. In addition, a new CS-Rosetta protocol is described that improves robustness of the method for proteins with missing or erroneous NMR chemical shift input data. This strategy, which uses traditional Rosetta for pre-filtering of the fragment selection process, is demonstrated for two paramagnetic proteins and also for two proteins with solid-state NMR chemical shift assignments.

Key words: NMR chemical shift; protein structure prediction; solid-state NMR structure determination; paramagnetic protein; CS-Rosetta

Chemical shifts are key to protein NMR spectroscopy not only because they allow separate observation of each ¹H, ¹³C, and ¹⁵N nucleus in the molecule, but also because they carry important information on the local conformation (Saito 1986; Spera and Bax 1991; Williamson and Asakura 1993; Williamson et al. 1995; Asakura et al. 1997; Ando et al. 1998; Cornilescu et al. 1999; Castellani et al. 2003; Neal et al. 2006), including secondary structure (Wishart et al. 1991), hydrogen bonding (Wagner et al. 1983; Shen and Bax 2007) and the position and orientation of aromatic rings (Haigh and Mallion 1979; Case 1995). Protein structural information derived from chemical shifts, such as the backbone φ/ψ torsion angles predicted by the program TALOS (Cornilescu et al. 1999), is widely used in NMR structure determination, but almost invariably as a complement to conventional NOE distance restraints or to internuclear distance restraints obtained by solid-state NMR. Recently, several computational approaches have been developed to use the NMR chemical shifts alone as input for protein structure generation (Cavalli et al. 2007; Gong et al. 2007; Shen et al. 2008; Wishart et al. 2008). These approaches, represented by CHESHIRE (Cavalli et al. 2007), CS-Rosetta (Shen et al. 2008) and CS23D (Wishart et al. 2008), match the experimental chemical shifts of the backbone and ¹³C^β atoms, which are commonly available at the early stage of the conventional NMR structure determination procedure, to a structural database to identify protein fragments with similar chemical shifts. Because the structural database of proteins for which actual NMR assignments are available remains relatively small, empirical relationships (Cornilescu et al. 1999; Neal et al. 2003; Kontaxis et al. 2005; Shen and Bax 2007) are commonly used to "assign" chemical shift values to nuclei in proteins of known structure.
Selected protein fragments are then used as input for a fragment assembly procedure, which also aims to optimize empirical energy terms related to hydrogen bonding, hydrophobic packing, etc., to generate an all-atom protein structure. These approaches have been evaluated for over two dozen proteins with sizes of up to 15 kD and a wide variety of folds. For the vast majority, convergence is obtained, which then invariably yields all-atom protein models that compare well with experimental structures, with root-mean-square deviations (rmsds) from the conventionally determined reference structure in the 0.7-2 Å range for the backbone atoms, and ~1.4-3 Å when considering all atoms. Structures generated by the CS-Rosetta procedure for nine structural genomics target proteins, prior to completion of the conventional NMR structure determination process (Shen et al. 2008), demonstrate that the procedure is a viable alternative for small to medium-size proteins (Gryk and Hoch 2008).

To date, the chemical shift based structure determination methods have been evaluated for proteins with complete or nearly complete NMR chemical shift assignments. In practice, however, resonance assignments are often incomplete, and also may contain a small fraction of erroneous assignments. Often, a completeness of >80-90% of the backbone sequence-specific assignments makes it possible to obtain a sufficient number of side-chain resonance and NOE assignments for deriving a dense network of distance restraints, needed for the conventional NMR structure determination procedure. The present study investigates the impact of incomplete chemical shift assignments on the NMR chemical shift based CS-Rosetta protocol by using chemical shift assignments with various degrees of completeness or correctness, simulated by omission and/or modification of entries in the experimental chemical shift data. For cases where a substantial fraction of the chemical shifts is missing or in error and the standard fragment CS-Rosetta protocol is found to fail, a more robust hybrid fragment selection method is described which largely resolves this limitation.

In recent years, several viable routes to resonance assignment and structure determination of small globular proteins by solid-state NMR (ssNMR) have been demonstrated (Castellani et al. 2002; Igumenova et al. 2004; Siemer et al. 2005; Zech et al. 2005; Nadaud et al. 2007; Loquet et al. 2008; Manolikas et al. 2008), relying mostly on ¹³C-¹³C, ¹⁵N-¹³C, and/or indirectly measured ¹H-¹H distance restraints. Chemical shift assignments of ssNMR spectra typically are obtained by sophisticated two- and three-dimensional ¹³C-detected analogs of the widely used triple resonance J-connectivity experiments used in solution NMR. However, with few exceptions (Agarwal et al. 2006; Chevelkov et al. 2006), ¹H resonance assignments are usually not determined when studying a protein structure by these methods. For a variety of technical reasons, spectral resolution obtained for small proteins by ssNMR is often lower than what can be obtained for such proteins in solution (Tycko 1996), resulting in increased signal overlap and a considerable fraction of missing resonance assignments. For cases where protein structures have been determined both by solution and by solid-state NMR methods, results are generally quite similar (Manolikas et al. 2008), and chemical shifts observed in the solid state generally agree well with those seen in solution (Igumenova et al. 2004; Zech et al. 2005). On the other hand, exceptions are often seen for residues involved in intermolecular contacts, i.e., surface-exposed residues, reflecting the different protein sample conditions. It is therefore interesting to evaluate to what extent the CS-Rosetta approach is applicable to proteins whose chemical shifts have been determined by solid-state NMR. Indeed, as we demonstrate for two small proteins, ubiquitin and GB3, CS-Rosetta yields good structural models when using solely the ssNMR chemical shift assignments as input.

A second challenging area, where often a considerable fraction of chemical shift assignments are missing, concerns paramagnetic metalloproteins. About 25% of all proteins in living systems contain metal ions (Andreini et al. 2004) and in many of these cases the metal is paramagnetic (Fe²⁺/³⁺, Cu²⁺, Co²⁺, Ni³⁺), where the presence of unpaired electrons causes very rapid transverse relaxation for nearby nuclei, interfering with use of the standard ¹H-detected triple resonance assignment strategy (Ikura et al. 1990; Montelione and Wagner 1990). Although ¹³C-detected experiments can yield relief (Montelione and Wagner 1990; Bertini et al. 2005; Bermel et al. 2006), collection of ¹H-¹H NOE restraints remains problematic in the vicinity of paramagnetic centers. The degree of paramagnetic broadening scales with the inverse sixth power of the distance to the metal, resulting in a sphere with radius of ca. 5-15 Å around the metal where assignments are missing. In addition, if protons are observed and assigned, they may contain paramagnetic pseudo-contact contributions to their chemical shifts, which are not easily accounted for in the absence of a known structure, and therefore can impact the molecular fragment search of the CS-Rosetta protocol in a similar manner as assignment errors. We will show, however, that the hybrid CS-Rosetta protocol is quite tolerant to these problems, and demonstrate its application to two small paramagnetic proteins of known structure.

Methods and Materials

In this work, the original complete experimental chemical shift assignments, including δ¹⁵N, δ¹³C', δ¹³C^α, δ¹³C^β, δ¹H^α and δ¹H^N, for proteins MrR16 (90 residues; PDB code: 1YWX; 514 available chemical shifts from BMRB #6799) and TM1442 (110 residues; PDB code: 1SBO; 647 available chemical shifts from BMRB #5921) are used. The entries of the chemical shift assignments of each protein are regrouped and/or modified to create new datasets that simulate the chemical shift inputs with various degrees of completeness and/or chemical shift errors. The CS-Rosetta protein structure generation protocol is carried out for these differently prepared chemical shift input data sets, but following exactly the same computational procedures. The impact of the incompleteness and/or incorrectness of chemical shift assignments on the CS-Rosetta procedure is evaluated both by monitoring the accuracy of the selected fragments and by the quality and convergence of the generated CS-Rosetta all-atom models.

Preparation of chemical shift datasets

Three groups of incomplete or partially erroneous chemical shift assignments were generated using the original (nearly complete) experimental chemical shift assignments of proteins MrR16 and TM1442. Details regarding the assignments of the two paramagnetic proteins, and the proteins studied by solid-state NMR, are also provided below.

I. Simulated datasets with missing chemical shift assignments for certain types of nuclei. Depending on the strategy used for backbone resonance assignment, chemical shift assignments for certain types of backbone nuclei may not be available. Table 1 lists the chemical shift datasets generated for MrR16 and TM1442 by omitting the entries of the experimental chemical shift assignments of up to four types of nuclei (represented by datasets Ia-Ii). Except for datasets Ig (containing δ¹⁵N, δ¹³C^α, δ¹³C^β and δ¹³C' for all residues) and Ii (δ¹³C^α and δ¹³C^β), these datasets all include δ¹⁵N, δ¹³C^α and δ¹H^N, which constitute the minimum set of protein backbone chemical shifts required for conventional triple resonance assignment. Dataset Ig, containing only ¹³C and ¹⁵N chemical shifts, was generated to simulate a typical solid-state NMR chemical shift dataset. Dataset Ij retains all six types of chemical shifts. Dataset Ik contains no chemical shifts and is included to show the impact of chemical shifts relative to standard Rosetta fragment selection.

Table 1. Chemical shift datasets with partial assignment of backbone nuclei

Dataset name	δ¹⁵N	δ¹H^N	δ¹³C^α	δ¹³C^β	δ¹³C'	δ¹H^α
Ia	●	●	●	●	×	●
Ib	●	●	●	×	●	●
Ic	●	●	●	●	●	×
Id	●	●	●	×	×	●
Ie	●	●	●	●	×	×
If	●	●	●	×	●	×
Ig	●	×	●	●	●	×
Ih	●	●	●	×	×	×
Ii	×	×	●	●	×	×
Ij	●	●	●	●	●	●
Ik	×	×	×	×	×	×

● chemical shifts present; × chemical shifts absent
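The construction of the Table 1 datasets can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the in-memory layout (a dictionary mapping residue number to a dictionary of nucleus name and shift in ppm), the nucleus labels and the function name `make_dataset` are hypothetical and are not part of the CS-Rosetta distribution.

```python
# Hypothetical layout: {residue_number: {nucleus_label: shift_in_ppm}}.
# Labels are illustrative: N, HN, CA, CB, C (carbonyl C'), HA.
NUCLEI = ("N", "HN", "CA", "CB", "C", "HA")

# Nucleus types retained in each dataset, transcribed from Table 1.
TABLE1 = {
    "Ia": {"N", "HN", "CA", "CB", "HA"},
    "Ib": {"N", "HN", "CA", "C", "HA"},
    "Ic": {"N", "HN", "CA", "CB", "C"},
    "Id": {"N", "HN", "CA", "HA"},
    "Ie": {"N", "HN", "CA", "CB"},
    "If": {"N", "HN", "CA", "C"},
    "Ig": {"N", "CA", "CB", "C"},
    "Ih": {"N", "HN", "CA"},
    "Ii": {"CA", "CB"},
    "Ij": set(NUCLEI),
    "Ik": set(),
}


def make_dataset(shifts, name):
    """Return a copy of `shifts` keeping only the nucleus types of dataset `name`."""
    keep = TABLE1[name]
    return {
        res: {nuc: ppm for nuc, ppm in atoms.items() if nuc in keep}
        for res, atoms in shifts.items()
    }
```

For example, `make_dataset(shifts, "Ig")` leaves only the ¹⁵N and ¹³C shifts of every residue, mimicking a typical solid-state NMR assignment.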

II. Datasets with unassigned residues. To simulate the situation of proteins with unassigned residues, for both MrR16 and TM1442 sets of "incomplete" chemical shift assignments were generated by omitting all chemical shifts (δ¹⁵N, δ¹³C', δ¹³C^α, δ¹³C^β, δ¹H^α and δ¹H^N) for ~10% or ~20% of the residues from their original complete chemical shift datasets. Two different types of partial chemical shift assignments were generated in this manner. First, a favorable but perhaps unrealistic set was generated where the unassigned residues are evenly distributed along the protein sequence by deleting chemical shifts of residue numbers 10N (data set IIa) or 5N (data set IIb), where N = 1, 2, 3, .... Second, more realistic sets of partial assignments were generated where the unassigned residues are consecutive along the protein sequence, exemplifying the situation where residues of one or two segments in the protein are not assigned. Considering that such unassigned stretches of residues are often located in loop or turn regions, we arbitrarily selected such regions with length of ca. 8-10% of the entire sequence, and removed their chemical shift assignments from the datasets. For MrR16, the deleted regions comprise two loops, residues 24-32 (between the second β-strand and the first α-helix, referred to as loop I) and 43-50 (which connects the first α-helix and the third β-strand, referred to as loop II); for TM1442, the two loops comprise residues 21-29 (between the third β-strand and the first α-helix, loop I) and 52-59 (between the fourth β-strand and the second α-helix, loop II).
For each protein, three chemical shift assignment datasets were generated and named as follows: dataset IIc, for which all assignments of loop I are omitted, simulating the situation that the residues in loop I are “unassigned”; dataset IId, for which the residues in loop II are "unassigned"; and dataset IIe, for which the residues in both loops I and II, comprising ~16-19% of the total number of residues in the protein, are "unassigned".
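The two ways of simulating unassigned residues can be illustrated as follows. The sketch assumes that data sets IIa and IIb drop every 10th and every 5th residue, respectively, and reuses a hypothetical dictionary layout (residue number mapped to its shifts); the function names are illustrative only.

```python
def drop_every_nth(shifts, n):
    """Remove all shifts of residues whose number is a multiple of n (IIa: n=10, IIb: n=5)."""
    return {res: atoms for res, atoms in shifts.items() if res % n != 0}


def drop_segments(shifts, segments):
    """Remove all shifts of residues inside any (first, last) segment, bounds inclusive."""
    def unassigned(res):
        return any(first <= res <= last for first, last in segments)

    return {res: atoms for res, atoms in shifts.items() if not unassigned(res)}


# Loop boundaries for MrR16 as given in the text.
MRR16_LOOP_I = (24, 32)
MRR16_LOOP_II = (43, 50)
```

With these helpers, dataset IIc for MrR16 would correspond to `drop_segments(shifts, [MRR16_LOOP_I])` and dataset IIe to `drop_segments(shifts, [MRR16_LOOP_I, MRR16_LOOP_II])`, which removes 17 of the 90 residues.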

III. Simulated datasets with artificial errors. In practice, various kinds of chemical shift assignment errors can occur during the protein resonance assignment process, either resulting from mistakes during automated resonance assignment, or from human errors. In order to evaluate the impact of such "random" errors on the CS-Rosetta structure generation, several chemical shift assignment datasets were generated by swapping the chemical shift assignments of two dipeptides with identical amino acid types along the protein sequence. In dataset IIIa, the two swapped dipeptides have the same secondary structure (for MrR16, Val⁴²-Leu⁴³ and Val⁸⁵-Leu⁸⁶, both in α-helices; for TM1442, Ile¹⁶-Val¹⁷ and Ile⁴⁷-Val⁴⁸, both in β-strands). In dataset IIIb, the two swapped dipeptides are located in different secondary structures (for MrR16, Leu³⁹-Val⁴⁰ in the first α-helix and Leu⁵⁰-Val⁵¹ in the third β-strand; for TM1442, Ser⁵²-Ser⁵³ in the loop between the fourth β-strand and second α-helix, and Ser⁸²-Ser⁸³ in the last β-strand).

Chemical shift referencing errors also are common, and the resulting "artificial" chemical shift offsets are easily simulated by systematically altering chemical shifts of certain types of nuclei. Here, we evaluate two such datasets: IIIc, for which 1.0 ppm was added to all ¹³C^α and ¹³C^β chemical shifts as the artificial chemical shift referencing error; and dataset IIId, for which 1.7 ppm was added to all ¹³C^α and ¹³C^β chemical shifts.
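Both kinds of artificial error are simple to reproduce. The following sketch assumes a hypothetical dictionary layout (residue number mapped to a dictionary of nucleus label and shift in ppm); the function names and nucleus labels are illustrative and not taken from any existing tool.

```python
import copy


def swap_dipeptides(shifts, first, second):
    """Swap all shifts of residues (first, first+1) with those of (second, second+1)."""
    out = copy.deepcopy(shifts)
    for k in (0, 1):
        out[first + k], out[second + k] = out[second + k], out[first + k]
    return out


def add_offset(shifts, nuclei, offset_ppm):
    """Add a constant referencing offset to every shift of the given nucleus types."""
    out = copy.deepcopy(shifts)
    for atoms in out.values():
        for nuc in nuclei:
            if nuc in atoms:
                atoms[nuc] += offset_ppm
    return out
```

Under these assumptions, dataset IIIa for MrR16 would be `swap_dipeptides(shifts, 42, 85)`, and datasets IIIc and IIId would be `add_offset(shifts, ("CA", "CB"), 1.0)` and `add_offset(shifts, ("CA", "CB"), 1.7)`. Both functions return modified copies, so the original assignment table is left untouched.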

IV. Experimental chemical shifts from solid-state NMR. The ¹⁵N, ¹³C^α, ¹³C^β and ¹³C' chemical shifts of GB3 and ubiquitin, as determined by ssNMR spectroscopy, were taken from the BMRB (accession codes 15283 (Nadaud et al. 2007) and 7111 (Manolikas et al. 2008)). For both proteins, the high-resolution solution NMR structures (PDB entries 2OED (Ulmer et al. 2003) and 1D3Z (Cornilescu et al. 1998), respectively) were used as the experimental reference structures to evaluate the CS-Rosetta all-atom models.

V. Experimental chemical shifts from paramagnetic proteins. The ¹⁵N, ¹³C^α, ¹³C^β, ¹³C', ¹H^α and ¹H^N chemical shift assignments of two paramagnetic proteins, calbindin (75 residues; with a paramagnetic Yb³⁺ ion in the C-terminal metal binding site and Ca²⁺ in the N-terminal site) and ferredoxin (98 residues; with a [2Fe-2S] cofactor), were taken from BMRB entries 15594 (Barnwal et al. 2008) and 5148 (Muller et al. 2002), respectively. The experimental structure of calbindin is taken from a 1.6 Å X-ray structure (PDB entry 4ICB) of diamagnetic Ca²⁺-calbindin (Svensson et al. 1992); the NMR structure (PDB entry 1JQ4) of the [2Fe-2S] ferredoxin (Muller et al. 2002), for which the above NMR chemical shift assignments were obtained, is used as the experimental reference structure for this protein.

Protein fragment selection and structure generation protocols

The newly extended Rosetta protein structural database, comprising a total of 9,523 proteins, was supplemented with predicted ¹³C^α, ¹³C^β, ¹³C^', ¹⁵N, ¹H^α and ¹H^N chemical shifts by the program SPARTA (Shen and Bax 2007). Then, for each 3-residue and 9-residue fragment in the query protein, selection of database fragment candidates was performed in two different ways.

(1) standard MFR fragment selection: 200 fragment candidates with best matched backbone NMR chemical shifts and amino acid sequence patterns were selected by using a standard MFR search of the protein structural database (Kontaxis et al. 2005; Shen et al. 2008).

(2) hybrid fragment selection: As indicated in Fig. 1, an exhaustive search was first conducted throughout the protein structural database by using the standard Rosetta method (Rohl et al. 2004) to find the 2,000 database fragments with the best matched amino acid sequence and sequence-derived secondary structure patterns. A second search was then performed on these 2,000 fragment candidates to select the 200 fragments with the best matched chemical shifts pattern according to a chemical shift score of

S_{CS} = \frac{1}{N} \sum_{j} \sum_{i} c_i \, \frac{\left( \delta_{i,j}^{\mathrm{exp}} - \delta_{i,j}^{\mathrm{SPARTA}} \right)^2}{\left( \sigma_{i,j}^{\mathrm{SPARTA}} \right)^2}    [1]

as defined in Eq 3 of Shen et al. (2008), where δ_i,j stands for the chemical shifts of atom i (i = ¹³C^α, ¹³C^β, ¹³C', ¹⁵N, ¹H^α and ¹H^N) of residue j in the fragment; δ^exp_i,j is the experimental chemical shift in the target segment; δ^SPARTA_i,j and σ^SPARTA_i,j denote the SPARTA-derived chemical shifts and uncertainties, respectively, for the fragments in the protein structural database; N is the total number of chemical shifts in the fragment; c_i is the weighting factor for each atom type (1.0 for ¹³C^α, ¹³C^β, ¹³C', ¹H^α; 0.9 for ¹H^N and ¹⁵N). For all tests, proteins with significant sequence homology, as judged by a PSI-BLAST (Altschul et al. 1997) e-score < 0.05 to the target protein, were excluded from the protein structural database before fragment searching. Note that this removal is only needed for the tests carried out in this study; in real applications the presence of homologous proteins will increase the quality of the resulting structures.
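The two-stage logic of the hybrid selection can be sketched as below. This is not the actual MFR or Rosetta code. It assumes that the chemical shift score is a weighted chi-square normalized by the number of compared shifts (the authoritative definition is Eq 3 of Shen et al. 2008), that lower scores are better in both stages, and that `sequence_score` is a hypothetical callable standing in for the Rosetta sequence and secondary structure fragment score. The candidate layout (a dictionary with `pred` and `sigma` entries) is likewise illustrative.

```python
# Weighting factors per atom type as quoted in the text.
WEIGHTS = {"CA": 1.0, "CB": 1.0, "C": 1.0, "HA": 1.0, "HN": 0.9, "N": 0.9}


def cs_score(exp, pred, sigma):
    """Assumed form: weighted, normalized chi-square over all shifts present in `exp`.

    exp, pred, sigma: one {nucleus: value} dict per residue of the fragment.
    """
    total, n = 0.0, 0
    for e, p, s in zip(exp, pred, sigma):
        for nuc, d_exp in e.items():
            if nuc in p and nuc in s:
                total += WEIGHTS[nuc] * ((d_exp - p[nuc]) / s[nuc]) ** 2
                n += 1
    # Without any shifts all candidates tie, so the pre-filter order is preserved.
    return total / n if n else 0.0


def hybrid_select(candidates, exp, sequence_score, n_prefilter=2000, n_final=200):
    """Pre-filter by the (hypothetical) sequence score, then rank by chemical shifts."""
    pre = sorted(candidates, key=sequence_score)[:n_prefilter]
    ranked = sorted(pre, key=lambda f: cs_score(exp, f["pred"], f["sigma"]))
    return ranked[:n_final]
```

Because Python's `sorted` is stable, a target segment without any assigned shifts simply returns the top sequence-based candidates, which mirrors the intended fallback to sequence information when chemical shifts are missing.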

The selected fragments, represented by their idealized backbone torsion angles and the secondary structure classification for each residue, were used in the standard Rosetta manner as inputs for a Monte Carlo assembly and relaxation process to generate ca. 10,000 Rosetta all-atom models for each protein. These all-atom models were further evaluated in terms of fitness with respect to the input chemical shift data, following the same procedure used in the standard CS-Rosetta protocol (Shen et al. 2008), contributing to the empirical energy term that is used for the selection of final all-atom models.

All CS-Rosetta structure generations were performed using Rosetta@home (http://boinc.bakerlab.org/rosetta/) supported by the BOINC server or the Biowulf PC/Linux cluster at the NIH (http://biowulf.nih.gov).

Evaluation of CS-Rosetta structure generation

To evaluate the influence of the completeness of chemical shift assignments on the CS-Rosetta protein structure generation process, the following parameters are monitored and analyzed:

1. Average value for the coordinate rmsd between 200 selected fragments and the experimental coordinates of the corresponding target segment, representing the average accuracy or "quality" of the selected fragments.

2. Lowest coordinate rmsd of any of the 200 selected fragments relative to the experimental coordinates of the corresponding target segment, representing the accuracy or quality of the best fragment.

3. Raw Rosetta all-atom empirical energy of the assembled full-atom models.

4. Re-scored Rosetta all-atom empirical energy, which includes the agreement with the input chemical shift data (Shen et al. 2008).

5. C^α coordinate rmsd of Rosetta all-atom models relative to the experimental protein structure, representing the "accuracy" of the generated all-atom models.

6. C^α coordinate rmsd of the ten models with the lowest re-scored empirical Rosetta all-atom energies relative to the model with the lowest energy, representing the convergence of the generated all-atom models. Clustering of these lowest energy models within ~2.0 Å from the model with the lowest energy, or within ~2.0 Å from the experimental structure, is taken as the criterion for a successful prediction (Shen et al. 2008).

7. Distribution of energies and C^α coordinate rmsds of the Rosetta models relative to the model with the lowest re-scored empirical Rosetta all-atom energy, illustrating the convergence of the CS-Rosetta procedure.
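Parameters 1, 2 and 6 of the list above can be expressed compactly. The sketch below is illustrative only: it assumes that fragment rmsd values have already been computed, and that `rmsd` is a user-supplied (hypothetical) callable returning the C^α coordinate rmsd between two models after superposition. Only the first alternative of the convergence criterion (clustering around the lowest-energy model) is implemented.

```python
from statistics import mean


def fragment_quality(rmsds):
    """Return (average rmsd, lowest rmsd) over the selected fragments of one segment."""
    return mean(rmsds), min(rmsds)


def is_converged(models, rmsd, n_lowest=10, cutoff=2.0):
    """True if the n_lowest lowest-energy models all lie within `cutoff` (in Angstrom)
    of the lowest-energy model. Each model is a dict with an 'energy' entry."""
    ranked = sorted(models, key=lambda m: m["energy"])
    best = ranked[0]
    return all(rmsd(best, m) <= cutoff for m in ranked[1:n_lowest])
```

In a benchmark setting, where the experimental structure is known, the same function can be reused by passing an `rmsd` callable that measures the distance of each model to the reference structure instead.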

Results and discussion

During the CS-Rosetta structure generation, the input chemical shifts serve two major functions: fragment selection and re-scoring of the Rosetta models (Fig. 1). Use of the chemical shift information during the fragment search process significantly increases the accuracy of selected fragments over the use of sequence information alone (Shen et al. 2008), and dramatically improves convergence of the structure generation process. Evaluation of the agreement between the final Rosetta-generated models and the input experimental chemical shifts also provides an important selection criterion for eliminating structures whose backbone angles have diverged from those of the original input fragments during the Rosetta optimization procedure. In practice, frequently not all chemical shifts (δ¹⁵N, δ¹³C', δ¹³C^α, δ¹³C^β, δ¹H^α and δ¹H^N) of all residues are available, depending on the resonance assignment strategy chosen and/or missing connectivities in the assignment pathway, most often resulting from conformational exchange on an intermediate time scale. The completeness of the chemical shift assignment will impact both the fragment selection and the re-scoring steps, and thereby the entire CS-Rosetta structure generation procedure. The impact of missing chemical shifts on each of these steps will be discussed below.

Absence of assignments for certain types of nuclei

It is well recognized that secondary chemical shifts of different nuclei in any given residue are correlated (Supplementary Fig. S1), and this correlation can be used effectively to identify potential errors in chemical shift referencing (Wang et al. 2005). The structural information contained in the chemical shifts of the different types of backbone nuclei therefore may be partly redundant. The standard CS-Rosetta protocol utilizes chemical shifts of all backbone and ¹³C^β atoms to select the best matched 3-residue and 9-residue fragments. This redundancy suggests that the absence of assignments for some of this set of six nuclei (N, H^N, C^α, H^α, C^β, C') may not significantly decrease the accuracy of the selected fragments. This issue will be evaluated below for the chemical shift combinations listed in Table 1.

Omission of a single type of chemical shift (δ¹³C', δ¹³C^β or δ¹H^α) is found to have very little adverse impact on the quality of selected fragments (Fig. S2A-C), either when using the regular MFR selection protocol or the hybrid method. There also appears to be little systematic difference in the accuracy of fragments selected with the regular MFR protocol or the hybrid method when using these sets of chemical shifts, although the individual sets of fragments selected by the two methods can differ substantially. This holds true both when considering the average backbone rmsd relative to the reference structure, and for the rmsd of the fragment most closely matching the reference structure (Fig. S2). In passing, we note that the moderate differences in the quality of the fragments are not easily evaluated from figures such as Fig. S2, but these differences propagate during the Monte Carlo Rosetta structure generation process, dramatically impacting the yield of converged structures.

Although the accuracy of the fragments selected when omitting two types of chemical shifts (either δ¹³C'/δ¹³C^β, δ¹³C'/δ¹H^α, δ¹H^α/δ¹³C^β, or δ¹H^N/δ¹H^α) decreases somewhat (Fig. S2D-G), this decrease is small compared to the variation in accuracy seen for different fragments along the sequence of the two proteins.

For MrR16, the quality of fragments obtained by using the chemical shift assignments of only ¹H^N, ¹⁵N, and ¹³C^α, or of only ¹³C^α and ¹³C^β, is not much lower than for fragments selected using more complete assignments (Fig. S2). As a result, the convergence of the CS-Rosetta structure generation process remains adequate and permits assembly of reasonable Rosetta models, albeit with raw Rosetta all-atom energies that are not as low as for structures obtained from using all six types of chemical shifts for fragment searching (Fig. S3). Similar results are obtained for TM1442 (Fig. 2).

Remarkably, even though the accuracy of the resulting structures decreases when just using ¹H^N, ¹⁵N, and ¹³C^α chemical shifts, or just ¹³C^α and ¹³C^β chemical shifts, lowest energy structures remain close to the reference structure, in particular when the hybrid fragment selection method is used. A survey of the energies of the Rosetta-assembled structures and their accuracies (Fig. S3) indicates that the original MFR fragment selection results in higher yields during structure generation than the hybrid fragment selection method when assignments are relatively complete. However, for MrR16, the hybrid method outperforms the regular MFR method for datasets Id, If and Ih (Fig. S3); for TM1442 the hybrid method outperforms the regular MFR approach for datasets Ib, If and Ih (Fig. S4). For the case where no chemical shifts are available, only the standard Rosetta approach can be used. No convergence is then reached for MrR16, whereas for TM1442 the lowest energy models fall within 4 Å from the reference structure and relaxed convergence criteria are met (Fig. S5).

The calculations discussed above, and summarized in Figures 2 and S2-S5, indicate that resonance assignments are not required for all six types of nuclei for the CS-Rosetta structure generation process to succeed. The order of importance of each type of chemical shift can be ranked as δ¹³C^α ~ δ¹³C^β > δ¹H^α ~ δ¹³C' > δ¹⁵N ~ δ¹H^N. For proteins where all or the vast majority of these chemical shifts are available, the standard MFR fragment selection protocol tends to yield better accuracy of the selected MFR fragments and higher convergence, as well as lower energy when generating the all-atom Rosetta models. The calculations also suggest that the chemical shift assignment dataset needed for the CS-Rosetta protocol at a minimum comprises δ¹⁵N, δ¹H^N and δ¹³C^α, which also are the cornerstone nuclei during triple resonance backbone assignment, complemented by either δ¹³C', δ¹³C^β or δ¹H^α.

Absence of assignments for subsets of residues

The standard MFR fragment selection procedure, implemented in the previously described CS-Rosetta protocol, relies primarily on the match between the experimental ¹³C^α, ¹³C^β, ¹³C', ¹⁵N, ¹H^N and ¹H^α secondary chemical shift values of each residue in any given 3- or 9-residue query fragment, and the SPARTA-generated secondary shift values for the corresponding residues in any fragment present in the structural database. The similarity in amino acid sequence is also used in this scoring process but carries a much weaker weighting. However, when most or all chemical shifts are missing for any given residue or group of residues, the relative importance of similarity in residue type increases and eventually becomes the only criterion when no chemical shifts are available at all. Clearly, the absence of chemical shift information leads to a decrease in accuracy of the fragments that can optimally be selected from any structural database (Shen et al. 2008).

For the relatively favorable situation, where residues with missing chemical shifts are distributed evenly throughout the protein sequence, the chemical shift patterns encoded in the 9-residue target fragments only sustain a small fractional loss in information content when a single residue in such a fragment is missing. Indeed, the quality of the MFR-selected fragment candidates for chemical shift assignment datasets IIa and IIb (see Preparation of chemical shift datasets section) remains quite good (Fig. S6). For the 3-residue fragments, where the loss of assignments for one residue represents a 33% loss in information content, results are less favorable. In particular, when the backbone angles within the 3-residue fragment strongly differ from one another, i.e., when the fragment is not embedded in an α-helix or β-strand, results from the MFR search can be poor. For example, for the 3-residue TM1442 fragments containing residue Lys⁸⁵ (an N-terminal helix capping residue), omission of its chemical shifts (dataset IIb) causes a large spike in the coordinate rmsd when using the regular MFR fragment search (Fig. S6B′′). Nevertheless, because the adverse impact of lacking chemical shift assignments on the quality of the 9-residue fragments remains small, the Rosetta fragment assembly process remains capable of generating high quality models. This result applies for both MrR16 and TM1442 (Figs. S7 and S8), but for both proteins convergence to the correct structure is lower than when using a complete set of chemical shift assignments.

A more realistic but also more challenging situation occurs when the unassigned residues cluster along the protein sequence. The MFR fragment selection then becomes dominated by residue type similarity between the query fragment and fragments present in the structural database. The accuracy of fragments that include such unassigned segments, selected by the standard MFR method, is severely affected (Fig. S6C,D), in particular when the missing assignments are located outside regions of secondary structure (datasets IIc and IId). Interestingly, the quality of these fragments tends to be much lower than what is achieved with the standard Rosetta fragment selection method (Fig. S5), highlighting that the simple residue similarity scoring used by the MFR method performs much worse than the far more elaborate Rosetta fragment selection protocol (Rohl et al. 2004). Unsurprisingly, the subsequent Rosetta structure assembly protocol, using standard MFR fragments as input, can fail to obtain a converged low-energy fold (Figs. S7, S8). On the other hand, for MrR16 the CS-Rosetta structure generation for dataset IIc, lacking assignments for residues 24-32, remains successful and finds a converged low-energy fold, where the backbone of the lowest energy model deviates by 1.8 Å from the experimental reference structure (Fig. S7). Even though the quality of 9-residue fragments encompassing this region with missing assignments is poor, the accuracy of the best 3-residue fragments selected remains quite good for this region, and the powerful combinatorial engine of Rosetta can exploit the presence of a relatively small subset of accurate fragments for this single region during the assembly process. For the case where two regions with missing assignments are present in the protein (dataset IIe), CS-Rosetta with standard MFR selection is no longer able to obtain converged low energy structures (Figs. 3, S7 and S8).

One way to improve the selection of suitable fragments, and thereby the CS-Rosetta structure generation process, for proteins with extended segments of missing chemical shift assignments is to take advantage of the standard Rosetta fragment selection procedure (Rohl et al. 2004), which searches for matched database fragments based on a relatively sophisticated procedure that simultaneously exploits residue type similarity and predicted secondary structure. Amino acid sequence similarity alone provides less structural information than the backbone chemical shifts, and therefore results in a wider distribution of selected peptide conformations. The average quality of Rosetta-selected fragments therefore is significantly lower than for MFR selection based on chemical shifts, but the quality of the best fragments (out of 200 selected) remains quite good, in particular for the 3-residue fragments (Shen et al. 2008). A preferred way to score the fragments therefore would directly combine, with suitable weight factors, the amino acid sequence based Rosetta fragment score with the chemical shift component of the MFR score. For technical reasons, however, this is not easily accomplished and we therefore resort to a simpler protocol which equally takes advantage of the strengths of both approaches. This hybrid fragment selection procedure first uses standard Rosetta to select the 2000 database fragments (out of over 2,200,000) that are most compatible in terms of amino acid sequence, and then uses MFR chemical shift scoring to narrow down this set to fragments that are most compatible with the experimental shifts. When complete chemical shifts are available, this hybrid method performs slightly worse than the regular MFR procedure (Figs. S2-S4). However, when significant segments in the protein lack assignments, the hybrid method remains perfectly successful at generating low energy, converged and accurate results. 
For example, when using the ‘hybrid’ fragments selected with chemical shift datasets IIc-IIe, lacking chemical shifts for two extended loop regions, the Rosetta fragment assembly and relaxation protocol results in near-convergence for TM1442, yielding lowest energy models that are within 2.5 Å C^α rmsd of the reference structure (Fig. 3).
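The two-stage logic of the hybrid selection described above can be sketched as follows. This is a minimal illustration and not the CS-Rosetta implementation: the `Fragment` record, its precomputed `sequence_score` and `shift_score` fields, and the function name `hybrid_select` are hypothetical, and lower scores are assumed to be better. Only the cohort sizes (2000 and 200) are taken from the text.

```python
from dataclasses import dataclass
import heapq


@dataclass(frozen=True)
class Fragment:
    """Hypothetical database fragment with two precomputed scores (lower is better)."""
    frag_id: int
    sequence_score: float  # stand-in for the Rosetta sequence/secondary-structure score
    shift_score: float     # stand-in for the MFR chemical shift score


def hybrid_select(fragments, n_prefilter=2000, n_final=200):
    """Pre-filter by sequence score, then rank the survivors by chemical shift score."""
    # Stage 1: keep only the fragments most compatible in terms of sequence.
    cohort = heapq.nsmallest(n_prefilter, fragments, key=lambda f: f.sequence_score)
    # Stage 2: within that cohort, keep the fragments that best match the shifts.
    return heapq.nsmallest(n_final, cohort, key=lambda f: f.shift_score)
```

In this sketch a fragment with an excellent chemical shift score is never returned if it fails the sequence-based pre-filter, which is the property that limits the damage done by missing or erroneous shifts.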

Impact of chemical shift assignment errors

A potential error during conventional and/or automated backbone resonance assignments is the case where chemical shift assignments of two di- or tripeptide sequences of similar amino acid sequence, embedded between residues with similar chemical shifts, are accidentally interchanged. Below, we consider the case where assignments for two dipeptides with identical amino acid types are interchanged.

For the favorable situation where the two dipeptides are located in segments with the same secondary structure, as exemplified in dataset IIIa, the chemical shift patterns in the 3-residue and 9-residue fragments are virtually unchanged and there is essentially no adverse impact on the fragment selection for either the standard MFR or the hybrid approach (Fig. S9). Clearly, generation of Rosetta structures also remains unaffected (Figs. S10 and S11).

For the case where the two misassigned dipeptides are engaged in different types of secondary structure, the incorrect chemical shift values are likely to favor selection of fragments with backbone torsion angles that deviate substantially from the true values, resulting in a significant decrease in the quality of MFR-selected fragments. This is particularly true for the 3-residue fragments (Fig. 4A,B; Fig. S9), where the fraction of erroneous assignments equals two thirds. Not surprisingly, the subsequent Rosetta fragment assembly and relaxation protocol has trouble generating well converged models. Although the lowest (re-scored) energy models exhibit folds that are essentially correct, these differ by ~2.56 Å and ~3.47 Å (C^α-rmsd) from the experimental structures of MrR16 and TM1442, respectively (Fig. 4C; Figs. S10 and S11). The re-scored all-atom energies of these models are also systematically higher than those obtained when using the correct chemical shift assignments.

When using the hybrid fragment selection method, the impact of erroneous assignments is reduced considerably, and acceptable convergence is achieved (Fig. 4; Figs. S9-S11).

As pointed out by Wang et al. (2005), nearly 30% of the deposited chemical shift data in the BMRB have chemical shift referencing problems. Such referencing errors are most prevalent for ¹³C^α/¹³C^β, but also are common for ¹³C' and ¹⁵N. Below, we evaluate the impact of ¹³C^α/¹³C^β referencing errors. As will be shown, the fragment search procedure is relatively insensitive to moderate errors in ¹³C^α/¹³C^β chemical shift referencing, in part because ¹³C^α and ¹³C^β secondary shifts are anti-correlated. For example, a 4 ppm reference error could change a typical β-sheet secondary ¹³C^α shift of -1 ppm to an α-helical 3 ppm value. However, the +2 ppm β-sheet secondary ¹³C^β shift would become +6 ppm, completely incompatible with a helical conformation, preventing the residue from being misidentified as helical. To first order, the impact of ¹³C^α/¹³C^β referencing errors is small when both ¹³C^α and ¹³C^β shift data are available, and manifests itself mainly as a steeper ¹³C^α/¹³C^β chemical shift gradient when selecting fragments, and increased total energies when rescoring the energies of the Rosetta models.
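The arithmetic of this example can be written out explicitly. The short sketch below uses only the numbers quoted above (secondary shifts of -1 and +2 ppm and a 4 ppm offset); the function name `apply_offset` is hypothetical. It shows that a common referencing offset moves both secondary shifts by the same amount, so that their difference is left unchanged.

```python
def apply_offset(ca, cb, offset):
    """Return (C-alpha, C-beta) secondary shifts after adding a common offset (ppm)."""
    return ca + offset, cb + offset


# Typical beta-sheet secondary shifts quoted in the text (ppm).
ca, cb = -1.0, 2.0
ca_err, cb_err = apply_offset(ca, cb, 4.0)

print(ca_err, cb_err)            # 3.0 6.0
print(ca - cb, ca_err - cb_err)  # -3.0 -3.0
```

Because the difference between the two secondary shifts is insensitive to the offset, it retains its secondary structure information even when the individual values are misreferenced.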

The impact of ¹³C^α/¹³C^β chemical shift referencing errors on CS-Rosetta structure generation was evaluated using the chemical shift assignment datasets IIIc and IIId. When a 1.0 ppm offset was added to δ¹³C^α/^β (dataset IIIc), comparable to the average δ¹³C^α/^β prediction errors (σ in Eq. 1) (Gong et al. 2007; Shen and Bax 2007), the accuracy of the selected fragments slightly decreases (Fig. S9), with a very small adverse impact on subsequent Rosetta structure generation (Figs. S10 and S11). The impact of chemical shift referencing errors appears to be insensitive to the type of fragment selection method used: for MrR16, standard MFR yields slightly better results (Fig. S10); for TM1442, the hybrid method is slightly favorable (Fig. S11).

When the δ¹³C^α/^β offset error is increased to 1.7 ppm (dataset IIId), convergence and accuracy of the resulting structures decrease noticeably (Figs. S10 and S11), but the folds remain essentially correct. However, when the offset error is increased to 2.7 ppm, which corresponds to the approximate difference between δ¹³C^α/^β values referenced to TMS and DSS (Wishart et al. 1995; Markley et al. 1998), fragment selection results are poor and no acceptable structures are obtained with the CS-Rosetta protocol (data not shown).

When the chemical shift referencing error affects only a single type of nucleus, e.g. ¹³C^α or ¹³C', an erroneous bias towards selection of helical or extended fragments can occur, resulting in poorer fragment quality and decreased performance of the CS-Rosetta protocol (results not shown). Even in these cases, ¹⁵N or ¹³C chemical shift referencing errors of up to 1 ppm have very little adverse effect on CS-Rosetta performance.

Chemical shift referencing errors readily can be detected by automated methods (Moseley et al. 2004; Wang et al. 2005). For this purpose, a script has been added to the CS-Rosetta package which applies reference error corrections when the referencing error exceeds the average uncertainty in the database chemical shifts (1.0 ppm for δ¹³C^α/^β and δ¹³C'; 0.3 ppm for δ¹H^α). These referencing corrections are based on the method described by Markley and coworkers (Wang et al. 2005), and correlations between (Δδ¹³C^α-Δδ¹³C^β) and Δδ¹³C^α/^β/Δδ¹³C'/Δδ¹H are shown in Fig. S1.
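The threshold rule stated above can be illustrated with a short sketch. It is not the script distributed with CS-Rosetta: the function name `correct_referencing`, the nucleus labels used as dictionary keys, and the assumption that an offset estimate is already available for each nucleus type are all hypothetical. Only the threshold values (1.0 and 0.3 ppm) come from the text, and the estimation of the offsets themselves (Wang et al. 2005) is not reproduced here.

```python
# Thresholds quoted in the text (ppm); the dictionary keys are hypothetical labels.
THRESHOLDS_PPM = {"CA": 1.0, "CB": 1.0, "C": 1.0, "HA": 0.3}


def correct_referencing(shifts, estimated_offsets, thresholds=THRESHOLDS_PPM):
    """Subtract the estimated offset for a nucleus type only if it exceeds its threshold.

    shifts: dict mapping nucleus label -> list of observed shifts (ppm)
    estimated_offsets: dict mapping nucleus label -> estimated referencing error (ppm)
    """
    corrected = {}
    for nucleus, values in shifts.items():
        offset = estimated_offsets.get(nucleus, 0.0)
        limit = thresholds.get(nucleus)
        if limit is not None and abs(offset) > limit:
            # Offset is larger than the database uncertainty: apply the correction.
            corrected[nucleus] = [v - offset for v in values]
        else:
            # Offset is within the uncertainty (or no threshold is known): leave as is.
            corrected[nucleus] = list(values)
    return corrected
```

Leaving small offsets uncorrected reflects the observation above that errors comparable to the database uncertainty have very little effect on the results.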

A situation similar to the chemical shift referencing problem discussed above can arise when chemical shifts are measured from TROSY spectra (Pervushin et al. 1998), when the displacement between the observed resonance frequency and the true chemical shift (¹J_NH/2 for δ¹⁵N and δ¹H^N) is not taken into account. However, considering that this error is much smaller than the standard error in the predicted database chemical shifts, no adjustment of the chemical shift values is required.

A larger apparent referencing error can result from deuteration effects (Venters et al. 1996; Gardner et al. 1997) on δ¹³C^α (with deuterium isotope shifts of -0.5 to -0.9 ppm) and δ¹³C^β (-0.7 to -1.3 ppm). These isotope effects on the backbone chemical shifts are relatively uniform and mostly smaller than the 1 ppm referencing error discussed above. Although it is beneficial to apply uniform isotope shift corrections of +0.7 and +0.9 ppm to δ¹³C^α and δ¹³C^β values, respectively, the absence of such corrections shows little adverse impact on the performance of CS-Rosetta (data not shown). Nevertheless, a script has been added to the CS-Rosetta package which adjusts the δ¹³C^α and δ¹³C^β chemical shifts by the residue-type-specific values reported by Cavanagh et al. (2007).

Structures from solid-state NMR chemical shifts

The backbone chemical shifts δ¹⁵N, δ¹³C', δ¹³C^α and δ¹³C^β obtained by ssNMR for the proteins GB3 and ubiquitin were used as inputs for the CS-Rosetta structure generation protocols. For GB3, nearly complete ssNMR backbone chemical shift assignments, including 55 δ¹⁵N, 56 δ¹³C', 56 δ¹³C^α and 52 δ¹³C^β values, were taken from Nadaud et al. (2007). For the most part, these chemical shifts closely agree with values observed by solution NMR (Fig. S11). For ubiquitin, the ssNMR backbone chemical shift assignments taken from Igumenova et al. (2004) are about 90% complete, and include 65 δ¹⁵N, 65 δ¹³C', 67 δ¹³C^α and 63 δ¹³C^β values, with no chemical shift assignments for residues 8-11. With the exception of several residues involved in intermolecular contacts, these chemical shifts also agree well with values observed in solution (Igumenova et al. 2004) (Fig. S12).

Except for the ubiquitin target fragments that involve the missing residues 8-11, the quality of fragments selected on the basis of ssNMR shift values is good, with little difference apparent between results from the standard MFR and the hybrid fragment selection method (Fig. 5). As expected based on the evaluations carried out above for regions with missing assignments, the regular MFR method fares poorly when selecting fragments that include residues 8-11, whereas the hybrid method shows no decrease in structural quality for this region.

Importantly, either selection method yields fragments from the ssNMR chemical shifts that suffice for generating converged, high quality all-atom models for both proteins (Fig. 5C,F). When the MFR method is used to select the fragments, the coordinate rms deviations for GB3 between the lowest energy model and the experimental solution NMR structure are 0.71 Å for the backbone atoms (N, C^α and C') and 1.28 Å for all non-hydrogen atoms. For ubiquitin these numbers are 0.69 and 1.22 Å. When the fragments are selected by the hybrid procedure, the coordinate rmsd's are slightly higher: 0.73 and 1.70 Å for backbone and all non-hydrogen GB3 atoms, respectively, and 0.86 and 1.49 Å for ubiquitin.

Considering the generally somewhat lower spectral resolution attainable by ssNMR compared to solution NMR, detailed structural studies of globular proteins by ssNMR have mostly remained restricted to relatively small systems, typically fewer than ~80 residues. Clearly, CS-Rosetta provides a powerful new complementary tool for generating structural models of such proteins once chemical shift assignments have been completed, without requiring the extensive internuclear distance information which sometimes can be difficult to obtain.

Paramagnetic protein structure from chemical shifts

Two small paramagnetic proteins for which chemical shifts are available in the BMRB (Doreleijers et al. 2005) have been used to evaluate the applicability of CS-Rosetta to such systems: calbindin and ferredoxin. The backbone chemical shift assignments of calbindin, chelating a paramagnetic Yb³⁺ ion in its C-terminal metal binding site and Ca²⁺ in the N-terminal site, include 52 δ¹⁵N/δ¹H^N, 43 δ¹³C', 37 δ¹³C^α/δ¹H^α and 33 δ¹³C^β shifts, but no chemical shift assignments for residues 18 to 24 and 47 to 66; the completeness of the backbone chemical shift assignments is ~60% (Barnwal et al. 2008). The backbone chemical shift assignments of ferredoxin include 78 δ¹⁵N/δ¹H^N, 83 δ¹³C', 86 δ¹³C^α/δ¹H^α and 78 δ¹³C^β values, but no assignments for residues 41-50 and 80-82; the completeness of the backbone chemical shift assignments is ~80% (Muller et al. 2002).

With the absence of chemical shift assignments for long segments in each of these two proteins, the standard CS-Rosetta protocol, using MFR fragment selection, fails to converge for both proteins (Fig. S13). However, the hybrid fragment selection procedure performs much better, in particular for those target fragments involving the unassigned residues (Fig. 6A,B,D,E), permitting the structure assembly phase to be successful (Fig. 7). Interestingly, this improved performance does not result from recognition of the relatively common EF-hand and Fe-S metal-binding sites as, for testing purposes, proteins with a PSI-BLAST e-score <0.05 had been removed from the database. Subsequent manual evaluation of the 9-residue fragments covering the regions lacking chemical shifts showed the presence of six 9-residue fragments for calbindin segment 54-62, which were taken from EF-hand containing proteins that had escaped detection by the PSI-BLAST filter.

For both proteins, the Rosetta fragment assembly and relaxation procedure generates a number of good all-atom models, with the lowest energy models having backbone coordinates that differ by less than 2 Å from their respective reference structures when only including residues involved in secondary structure (Fig. 6C,F). Although the standard convergence criterion (the 10 lowest energy structures cluster within 2 Å of the lowest energy structure) is not met for either protein (Fig. S13), both structures are converged when this limit is relaxed to 3.3 Å.
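The convergence criterion quoted above can be expressed compactly. The sketch below is one possible reading of that criterion and not the CS-Rosetta code: it assumes that each model is supplied as a pair of its energy and its precomputed rmsd to the lowest energy model, that the lowest energy model itself counts as one of the ten, and that "within" means less than or equal to the cutoff. The function name `is_converged` is hypothetical.

```python
def is_converged(models, cutoff=2.0, n_lowest=10):
    """Check whether the n_lowest lowest-energy models lie within cutoff (in Å).

    models: list of (energy, rmsd_to_lowest_energy_model) tuples.
    """
    if len(models) < n_lowest:
        return False
    # Sort by energy and keep the n_lowest best-scoring models.
    lowest = sorted(models, key=lambda m: m[0])[:n_lowest]
    return all(rmsd <= cutoff for _, rmsd in lowest)
```

Calling the same function with `cutoff=3.3` corresponds to the relaxed limit used for the two paramagnetic proteins.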

For calbindin, the coordinate rmsd's between the lowest energy all-atom model and the 1.6-Å X-ray structure of calbindin D9K (Svensson et al. 1992) are 1.5 and 2.1 Å for the backbone atoms (N, C^α and C') and for all heavy atoms involved in secondary structure, respectively. The Ca²⁺-binding loops of both metal binding sites are remarkably well formed in the CS-Rosetta structures (Fig. 7A), even with the second metal binding site lacking all of its chemical shift assignments and the absence of any restraints on metal chelation for both metal binding sites. For the first Ca²⁺-binding loop, a pseudo-EF-hand, the four backbone carbonyl groups are properly positioned and point towards the location where Ca²⁺ is found in the X-ray structure. Even the bidentate sidechain ligating group of Glu²⁷ adopts a conformation suitable for metal chelation. For the second site, a regular EF-hand, the backbone carbonyl of Glu⁶⁰ and the sidechains of Asp⁵⁴ and Glu⁶⁵ are well positioned for metal binding, but the sidechains of Asn⁵⁶ and Asp⁵⁸ point away from the position where the metal ion is observed in the X-ray structure.

For the secondary structure elements of ferredoxin, the lowest energy Rosetta model deviates from the experimental NMR structure obtained for the same protein by 2.06 Å for the backbone and by 3.54 Å for all non-hydrogen atoms. Two of the four Cys sidechains that ligate the [2Fe-2S] cluster are in close proximity, even though the loop conformations differ substantially from the experimentally determined structure (Fig. 7B).

Concluding remarks

Although previous reports have clearly demonstrated the potential of using chemical shifts to determine good quality all-atom structures for small proteins (Cavalli et al. 2007; Shen et al. 2008), these studies were based on relatively ideal cases where complete or nearly complete backbone assignments were available, in the absence of assignment errors. Our present study demonstrates that the CS-Rosetta procedure and its new variant, which uses a hybrid fragment selection procedure, are remarkably tolerant to such incompleteness and errors. Clearly, a study such as the present one, which evaluates the impact of missing or erroneous assignments, is never complete. We simply have evaluated the impact for two proteins, and have made an attempt to evaluate representative cases of missing assignments. Both proteins chosen for the current study, MrR16 and TM1442, yielded good (albeit not exceptional) results when originally studied with complete data sets, and these systems therefore are likely to be more robust to incompleteness or assignment errors than proteins which only yield borderline convergence to begin with.

The CS-Rosetta protocol uses the chemical shift information at two stages: first for fragment selection, and then again when evaluating the final full-atom models. There are two primary reasons for the improved performance of the CS-Rosetta protocol over a conceptually similar, earlier attempt to integrate chemical shift information into Rosetta (Bowers et al. 2000). First, the quality of fragments selected has improved considerably by the use of SPARTA to "assign" better chemical shifts to a structural database. SPARTA not only uses a more advanced algorithm to assign these chemical shifts, but also benefits from a considerable expansion of entries in the BMRB for which complete chemical shift and high resolution structural information is available (Doreleijers et al. 2005). Second, a number of improvements in the Rosetta Monte-Carlo assembly process have been made in recent years, most notably the incorporation of explicit all atom refinement with a physically realistic force field (Das and Baker 2008).

The adverse impact of errors and incompleteness on the CS-Rosetta protocol results primarily from decreased quality of the fragment library, and has relatively little impact on the rescoring of the final full-atom models. The hybrid CS-Rosetta protocol first limits the selection of fragments to a ~0.1% fraction of the total structural database on the basis of the standard Rosetta selection mechanism. In the next step, it uses MFR to select the 200 fragments from this ensemble that agree best with experimental chemical shifts. This reduces the impact of chemical shift errors because only fragments compatible with standard Rosetta criteria are available for selection. Moreover, in the absence of any chemical shift information, the Rosetta pre-selection of the top 0.1% fragments yields better results than the less sophisticated MFR procedure, which had been designed primarily to find fragments with similar chemical shifts and/or RDCs (Delaglio et al. 2000; Kontaxis et al. 2005). In the absence of assignment errors or missing assignments, the initial Rosetta pre-selection used in the hybrid procedure is not beneficial and actually results in a small decrease in performance. On the other hand, for cases where significant fractions of assignments are missing or ambiguous, the hybrid procedure is considerably more robust.

For all evaluations, including those of the two paramagnetic proteins, homologous proteins were first eliminated from the structural database. In practice, this is clearly disadvantageous as Rosetta no longer can take advantage of standard structural elements, such as Ca²⁺-ligating EF-hand sequences, present in the database. Indeed 30 proteins containing a total of 64 EF-hands were removed prior to fragment searching. Similarly, proteins containing the relatively common Fe₂S₂ cluster were removed prior to searching for fragments for ferredoxin assembly. While for calbindin the CS-Rosetta protocol resulted in remarkably good backbone structures for its metal binding sites, even in the absence of chemical shift information, loop conformations in ferredoxin were poor. Nevertheless, using the hybrid protocol, CS-Rosetta was able to generate the remainder of the ferredoxin structure quite well, suggesting that even for these challenging systems the method will be quite useful.

For the two proteins for which a structure was generated from solid state NMR chemical shifts, lacking ¹H chemical shifts, the standard MFR-based protocol and the hybrid CS-Rosetta method performed comparably well. For both proteins, the final structures obtained from these smaller input data sets approach the quality of structures obtained from solution NMR chemical shifts, indicating that CS-Rosetta may be a particularly useful complement when working with samples in the solid state.

Although CS-Rosetta considerably reduces the amount of spectral data collection time required for structure generation compared to conventional procedures, the amount of computational time required typically is very high. Although for simple systems such as GB3, generation of less than one hundred structures may suffice to reach convergence (Shen et al. 2008), for many other proteins as many as 10,000 models may be required. Rosetta assembly and minimization of each model takes 5-10 minutes on a single CPU, and in practice use of a large cluster or a central server such as BOINC is required to take advantage of this technology.
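The scale of this computational cost follows directly from the numbers quoted above. The sketch below simply multiplies them out; the function names and the 500-core cluster used in the example are hypothetical, and perfect parallel scaling is assumed.

```python
def cpu_hours(n_models, minutes_per_model):
    """Total single-CPU time, in hours, to generate n_models models."""
    return n_models * minutes_per_model / 60.0


def wall_clock_hours(n_models, minutes_per_model, n_cores):
    """Wall-clock time on n_cores, assuming ideal parallel scaling (an assumption)."""
    return cpu_hours(n_models, minutes_per_model) / n_cores


low = cpu_hours(10_000, 5)
high = cpu_hours(10_000, 10)
print(f"{low:.0f}-{high:.0f} CPU hours")  # 833-1667 CPU hours

# Hypothetical 500-core cluster, for illustration only.
print(f"{wall_clock_hours(10_000, 10, 500):.1f} h")  # 3.3 h
```

On a single CPU, 10,000 models would therefore take roughly one to two months, which is why distributed resources are needed in practice.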

We also note that the CS23D program (Wishart et al. 2008) performs very well for the test datasets used in our study (Supplementary Material). The major strength of CS23D is that it takes optimal advantage of sequence homologues present in the database during fragment selection. Such homologues were present in the structural database for all six proteins evaluated in our work (see Supplementary Material Table S2), but were excluded from the database for CS-Rosetta testing. On the other hand, based on a limited number of tests, techniques such as CS-Rosetta and CHESHIRE are believed to be superior for proteins that lack significant homology to previously solved structures.

Software availability

The CS-Rosetta software package with its newly implemented hybrid fragment selection module can be downloaded from https://spin.niddk.nih.gov/bax-apps/.

Acknowledgments

This work was funded by the Intramural Research Program of the NIDDK, NIH, and by the Intramural AIDS-Targeted Antiviral Program of the Office of the Director, NIH; the NIGMS, NIH, and the Howard Hughes Medical Institute (to D.B.). We also thank Rosetta@home participants and the BOINC project for contributing computing power.

Supplementary Material Available: A brief discussion of CS23D results for proteins discussed in this study; multiple figures detailing the quality of the selected fragments and the CS-Rosetta results for various combinations of input chemical shift data.

Figure 1. Flow chart of CS-Rosetta structure generation protocol. In the hybrid fragment selection procedure, shown in red, step 1 selects 200 fragments from an initial cohort of 2000 fragments which has been extracted from the structural database by standard Rosetta methods. In the standard CS-Rosetta method, step 1 takes its fragments directly from the 2,200,000 fragments present in the structural database.

Figure 2. CS-Rosetta structure generation for TM1442 with missing chemical shift assignments for certain types of nuclei. (A, B) Plots of accuracy of fragments selected using the MFR (blue) and hybrid (red) methods with the chemical shift inputs for δ¹⁵N, δ¹H^N and δ¹³C^α (as contained in the dataset Ih). Quality of three-residue (A) and nine-residue (B) fragments is represented by the average (bold lines) and lowest (lines with dots) rmsd of 200 selected fragments relative to the experimental coordinates of the corresponding TM1442 segment. (C, D) Plots of Rosetta all-atom energy, rescored by using the input chemical shifts (as contained in the dataset Ih), versus C^α rmsd relative to the experimental TM1442 structure, for CS-Rosetta models obtained using MFR (C) and hybrid (D) fragment selection methods. (E-H) CS-Rosetta fragment selections and structure generations for TM1442 using only δ¹³C^α and δ¹³C^β (as contained in the dataset Ii).

Figure 3. CS-Rosetta structure for TM1442 with missing chemical shifts. (A, B) Plots of accuracy of fragment candidates selected using the MFR (blue) and hybrid (red) methods using chemical shift values δ¹⁵N, δ¹H^N, δ¹³C^α, δ¹³C^β, δ¹³C' and δ¹H^α for residues 1-20, 30-51 and 60-120 (as contained in the dataset IIe). For each three-residue (A) and nine-residue (B) segment of TM1442, 200 fragments were selected. Average (bold lines) and lowest (lines with dots) rmsd of these fragments relative to the experimental coordinates of the corresponding TM1442 segment are plotted with respect to the position of the first segment residue in the TM1442 sequence. The regions corresponding to the “unassigned” residues are shaded; the secondary structure elements are displayed at the top. (C, D) Plots of Rosetta all atom energy, rescored by using the input chemical shifts (as contained in dataset IIe), versus C^α rmsd relative to the experimental TM1442 structure, for CS-Rosetta models obtained using MFR (C) and hybrid (D) fragment selection methods.

Figure 4. CS-Rosetta structure generation of TM1442 with chemical shift errors. (A, B) Plots of accuracy of fragments selected using the MFR (blue) and hybrid (red) methods, with the inputs swapped for the δ¹⁵N, δ¹H^N, δ¹³C^α, δ¹³C^β, δ¹³C' and δ¹H^α assignments of dipeptides Ser⁵²-Ser⁵³ and Ser⁸²-Ser⁸³ (as contained in the dataset IIIb). For each three-residue (A) and nine-residue (B) segment of TM1442, 200 fragments were selected. Average (bold lines) and lowest (lines with dots) rmsd of these fragments relative to the experimental coordinates of the corresponding TM1442 segment are plotted with respect to the position of the first segment residue in the TM1442 sequence. The regions corresponding to the “misassigned” residues are shaded; secondary structure elements are displayed at the top. (C, D) Plots of Rosetta all atom energy, rescored by using the input chemical shifts (as contained in the dataset IIIb), versus C^α rmsd relative to the experimental TM1442 structure, for CS-Rosetta models obtained using MFR (C) and hybrid (D) fragment selection methods.

Figure 5. CS-Rosetta fragment selection and structure generation for GB3 (A-C) and ubiquitin (D-F), using chemical shift assignments from solid-state NMR. (A,D) Plots of the lowest (upper panel) and average (lower panel) backbone coordinate rmsds (N, C^α and C') between query segment and two hundred 3-residue fragments, selected using the MFR (blue) and hybrid methods (red), as a function of starting position in the sequence. (B,E) Same as (A,D), but for 9-residue fragments. (C,F) Plots of Rosetta all atom energy, rescored by using the experimental ssNMR chemical shifts, versus C^α rmsd relative to the experimental NMR structures of GB3 and ubiquitin for the CS-Rosetta all-atom models obtained using MFR-selected (upper panel, blue dots) and hybrid-selected (lower panel, red dots) fragments. The solid black lines in (C,F) represent the normalized number of structures found at a given C^α-rmsd.

Figure 6. CS-Rosetta structure generation for paramagnetic calbindin (A-C) and ferredoxin (D-F). (A,D) Plots of the lowest (upper panel) and average (lower panel) backbone coordinate rmsds (N, C^α and C') between query segment and two hundred 3-residue fragment candidates, selected using the MFR (blue) and hybrid methods (red), as a function of starting position in the sequence. The regions lacking chemical shift assignments are shaded. (B,E) Same as (A,D), but for 9-residue fragments. (C,F) Plots of Rosetta all-atom energy, rescored by the experimental chemical shifts, versus C^α rmsd of final all-atom models (including only residues located in elements of secondary structure) relative to the corresponding X-ray (calbindin) and NMR (ferredoxin) structure. Only results from CS-Rosetta all-atom models obtained by the hybrid fragment selection procedure are shown; when using fragments from the standard MFR method, Rosetta fails to converge. Residues included in the backbone rmsd calculation are 3-14, 25-40, 46-53 and 63-74 for calbindin, and 4-11, 15-22, 27-34, 54-56, 71-75 and 91-93 for ferredoxin.

Figure 7. Comparison of experimental (blue) and lowest energy CS-Rosetta (red) structures for paramagnetic calbindin (A) and ferredoxin (B). Superposition is optimized for residues in secondary structure, defined in the caption to Fig. 6. (A) The sidechains of residues involved in metal binding, including their metal-ligating oxygen atoms, as well as the X-ray positions of the Ca²⁺ ions (cyan) are shown. Metal-ligating residues (atoms) include Ala¹⁴ (O), Glu¹⁷ (O), Asp¹⁹ (O), Gln²² (O), Glu²⁷ (O^ε¹/O^ε²), Asp⁵⁴ (O^δ¹), Asn⁵⁶ (O), Asp⁵⁸ (O^δ¹), Glu⁶⁰ (O) and Glu⁶⁵ (O^ε¹/O^ε²). (B) Backbone ribbon representation of the lowest-energy CS-Rosetta structure (red) superimposed on the experimental NMR structure (blue) for ferredoxin, with superposition optimized for the residues in secondary structure (see caption to Fig. 6). The sidechain S atoms of Cys⁴², Cys⁴⁷, Cys⁵⁰ and Cys⁸², which coordinate the [2Fe-2S] cluster, are marked as solid spheres. Figures were made using Molmol (Koradi et al. 1996).

References

Agarwal V, Diehl A, Skrynnikov N and Reif B (2006) High resolution H-1 detected H-1,C-13 correlation spectra in MAS solid-state NMR using deuterated proteins with selective H-1,H-2 isotopic labeling of methyl groups. J. Am. Chem. Soc. 128: 12620-12621

Altschul SF, Madden TL, Schaffer AA, Zhang JH, Zhang Z, Miller W and Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25: 3389-3402

Ando I, Kameda T, Asakawa N, Kuroki S and Kurosu H (1998) Structure of peptides and polypeptides in the solid state as elucidated by NMR chemical shift. J. Mol. Struct. 441: 213-230

Andreini C, Bertini I and Rosato A (2004) A hint to search for metalloproteins in gene banks. Bioinformatics 20: 1373-1380

Asakura T, Demura M, Date T, Miyashita N, Ogawa K and Williamson MP (1997) NMR study of silk I structure of Bombyx mori silk fibroin with N-15- and C-13-NMR chemical shift contour plots. Biopolymers 41: 193-203

Barnwal RP, Rout AK, Chary KVR and Atreya HS (2008) Rapid Measurement of Pseudocontact Shifts in Paramagnetic Proteins by GFT NMR Spectroscopy. Open Magn. Reson. J. 1: 16-28

Bermel W, Bertini I, Felli IC, Piccioli M and Pierattelli R (2006) C-13-detected protonless NMR spectroscopy of proteins in solution. Prog. Nucl. Magn. Reson. Spectrosc. 48: 25-45

Bertini I, Luchinat C, Parigi G and Pierattelli R (2005) NMR spectroscopy of paramagnetic metalloproteins. Chembiochem 6: 1536-1549

Bowers PM, Strauss CEM and Baker D (2000) De novo protein structure determination using sparse NMR data. J. Biomol. NMR 18: 311-318

Case DA (1995) Calibration of ring-current effects in proteins and nucleic acids. J. Biomol. NMR 6: 341-346

Castellani F, van Rossum B, Diehl A, Schubert M, Rehbein K and Oschkinat H (2002) Structure of a protein determined by solid-state magic-angle-spinning NMR spectroscopy. Nature 420: 98-102

Castellani F, van Rossum BJ, Diehl A, Rehbein K and Oschkinat H (2003) Determination of solid-state NMR structures of proteins by means of three-dimensional N-15-C-13-C-13 dipolar correlation spectroscopy and chemical shift analysis. Biochemistry 42: 11476-11483

Cavalli A, Salvatella X, Dobson CM and Vendruscolo M (2007) Protein structure determination from NMR chemical shifts. Proc. Natl. Acad. Sci. U. S. A. 104: 9615-9620

Chevelkov V, Rehbein K, Diehl A and Reif B (2006) Ultrahigh resolution in proton solid-state NMR spectroscopy at high levels of deuteration. Angewandte Chemie-International Edition 45: 3878-3881

Cornilescu G, Delaglio F and Bax A (1999) Protein backbone angle restraints from searching a database for chemical shift and sequence homology. J. Biomol. NMR 13: 289-302

Cornilescu G, Marquardt JL, Ottiger M and Bax A (1998) Validation of protein structure from anisotropic carbonyl chemical shifts in a dilute liquid crystalline phase. J. Am. Chem. Soc. 120: 6836-6837

Das R and Baker D (2008) Macromolecular modeling with Rosetta. Annu. Rev. Biochem. 77: 363-382

Delaglio F, Kontaxis G and Bax A (2000) Protein structure determination using Molecular Fragment Replacement and NMR dipolar couplings. J. Am. Chem. Soc. 122: 2142-2143

Doreleijers JF, Nederveen AJ, Vranken W, Lin JD, Bonvin A, Kaptein R, Markley JL and Ulrich EL (2005) BioMagResBank databases DOCR and FRED containing converted and filtered sets of experimental NMR restraints and coordinates from over 500 protein PDB structures. J. Biomol. NMR 32: 1-12

Gardner KH, Rosen MK and Kay LE (1997) Global folds of highly deuterated, methyl-protonated proteins by multidimensional NMR. Biochemistry 36: 1389-1401

Gong HP, Shen Y and Rose GD (2007) Building native protein conformation from NMR backbone chemical shifts using Monte Carlo fragment assembly. Protein Sci. 16: 1515-1521

Gryk MR and Hoch JC (2008) Local knowledge helps determine protein structures. Proc. Natl. Acad. Sci. U. S. A. 105: 4533-4534

Haigh CW and Mallion RB (1979) Ring current theories in nuclear magnetic resonance. Prog. Nucl. Magn. Reson. Spectrosc. 13: 303-344

Igumenova TI, McDermott AE, Zilm KW, Martin RW, Paulson EK and Wand AJ (2004) Assignments of carbon NMR resonances for microcrystalline ubiquitin. J. Am. Chem. Soc. 126: 6720-6727

Ikura M, Kay LE and Bax A (1990) A novel approach for sequential assignment of ¹H, ¹³C, and ¹⁵N spectra of larger proteins: heteronuclear triple-resonance three-dimensional NMR spectroscopy. Application to calmodulin. Biochemistry 29: 4659-4667

Kontaxis G, Delaglio F and Bax A (2005) Molecular fragment replacement approach to protein structure determination by chemical shift and dipolar homology database mining. Meth. Enzymol. 394: 42-78

Koradi R, Billeter M and Wuthrich K (1996) MOLMOL: a program for display and analysis of macromolecular structures. J. Mol. Graph. 14: 51-55

Loquet A, Bardiaux B, Gardiennet C, Blanchet C, Baldus M, Nilges M, Malliavin T and Boeckmann A (2008) 3D structure determination of the Crh protein from highly ambiguous solid-state NMR restraints. J. Am. Chem. Soc. 130: 3579-3589

Manolikas T, Herrmann T and Meier BH (2008) Protein structure determination from C-13 spin-diffusion solid-state NMR spectroscopy. J. Am. Chem. Soc. 130: 3959-3966

Markley JL, Bax A, Arata Y, Hilbers CW, Kaptein R, Sykes BD, Wright PE and Wuthrich K (1998) IUPAC-IUBMB-IUPAB inter-union task group on the standardization of data bases of protein and nucleic acid structures determined by NMR spectroscopy (vol 70, pg 117, 1998). Pure Appl. Chem. 70: AR1-AR1

Montelione GT and Wagner G (1990) Conformation-Independent Sequential NMR Connections in Isotope-Enriched Polypeptides By H-1-C-13-N-15 Triple-Resonance Experiments. J. Magn. Reson. 87: 183-188

Moseley HNB, Sahota G and Montelione GT (2004) Assignment validation software suite for the evaluation and presentation of protein resonance assignment data. J. Biomol. NMR 28: 341-355

Muller J, Lugovskoy AA, Wagner G and Lippard SJ (2002) NMR structure of the [2Fe-2S] ferredoxin domain from soluble methane monooxygenase reductase and interaction with its hydroxylase. Biochemistry 41: 42-51

Nadaud PS, Helmus JJ and Jaroniec CP (2007) 13C and 15N chemical shift assignments and secondary structure of the B3 immunoglobulin-binding domain of streptococcal protein G by magic-angle spinning solid-state NMR spectroscopy. Biomol. NMR Assignments 1: 117-120

Neal S, Berjanskii M, Zhang HY and Wishart DS (2006) Accurate prediction of protein torsion angles using chemical shifts and sequence homology. Magn. Reson. Chem. 44: S158-S167

Neal S, Nip AM, Zhang HY and Wishart DS (2003) Rapid and accurate calculation of protein H-1, C-13 and N-15 chemical shifts. J. Biomol. NMR 26: 215-240

Pervushin K, Riek R, Wider G and Wuthrich K (1998) Transverse Relaxation-Optimized Spectroscopy (TROSY) for NMR Studies of Aromatic Spin Systems in 13C-Labeled Proteins. J. Am. Chem. Soc. 120: 6394-6400

Rohl CA, Strauss CEM, Misura KMS and Baker D (2004) Protein structure prediction using rosetta. Meth. Enzymol. 383: 66-93

Saito H (1986) Conformation-dependent C13 chemical shifts - A new means of conformational characterization as obtained by high resolution solid state C13 NMR. Magn. Reson. Chem. 24: 835-852

Shen Y and Bax A (2007) Protein backbone chemical shifts predicted from searching a database for torsion angle and sequence homology. J. Biomol. NMR 38: 289-302

Shen Y, Lange O, Delaglio F, Rossi P, Aramini JM, Liu GH, Eletsky A, Wu YB, Singarapu KK, Lemak A, Ignatchenko A, Arrowsmith CH, Szyperski T, Montelione GT, Baker D and Bax A (2008) Consistent blind protein structure generation from NMR chemical shift data. Proc. Natl. Acad. Sci. U. S. A. 105: 4685-4690

Siemer AB, Ritter C, Ernst M, Riek R and Meier BH (2005) High-resolution solid-state NMR spectroscopy of the prion protein HET-s in its amyloid conformation. Angew. Chem. Int. Ed. 44: 2441-2444

Spera S and Bax A (1991) Empirical correlation between protein backbone conformation and Cα and Cβ 13C nuclear magnetic resonance chemical shifts. J. Am. Chem. Soc. 113: 5490-5492

Svensson LA, Thulin E and Forsen S (1992) Proline cis-trans isomers in calbindin D9K observed by X-ray crystallography. J. Mol. Biol. 223: 601-606

Tycko R (1996) Prospects for resonance assignments in multidimensional solid-state NMR spectra of uniformly labeled proteins. J. Biomol. NMR 8: 239-251

Ulmer TS, Ramirez BE, Delaglio F and Bax A (2003) Evaluation of backbone proton positions and dynamics in a small protein by liquid crystal NMR spectroscopy. J. Am. Chem. Soc. 125: 9179-9191

Venters RA, Farmer BT, Fierke CA and Spicer LD (1996) Characterizing the use of perdeuteration in NMR studies of large proteins C-13, N-15 and H-1 assignments of human carbonic anhydrase II. J. Mol. Biol. 264: 1101-1116

Wagner G, Pardi A and Wuthrich K (1983) Hydrogen-Bond Length And H-1-Nmr Chemical-Shifts In Proteins. J. Am. Chem. Soc. 105: 5948-5949

Wang LY, Eghbalnia HR, Bahrami A and Markley JL (2005) Linear analysis of carbon-13 chemical shift differences and its application to the detection and correction of errors in referencing and spin system identifications. J. Biomol. NMR 32: 13-22

Williamson MP and Asakura T (1993) Empirical Comparisons Of Models For Chemical-Shift Calculation In Proteins. J. Magn. Reson. B 101: 63-71

Williamson MP, Kikuchi J and Asakura T (1995) Application of H1 NMR chemical shifts to measure the quality of protein structures. J. Mol. Biol. 247: 541-546

Wishart DS, Arndt D, Berjanskii M, Tang P, Zhou J and Lin G (2008) CS23D: a web server for rapid protein structure generation using NMR chemical shifts and sequence data. Nucleic Acids Res. 36: 496-502

Wishart DS, Bigam CG, Holm A, Hodges RS and Sykes BD (1995) 1H, 13C and 15N random coil NMR chemical shifts of the common amino acids. I. Investigations of nearest-neighbor effects. J. Biomol. NMR 5: 67-81

Wishart DS, Sykes BD and Richards FM (1991) Relationship between nuclear magnetic resonance chemical shift and protein secondary structure. J. Mol. Biol. 222: 311-333

Zech SG, Wand AJ and McDermott AE (2005) Protein structure determination by high-resolution solid-state NMR spectroscopy: Application to microcrystalline ubiquitin. J. Am. Chem. Soc. 127: 8618-8626

A subset of the structure calculations described in the main text was also carried out using the CS23D program. As can be seen from the results presented in Table S1, CS23D performs very well for the test datasets used in our study. The major strength of CS23D is that it takes optimal advantage of sequence homologues present in the database during fragment selection. Such homologues were present in the structural database for all six proteins studied in our work (see Table S2), but were excluded from the database for the CS-Rosetta testing. The current implementation of CS23D allows exclusion only of the model(s) with an "exact matching structure", and the performance of the CS23D program in the absence of homologous models therefore could not be evaluated for the proteins studied. We note that, based on results described by Wishart (http://busby1.cs.ualberta.ca/CS23D/documentation.html), the success rate of CS23D is considerably lower for proteins that have no homologues in its NR PDB database.

Figure S1. Correlation plots between protein backbone secondary chemical shifts. The experimental chemical shift data chosen from the SPARTA database contain 21,338 δ¹³C^α/^β and 21,338 ¹H^α values. The Δδ¹³C^α, Δδ¹³C^β and Δδ¹H^α values are plotted against Δδ¹³C^α - Δδ¹³C^β; best-fit lines were calculated and labeled separately for positive and negative Δδ¹³C^α - Δδ¹³C^β, and are plotted in red.
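The separate fits for positive and negative values of the C^α/C^β secondary shift difference can be illustrated with a minimal sketch. It assumes that a secondary shift is the observed shift minus a random coil value and that each branch is fitted by ordinary least squares with a free intercept; the function names (`secondary_shift`, `fit_line`, `fit_by_sign`) are hypothetical and do not correspond to code used in this work.

```python
from statistics import mean


def secondary_shift(observed, random_coil):
    """Secondary chemical shift: observed shift minus random coil shift (ppm)."""
    return observed - random_coil


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def fit_by_sign(diff, sec_shift):
    """Fit sec_shift against diff separately for diff > 0 and diff < 0.

    diff      : per-residue values of the C-alpha minus C-beta secondary shift
    sec_shift : per-residue secondary shift of the nucleus being plotted
    Assumption: each branch contains at least two distinct diff values.
    """
    pos = [(x, y) for x, y in zip(diff, sec_shift) if x > 0]
    neg = [(x, y) for x, y in zip(diff, sec_shift) if x < 0]
    return {
        "positive": fit_line(*zip(*pos)),
        "negative": fit_line(*zip(*neg)),
    }
```

Residues with a difference of exactly zero are left out of both branches in this sketch; how such points were handled in the actual fits is not stated in the caption.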

Figure S2. Fragment selections for proteins with missing chemical shifts of certain nucleus types. For MrR16 (A-K and A′-K′) and TM1442 (A′′-K′′ and A′′′-K′′′), 200 fragment candidates were selected using the MFR and the hybrid fragment selection methods, respectively, for each overlapping segment in the proteins. (A-K) Plots of the lowest (lines with dots) and average (bold lines) backbone coordinate rmsds (N, C^α and C′) between the query segment and 200 3-residue fragment candidates, selected using the MFR (blue), the hybrid method (red), or the standard Rosetta method (black) with the inputs of the simulated chemical shift assignment datasets Ia-Ik as listed in Table 1, as a function of starting position in the sequence of MrR16. (A′-K′) Same as (A-K) but for the 9-residue fragment candidates. (A′′-K′′ and A′′′-K′′′) Same as (A-K and A′-K′) but for the fragment candidates of protein TM1442.
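The two curves shown in each panel of this type (lowest and average rmsd over the fragment candidates at each starting position) amount to a simple per-position summary. The sketch below assumes the backbone rmsd of every candidate to the query segment has already been computed; the input layout and the name `summarize_fragments` are hypothetical.

```python
from statistics import mean


def summarize_fragments(rmsds_by_position):
    """Return (start, lowest_rmsd, average_rmsd) per segment start position.

    rmsds_by_position: dict mapping a starting residue number to the list of
    backbone rmsds (in Angstrom) of its fragment candidates, e.g. 200 values.
    """
    return [
        (start, min(rmsds), mean(rmsds))
        for start, rmsds in sorted(rmsds_by_position.items())
    ]
```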

Figure S3. CS-Rosetta structure generation of protein MrR16 with missing chemical shifts of certain nucleus types, using either the MFR fragment selection (blue) or hybrid fragment selection (red) method. (A-J) Plots of Rosetta all-atom energy versus C^α rmsd relative to the experimental MrR16 structure for the CS-Rosetta models generated by using the MFR-selected fragment candidates with the inputs of the simulated chemical shift assignment datasets Ia-Ij as listed in Table 1. The (normalized) number of structures found for a given C^α rmsd is plotted at the bottom of each panel. (A′-J′) Plots of Rosetta all-atom energy, rescored by using the input chemical shifts (as contained in the datasets Ia-Ij), versus C^α rmsd relative to the experimental MrR16 structure for the CS-Rosetta models generated by using the MFR fragment selection method. (A′′-J′′ and A′′′-J′′′) Same as (A-J and A′-J′) but for the CS-Rosetta all-atom models generated using the hybrid fragment selection method.

Figure S4. CS-Rosetta structure generation of protein TM1442 with missing chemical shifts of certain types of nuclei. (A-J) Plots of Rosetta all-atom energy versus C^α rmsd relative to the experimental TM1442 structure for the CS-Rosetta models obtained by using the MFR-selected fragment candidates with the inputs of the simulated chemical shift assignment datasets Ia-Ij as listed in Table 1. The (normalized) number of structures found for a given C^α rmsd is plotted at the bottom of each panel. (A′-J′) Plots of Rosetta all-atom energy, rescored by using the input chemical shifts (as contained in the datasets Ia-Ij), versus C^α rmsd relative to the experimental TM1442 structure for the CS-Rosetta models generated by using the MFR fragment selection method. (A′′-J′′ and A′′′-J′′′) Same as (A-J and A′-J′) but for the CS-Rosetta all-atom models generated using the hybrid fragment selection method.

Figure S5. Reference Rosetta structure generation for two test proteins, in the absence of any chemical shift information. (A,B) Plots of Rosetta empirical energy versus C^α rmsd relative to the experimental NMR structure for MrR16 (A) and TM1442 (B) for 10,000 Rosetta all-atom models. (C,D) Plots of Rosetta empirical energy versus C^α rmsd relative to the model with the lowest Rosetta energy (shown as a bold dot on the vertical axis) for the MrR16 (C) and TM1442 (D) models. For both proteins, the lowest-energy Rosetta folds are roughly correct, but the MrR16 results do not meet the convergence criteria, and the TM1442 results meet only the more relaxed convergence criteria (10 lowest-energy models within 4 Å C^α rmsd from the lowest-energy model).
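The relaxed convergence test quoted in this caption can be written as a short check. The sketch below is illustrative only: it assumes each model is given as an (energy, C^α rmsd to the lowest-energy model) pair, that the lowest-energy model itself (rmsd 0) counts among the ten, and that "within" means less than or equal to the cutoff. The name `is_converged` and its parameters are hypothetical.

```python
def is_converged(models, n_lowest=10, rmsd_cutoff=4.0):
    """Check whether the n_lowest lowest-energy models cluster together.

    models      : list of (energy, rmsd_to_lowest_energy_model) tuples,
                  with the rmsd in Angstrom
    n_lowest    : number of lowest-energy models to inspect (10 in the caption)
    rmsd_cutoff : maximum allowed rmsd in Angstrom (4.0 for the relaxed test)
    """
    if len(models) < n_lowest:
        return False  # not enough models to apply the criterion
    ranked = sorted(models, key=lambda m: m[0])  # lowest energy first
    return all(rmsd <= rmsd_cutoff for _, rmsd in ranked[:n_lowest])
```

A stricter criterion can be tested by passing a smaller `rmsd_cutoff`.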

Figure S6. Fragment selections for proteins with missing chemical shift assignments of certain residues. For MrR16 (A-E and A′-E′) and TM1442 (A′′-E′′ and A′′′-E′′′), 200 fragment candidates were selected using the MFR and the hybrid fragment selection methods, respectively, for each overlapping segment in the proteins. (A-E) Plots of the lowest (lines with dots) and average (bold lines) backbone coordinate rmsds (N, C^α and C′) between the query segment and 200 3-residue fragment candidates, selected using the MFR (blue) and the hybrid (red) methods with the inputs of the simulated chemical shift assignment datasets IIa-IIe (see Method), as a function of starting position in the sequence of MrR16. The regions corresponding to the “unassigned” residues are shaded; the secondary structure elements are displayed at the top of each column. (A′-E′) Same as (A-E) but for the 9-residue fragment candidates. (A′′-E′′ and A′′′-E′′′) Same as (A-E and A′-E′) but for the fragment candidates of protein TM1442.

Figure S7. CS-Rosetta structure generation of MrR16 with missing chemical shifts of certain residues. (A-E) Plots of Rosetta all-atom energy versus C^α rmsd relative to the experimental MrR16 structure for the CS-Rosetta models obtained by using the MFR-selected fragment candidates with the inputs of the simulated chemical shift assignment datasets IIa-IIe (see Method). The (normalized) number of structures found for a given C^α rmsd is plotted at the bottom of each panel. (A′-E′) Plots of Rosetta all-atom energy, rescored by using the input chemical shifts (as contained in the datasets IIa-IIe), versus C^α rmsd relative to the experimental MrR16 structure. (A′′-E′′ and A′′′-E′′′) Same as (A-E and A′-E′) but for the CS-Rosetta models generated using the hybrid fragment selection method.

Figure S8. CS-Rosetta structure generation of protein TM1442 with missing chemical shifts of certain residues. (A-E) Plots of Rosetta all-atom energy versus C^α rmsd relative to the experimental TM1442 structure for CS-Rosetta models obtained by using the MFR-selected fragment candidates with the inputs of the simulated chemical shift assignment datasets IIa-IIe (see Method). The (normalized) number of structures found for a given C^α rmsd is plotted at the bottom of each panel. (A′-E′) Plots of Rosetta all-atom energy, rescored by using the input chemical shifts (as contained in the datasets IIa-IIe), versus C^α rmsd relative to the experimental TM1442 structure. (A′′-E′′ and A′′′-E′′′) Same as (A-E and A′-E′) but for the CS-Rosetta models generated using the hybrid fragment selection method.

Figure S9. Fragment selections for proteins with chemical shift errors. For proteins MrR16 (A-D and A′-D′) and TM1442 (A′′-D′′ and A′′′-D′′′), 200 fragment candidates were selected using the MFR and the hybrid fragment selection methods, respectively, for each overlapping segment in the proteins. (A-D) Plots of the lowest (lines with dots) and average (bold lines) backbone coordinate rmsds (N, C^α and C′) between the query segment and 200 3-residue fragment candidates, selected using the MFR (blue) and the hybrid (red) methods with the inputs of the simulated chemical shift assignment datasets IIIa-IIId (see Method), as a function of starting position in the sequence of MrR16. The regions corresponding to the “misassigned” residues are shaded; the secondary structure elements are displayed at the top of each column. (A′-D′) Same as (A-D) but for the 9-residue fragment candidates. (A′′-D′′ and A′′′-D′′′) Same as (A-D and A′-D′) but for the fragment candidates of protein TM1442.

Figure S10. CS-Rosetta structure generation of protein MrR16 with chemical shift errors. (A-D) Plots of Rosetta all-atom energy versus C^α rmsd relative to the experimental MrR16 structure for the CS-Rosetta models obtained by using the MFR-selected fragment candidates with the inputs of the simulated chemical shift assignment datasets IIIa-IIId (see Method). The (normalized) number of structures found for a given C^α rmsd is plotted at the bottom of each panel. (A′-D′) Plots of Rosetta all-atom energy, rescored by using the input chemical shifts (as contained in the datasets IIIa-IIId), versus C^α rmsd relative to the experimental MrR16 structure. (A′′-D′′ and A′′′-D′′′) Same as (A-D and A′-D′) but for the CS-Rosetta models generated using the hybrid fragment selection method.

Figure S11. CS-Rosetta structure generation of protein TM1442 with chemical shift errors. (A-D) Plots of Rosetta all-atom energy versus C^α rmsd relative to the experimental TM1442 structure for the CS-Rosetta models obtained by using the MFR-selected fragment candidates with the inputs of the simulated chemical shift assignment datasets IIIa-IIId (see Method). The (normalized) number of structures found for a given C^α rmsd is plotted at the bottom of each panel. (A′-D′) Plots of Rosetta all-atom energy, rescored by using the input chemical shifts (as contained in the datasets IIIa-IIId), versus C^α rmsd relative to the experimental TM1442 structure. (A′′-D′′ and A′′′-D′′′) Same as (A-D and A′-D′) but for the CS-Rosetta models generated using the hybrid fragment selection method.

Figure S12. Differences between chemical shifts obtained using solid-state and solution NMR spectroscopy for proteins GB3 (left) and ubiquitin (right). For each protein, the differences in δ¹³C^α (A,E), δ¹³C^β (B,F), δ¹³C′ (C,G) and δ¹⁵N (D,H) are plotted.

Figure S13. CS-Rosetta structure generation for paramagnetic proteins. For proteins calbindin (A,B,C) and ferredoxin (D,E,F), the all-atom models were generated using a CS-Rosetta protocol with the MFR and the hybrid fragment selection methods separately, and their Rosetta all-atom energies are plotted in blue (A,B,D,E) and red (C,F), respectively, as a function of model quality. (A,D) Plots of Rosetta all-atom energy, rescored by using the experimental NMR chemical shifts, versus C^α rmsd of all-atom models relative to the experimental structures. (B,C,E,F) Plots of Rosetta all-atom energy, rescored by using the experimental NMR chemical shifts, versus C^α rmsd of all-atom models relative to the model with the lowest energy (shown as a bold dot on the vertical axis). C^α rmsd values are calculated for residues in secondary structure elements only: residues 3-14, 25-40, 46-53 and 63-74 for calbindin, and residues 4-11, 15-22, 27-34, 54-56, 71-75 and 91-93 for ferredoxin.
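Restricting the rmsd to the listed secondary structure residues can be sketched as follows. This is a minimal illustration, not the procedure used to prepare the figure: it assumes the two structures have already been superimposed and that C^α coordinates are available as a mapping from residue number to (x, y, z). The names `parse_ranges` and `ca_rmsd` are hypothetical.

```python
import math


def parse_ranges(spec):
    """Expand a string such as '3-14, 25-40' into a set of residue numbers."""
    residues = set()
    for part in spec.split(","):
        start, end = (int(x) for x in part.strip().split("-"))
        residues.update(range(start, end + 1))
    return residues


def ca_rmsd(coords_a, coords_b, residues):
    """C-alpha rmsd (Angstrom) over the selected residues only.

    coords_a, coords_b: dict residue number -> (x, y, z) of the C-alpha atom,
    assumed to be superimposed beforehand. Residues missing from either
    structure are skipped; at least one selected residue must be present
    in both structures.
    """
    common = [r for r in sorted(residues) if r in coords_a and r in coords_b]
    sq_sum = sum(math.dist(coords_a[r], coords_b[r]) ** 2 for r in common)
    return math.sqrt(sq_sum / len(common))


# Residue selection for calbindin, taken from the caption above.
CALBINDIN_SS = parse_ranges("3-14, 25-40, 46-53, 63-74")
```

In practice the superposition itself would normally be carried out over the same residue selection before the rmsd is evaluated.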

Table S1.
Protein	Dataset	RMS^{MFR #}	RMS^{Hybrid *}	RMS^{CS23D ║}
MrR16	Ii	2.39/2.97	2.22/2.83	1.63/2.44
MrR16	Ij	1.52/2.28	2.40/3.24	2.03/2.68
MrR16	IIe	X^¶	2.08/2.57	1.77/2.50
MrR16	IIIb	2.46/3.19	2.04/2.76	1.78/2.46
TM1442	Ii	1.76/2.40	1.51/2.19	2.15/2.64
TM1442	Ij	1.09/1.88	1.08/1.74	1.90/2.45
TM1442	IIe	X	2.31/2.98	1.92/2.44
TM1442	IIIb	X	1.65/2.25	1.97/2.55
GB3	-	0.71/1.28	0.73/1.70	0.78/1.31
Ubiquitin	-	0.69/1.22	0.86/1.49	0.85/1.40
Calbindin	-	X	1.50/2.10	2.53/3.20
Ferredoxin	-	X	2.06/3.54	2.24/3.85

Table S2.
Protein Name	N^{NR #}	N^{CS *}
MrR16	2	0
TM1442	32	3
GB3	104	1
Ubiquitin	190	6
Calbindin	167	30
Ferredoxin	82	7
^# N^NR: number of homologues in the NR PDB database, most of which are used by the CS23D fragment search (only those with an “exact matching structure” are excluded). ^* N^CS: number of homologues in the CS-Rosetta database, all of which are excluded from the CS-Rosetta fragment search in this work.