CS-ROSETTA: System for chemical shifts based protein structure prediction using ROSETTA

As described in papers:

Consistent blind protein structure generation from NMR chemical shift data 
Yang Shen, Oliver Lange, Frank Delaglio, et al.
Proc Natl Acad Sci USA, (2008) 105, 4685-4690

De novo protein structure generation from incomplete chemical shift assignments
Yang Shen, Robert Vernon, David Baker and Ad Bax
J Biomol NMR, (2009) 43, 63-78

Contact:       shenyang@niddk.nih.gov; bax@nih.gov
Web:       http://spin.niddk.nih.gov/bax


What is CS-ROSETTA?

To date, interpretation of isotropic chemical shifts in structural terms is largely based on empirical correlations gained from the mining of protein chemical shifts deposited in the BMRB, in conjunction with the known corresponding 3D structures. Chemical-Shift-ROSETTA (CS-ROSETTA) is a robust protocol to exploit this relation for de novo protein structure generation, using as input parameters the 13CA, 13CB, 13C', 15N, 1HA and 1HN NMR chemical shifts. These shifts are generally available at the early stage of the traditional NMR structure determination procedure, prior to the collection and analysis of structural restraints. The CS-ROSETTA approach, as shown below, utilizes SPARTA-based selection of protein fragments from the PDB, in conjunction with a regular ROSETTA Monte Carlo assembly and relaxation procedure. Evaluation of 16 proteins, varying in size from 56 to 129 residues, yielded full-atom models that deviate by 0.7-1.8 angstrom backbone rmsd from the experimentally determined X-ray or NMR structures. The strategy has also been successfully applied in a blind manner to a set of structural genomics targets with molecular weights up to 16 kDa, whose conventional NMR structure determination was conducted in parallel.

In addition, an alternative CS-ROSETTA fragment selection protocol is provided that improves robustness of the method for proteins with missing or erroneous NMR chemical shift input data. This strategy, which uses traditional Rosetta for pre-filtering of the fragment selection process, is demonstrated for two paramagnetic proteins and also for two proteins with solid-state NMR chemical shift assignments.

csRosetta flowchart


Contents

Download and installation

Preparation of input chemical shift table
    Identification and exclusion of flexible tails and loops

How to use CS-ROSETTA (with ROSETTA2.x)
    Fragment selection
       MFR method
       Hybrid method
    Protein structure generation
    Evaluation of CS-ROSETTA models

How to use CS-ROSETTA (with ROSETTA3.x) (new)

How to select CS-ROSETTA models
    Criteria for convergence and accepting models
    Number of models required

How to use CS-ROSETTA for symmetric homomultimers (new)

How to use CS-ROSETTA with RDC data (new)

FAQs (new)

 

Download and installation

Installation files (for Linux/Mac)

Install.com CS-ROSETTA installation C-shell script (last modified 11/17/2009, version 1.01)
CSRosetta.tar.Z CS-ROSETTA main package, scripts, examples (last modified 11/17/2009, version 1.01)
PDBH.tar.Z Database of PDB files, required by MFR/CS-ROSETTA
CS.tar.Z Database of chemical shifts files, required by MFR/CS-ROSETTA
ANGLESS.tar.Z Database of secondary structure classifications and ROSETTA-idealized backbone torsion angles, required by CS-ROSETTA
"Hybrid" database New database (required for "hybrid" fragment selection method)
changeLog  


Installation

The current implementation of CS-ROSETTA requires the MFR program to perform fragment selection and the ROSETTA program to conduct the protein structure prediction. Therefore, before installing and using the CS-ROSETTA package, the newest NMRPipe (incl. the MFR modules stored in dyn.tar.Z, mfr.tar.Z, pdbH.tar.Z) (http://spin.niddk.nih.gov/NMRPipe) and ROSETTA programs (http://www.rosettacommons.org/software/) MUST be installed.

Note: If your NMRPipe was obtained and installed before 07/2007, please email Frank Delaglio to request the newest NMRPipe package in order to run CS-ROSETTA

Installation of ROSETTA (2.3)

To install the ROSETTA2.x program, users need to (1) register for a license at http://www.rosettacommons.org/software/, (2) download (at least) the required packages (RosettaBundle-2.3.0.tgz and rosetta_fragments-2.3.0.tgz) to the installation directory (for example $ROSETTA_DIR), (3) uncompress RosettaBundle-2.3.0.tgz, all four generated sub-packages (with names of rosetta*.tgz) and rosetta_fragments-2.3.0.tgz, and (4) go to the rosetta++ directory and type "make gcc" (for Linux installation) to compile the ROSETTA source code. After successful compilation, a ROSETTA executable with the default name "rosetta.gcc" will be generated.

Use of the 'hybrid' fragment selection procedure requires a set of initial fragment candidates selected by the standard ROSETTA fragment selection procedure, which can be performed locally with a perl script make_fragments_2000.pl (modified from the script make_fragments.pl of the ROSETTA fragment package). To run this script, several other programs, such as PSI-BLAST, PSIPRED, JUFO, PROFphd and/or SAM, are required to be installed and properly configured. Please see here (ROSETTA website) for the details on how to properly install those required programs and use this script. (Note: a set of database files specifically prepared for the 'hybrid' fragment selection method are required and can be downloaded here)

Instructions for installing ROSETTA 3.x can be found in the section "Installation of ROSETTA (3.0/3.1/3.2)" below.

Installation of CS-ROSETTA

Download all of the above files to the installation directory ($baseDir) and type 'install.com' to start the installation. A correctly installed CS-ROSETTA package contains the following in the installation directory $baseDir:

./com                           CS-ROSETTA scripts directory
    Adjust_CS_Offset.com        Add an offset to a given type of chemical shift
    bmrb2fasta.com              Generate a FASTA sequence file from a BMRB chemical shift file
    bmrb2talos.com              Convert a BMRB chemical shift file to TALOS-format
    correctCACB_D_effect.com    Apply deuteration effect corrections for 13CA and 13CB chemical shifts
    cs2fasta.com                Generate a FASTA sequence file from the header of a TALOS chemical shift table file
    csrosettaInit.com           CS-ROSETTA initialization script
    extract_pdb.com             Generate PDB full-atom coordinates files from a ROSETTA(2.x) silent output file
    extract_lowscore_decoys.py  Python script to extract N lowest energy models from a ROSETTA silent output file to a new silent output file
    fasta2pdb.com               Generate a 'dummy'-PDB file from a FASTA sequence file, which is required by MFR as the reference structure
    make_fragments_2000.pl      Generate 2000 initial ROSETTA fragments (used by runCSRjob_hybrid.com for a hybrid fragment selection module procedure; modified from the original script in the ROSETTA_fragment package)
    mfr2rosetta.com             Convert a typical MFR-format fragment file to ROSETTA(2.x) format
    mfr.tcl                     Modified MFR starting script from the original 'mfr.tcl' in NMRPipe, required by CS-ROSETTA
    mfr.com                     Conduct MFR fragments search
    paths.txt                   Template of "paths.txt" file for ROSETTA
    pdb2fasta.com               Generate a FASTA sequence file from a PDB file
    pdbrms                      Calculate CA-rmsd for a set of PDB coordinates
    renumber_pdb.com            Renumber sequence number for a PDB coordinate file
    rosetta2GDB.com             Convert a ROSETTA(2.x) fragment file to a generic table file
    rosettaFrag2csFrag.com      Apply chemical shift scoring for a set of ROSETTA fragment candidates
    runCSRjob.com               Run MFR fragment search and prepare inputs for the next step of ROSETTA(2.x) fragment assembly
    runCSRjob_hybrid.com        Run fragment search using a hybrid method and prepare inputs for the next step of ROSETTA(2.x) fragment assembly
    runRosetta.com              Template of ROSETTA(2.x) starting script
    runCSRescore.com            Re-score ROSETTA (2.x) full-atom models using experimental chemical shifts
    runRosettaRescore.com       Extract and rescore segments from ROSETTA(2.x) full-atom models
    sparta                      Starting script for program SPARTA
    talos                       Starting script for program TALOS (a 'fast' C++ version)

./PDBH                          Directory of PDB coordinates database
./CS                            Directory of SPARTA-assigned chemical shifts database
./ANGLESS                       Directory of secondary structure classification and ROSETTA-idealized backbone torsion angles database

./src                           Directory of supporting programs
    SPARTA                      Directory for the chemical shifts prediction program SPARTA
    TALOS                       Directory for the backbone torsion angle prediction program TALOS (a 'fast' C++ version)
    mfr2rosetta                 Directory for a C++ program to convert MFR-format fragments to ROSETTA-format
    nnmake                      Directory for a Fortran program to generate the initial ROSETTA fragment candidates (modified from the original program in the ROSETTA_fragment package)
    pdbrms                      Directory for a C++ program to calculate RMSD value between two sets of PDB coordinates
    rosettaFrag2csFrag          Directory for a C++ program to calculate chemical shift score for a set of ROSETTA fragment candidates

./example                       Directory of a complete example for protein GB3
    ./input/gb3.*               GB3 input experimental chemical shift table (.tab) and reference experimental structure (.pdb)
    ./output_hybrid/            All sample output files for GB3 CS-ROSETTA runs using the 'hybrid' fragment selection
    ./output_rosetta2/          All sample output files for GB3 CS-ROSETTA runs using the ROSETTA2.x
    ./output_rosetta3/          All sample output files for GB3 CS-ROSETTA runs using the ROSETTA3.x
    ./output_rosetta_FnD/       All sample output files for GB3 homodimer CS-ROSETTA runs using the ROSETTA2.3.0
    runCSRtest.com              Script to test CS-ROSETTA installation

The initialization script csrosettaInit.com includes the definitions for all environmental variables required by CS-ROSETTA. Please check if the variables $rosettaDir and $csrosettaDir defined in csrosettaInit.com:

   setenv rosettaDir /home/software/ROSETTA
   setenv csrosettaDir /home/software/CSROSETTA

are correctly configured according to the ROSETTA and CS-ROSETTA installation directories.

In order to use the package, users MUST first execute the above initialization script, for instance by adding the following command to their .cshrc file:

   if (-e $baseDir/com/csrosettaInit.com) then
      source $baseDir/com/csrosettaInit.com
   endif
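
Before running any of the scripts, it can help to confirm that the two variables named above are visible in the current environment. The following Python sketch is only an illustration and is not part of the CS-ROSETTA package; the function name check_csrosetta_env is hypothetical, and the check is limited to "is the variable set and does the directory exist".

```python
import os


def check_csrosetta_env(env=None):
    """Return a list of problems with the rosettaDir / csrosettaDir variables.

    'env' defaults to the process environment; passing a dict makes the
    check easy to try out without touching the real environment.
    """
    env = os.environ if env is None else env
    problems = []
    for name in ("rosettaDir", "csrosettaDir"):
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif not os.path.isdir(value):
            problems.append(f"{name} points to a missing directory: {value}")
    return problems


# Report what the current shell environment looks like.
for message in check_csrosetta_env() or ["rosettaDir and csrosettaDir look fine"]:
    print(message)
```

An empty list means both variables are set and point to existing directories; it does not verify that the installations themselves are complete.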

If the package is installed successfully and environmental variables are set up correctly, the script in the example directory (examples/runCSRTest.com) should work successfully.


 

Preparation of input chemical shift table

CS-ROSETTA utilizes the protein backbone 15N, 1HN, 1HA, 13CA, 13CB and 13C' chemical shifts as inputs to search a structural database for the best-matched fragments. The chemical shifts need to be in TALOS format, as defined at http://spin.niddk.nih.gov/NMRPipe/talos/#preparing shifts, and contain ONLY backbone 15N, 1HN, 1HA, 13C', 13CA and 13CB shifts. If starting from a standard BMRB-format chemical shift file, the C-shell script $baseDir/com/bmrb2talos.com can be used to generate a TALOS-format chemical shift file:

   bmrb2talos.com bmrbCS.str > inCS.tab

An example of the chemical shifts input format can also be found in the file $baseDir/examples/input/gb3.tab:

   DATA SEQUENCE MQYKLVINGK TLKGETTTKA VDAETAEKAF KQYANDNGVD GVWTYDDATK
   DATA SEQUENCE TFTVTE

   VARS RESID RESNAME ATOMNAME SHIFT 
   FORMAT %4d %1s %4s %8.3f

      1 M   CA   54.519
      1 M   CB   29.320
      1 M   HA    4.189
      2 Q    C  174.318
      2 Q   CA   55.632
      2 Q   CB   30.865
      2 Q   HN    8.347
      2 Q   HA    5.109
      2 Q    N  123.775
      ...

Note that missing chemical shift data are allowed, but the amino acid sequence shown in the header MUST be the full sequence of the protein and MUST start from residue #1, because CS-ROSETTA will generate protein structures with the sequence defined in the header.
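
To make the table layout concrete, here is a minimal Python sketch that reads a table of the kind shown above into a sequence string and a dictionary of shifts. It is not the parser used by TALOS, MFR or CS-ROSETTA; the function name parse_talos_table is hypothetical, and the sketch assumes exactly the four columns declared in the VARS line of the example (RESID, RESNAME, ATOMNAME, SHIFT).

```python
def parse_talos_table(text):
    """Parse a TALOS-format shift table into (sequence, shifts).

    shifts maps (residue_number, atom_name) to the chemical shift in ppm.
    Only the four-column layout of the example table is handled.
    """
    sequence_parts = []
    shifts = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[:2] == ["DATA", "SEQUENCE"]:
            # The sequence may be split over several header lines and blocks.
            sequence_parts.extend(fields[2:])
        elif fields[0] in ("DATA", "VARS", "FORMAT", "REMARK"):
            continue
        elif len(fields) == 4 and fields[0].lstrip("-").isdigit():
            resid, _resname, atom, shift = fields
            shifts[(int(resid), atom)] = float(shift)
    return "".join(sequence_parts), shifts


example = """\
DATA SEQUENCE MQYKLVINGK TLKGETTTKA VDAETAEKAF KQYANDNGVD GVWTYDDATK
DATA SEQUENCE TFTVTE

VARS RESID RESNAME ATOMNAME SHIFT
FORMAT %4d %1s %4s %8.3f

   1 M   CA   54.519
   2 Q    N  123.775
"""
sequence, shifts = parse_talos_table(example)
print(len(sequence), "residues,", len(shifts), "shifts")
```

Residues without any entry in the dictionary are simply residues with missing data, which the format allows.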
 

 Identification and exclusion of flexible tails and loops

Residues from the disordered tails and loops are identified from the input chemical shifts and the criteria in the paper:
(1) S2 < 0.7 from the RCI analysis, and
(2) no "good" predictions using the program TALOS

An RCI analysis can be performed with the RCI server or by running a TALOS+ analysis.

It is recommended to exclude positively identified flexible residues in the N- and C-terminal tails from the structure prediction by preparing a 'truncated' input chemical shift table file that excludes the data (both sequence and chemical shifts) for those residues. Note that the remaining residues must be renumbered so that the 'truncated' sequence starts from residue #1. For long flexible loops, it is advantageous to exclude their terms from the ROSETTA full-atom energy and from the chemical shift rescoring (see details in Evaluation of CS-ROSETTA models).
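
The truncation and renumbering of flexible tails can be sketched in a few lines of Python. This is an illustration only; the function name truncate_and_renumber and the in-memory representation (a sequence string plus a dictionary keyed by residue number and atom name) are assumptions made for the example, not something the package provides.

```python
def truncate_and_renumber(sequence, shifts, first, last):
    """Keep residues first..last (1-based, inclusive) and renumber from 1.

    sequence is a one-letter string; shifts maps (resid, atom) -> ppm.
    """
    if not 1 <= first <= last <= len(sequence):
        raise ValueError("first/last must lie within the sequence")
    new_sequence = sequence[first - 1:last]
    new_shifts = {
        (resid - first + 1, atom): value
        for (resid, atom), value in shifts.items()
        if first <= resid <= last
    }
    return new_sequence, new_shifts


# Drop a two-residue N-terminal tail and a one-residue C-terminal tail.
seq, cs = truncate_and_renumber(
    "MQYKLVINGK",
    {(1, "CA"): 54.519, (3, "CA"): 57.0, (10, "N"): 120.0},
    first=3,
    last=9,
)
print(seq, cs)
```

The same first/last offsets must be remembered when comparing the resulting models with structures that use the original numbering.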


 

How to use CS-ROSETTA (with Rosetta2.x)

CS-ROSETTA has been designed to work in a black-box manner, and is supported by multiple C-shell/C++/python/perl scripts/programs. To use CS-ROSETTA for protein structure prediction, users can simply follow a three-step procedure:

  1. Fragment selection. Run runCSRjob.com inCS.tab or runCSRjob_hybrid.com inCS.tab, where inCS.tab  is the input experimental chemical shift file, to select fragments from the structure database and to prepare a starting package for the subsequent ROSETTA structure generation
  2. Fragment assembly and structure generation. Run runRosetta.com (prepared in the previous step) to perform ROSETTA structure prediction and generate full-atom structures
  3. Structure evaluation and selection. Run runCSrescore.com models.out inCS.tab, where models.out is the silent output file of the ROSETTA models generated in the previous step, to re-evaluate the generated ROSETTA structures by adding a chemical shift term to the CS-ROSETTA full-atom energy, and select the lowest energy structures once convergence is confirmed

csRosetta flowchart 2

Details of each step are provided below:

1. Fragment selection

1.1 MFR method - runCSRjob.com

The script runCSRjob.com is the master script to generate MFR fragment candidates. To use this script, an input chemical shift filename MUST be specified in a command line such as:

   runCSRjob.com inCS.tab

The following optional variables are hard-coded in script runCSRjob.com:

   set OUTPUT_DIR = rosetta
   set MFR_EXCL = "PDB1_ PDB2A PDB3X"

   set PDB_NAME = t000
   set CHAIN = _
   set TAG = aa
   set N_MODEL = 5000 

where $OUTPUT_DIR is the output directory for the fragments and the scripts for running ROSETTA (i.e., the ROSETTA running directory); $MFR_EXCL is a list of proteins with sequences homologous to the target protein, or known to be structural homologs of it, which users may wish to exclude (for test purposes) from the MFR structural database prior to the MFR fragment search; $PDB_NAME and $CHAIN are the 4-letter and 1-letter dummy names indicating the protein name and chain ID, respectively; $TAG is a 2-letter dummy tag (used by ROSETTA); and $N_MODEL is the total number of models to be predicted.

The output of running this script will be the MFR-selected 9-residue and 3-residue fragment candidates (with file names of $OUTPUT_DIR/$TAG$PDB_NAME$CHAIN09_05.200_v1_3 and $OUTPUT_DIR/$TAG$PDB_NAME$CHAIN03_05.200_v1_3, respectively), a FASTA sequence file ($OUTPUT_DIR/$PDB_NAME$CHAIN.fasta), a ROSETTA paths definition file ($OUTPUT_DIR/paths.txt) and a script file ($OUTPUT_DIR/runRosetta.com) for starting ROSETTA structure generation.
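
The file names listed above are plain concatenations of the hard-coded variables. The Python sketch below spells this out with the default values; the function name csrosetta_output_names and the dictionary keys are hypothetical and serve only to label the results.

```python
def csrosetta_output_names(output_dir="rosetta", tag="aa", pdb_name="t000", chain="_"):
    """Build the output file names from OUTPUT_DIR, TAG, PDB_NAME and CHAIN."""
    stem = f"{tag}{pdb_name}{chain}"
    return {
        "frag9": f"{output_dir}/{stem}09_05.200_v1_3",   # 9-residue fragments
        "frag3": f"{output_dir}/{stem}03_05.200_v1_3",   # 3-residue fragments
        "fasta": f"{output_dir}/{pdb_name}{chain}.fasta",
        "paths": f"{output_dir}/paths.txt",
        "script": f"{output_dir}/runRosetta.com",
    }


for label, path in csrosetta_output_names().items():
    print(f"{label:7s} {path}")
```

With the defaults this gives, for example, rosetta/aat000_09_05.200_v1_3 for the 9-residue fragment file and rosetta/t000_.fasta for the sequence file.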

It is recommended for users to read through and understand this script, which will facilitate use of the MFR program and other supporting scripts, and solve any problems during the MFR search (for example, by checking the intermediate and log files). This script (1) checks input chemical shifts and prepares input files for the MFR program, (2) performs the MFR fragment search, (3) converts the selected fragments to ROSETTA format and (4) prepares inputs and script for running ROSETTA structure generation. In detail:

(1) Check input chemical shifts and Prepare MFR inputs

The current MFR program in NMRPipe requires a TALOS-format chemical shift table file as input (see the section "Preparation of input chemical shift table"). A pre-check step is performed first to check for possible chemical shift referencing problems and chemical shift errors/outliers. Chemical shifts with an identified referencing problem, for example a 2.7 ppm referencing offset for the 13CA chemical shifts, can be corrected by using:

   Adjust_CS_Offset.com inCS.tab CA -2.7
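
What such an offset correction amounts to can be sketched in Python as follows. This is not the Adjust_CS_Offset.com script, whose internals are not described here; the function name adjust_cs_offset and the dictionary representation of the shifts are assumptions made for the example. The sketch simply adds the given offset to every shift of one atom type and leaves all other shifts untouched.

```python
def adjust_cs_offset(shifts, atom, offset):
    """Return a copy of shifts with 'offset' (ppm) added to one atom type.

    shifts maps (resid, atom_name) -> ppm; values are rounded to the
    three decimals used in the TALOS table format.
    """
    return {
        (resid, name): round(value + offset, 3) if name == atom else value
        for (resid, name), value in shifts.items()
    }


# 13CA shifts that are 2.7 ppm too high are corrected with an offset of -2.7.
shifts = {(1, "CA"): 57.219, (1, "CB"): 29.320, (2, "CA"): 58.332}
print(adjust_cs_offset(shifts, "CA", -2.7))
```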

A "reference PDB" coordinate file is needed to provide the MFR program with the protein's amino acid sequence. For this purpose, runCSRjob.com creates a dummy "reference PDB" coordinate file from its FASTA sequence using:

   fasta2pdb.com protein.fasta > dummy_ref.pdb

(2) Run MFR fragment search

The script runCSRjob.com then conducts an MFR fragment search using a command line such as:

   mfr.com -cs inCS.tab -ref ref.pdb -out out.tab -excl $MFR_EXCL >& log

where $MFR_EXCL is a list of protein names (without the .pdb suffix) that users wish to exclude from the MFR structural database prior to the fragment search. MFR fragment selection by searching the structural database is a time-consuming job (1-3 hours); the 'log' file can be used to check the progress.

(3) Convert fragments to ROSETTA format

The MFR-selected 9-residue and 3-residue fragment candidates, generated by the MFR fragment search, have the file names frag9.$PDB_NAME.mfr.tab and frag3.$PDB_NAME.mfr.tab, respectively. The script mfr2rosetta.com can be used to convert these fragments to the standard ROSETTA format:

   mfr2rosetta.com -mfr frag9.$PDB_NAME.mfr.tab -segLength 9 > frag9.$PDB_NAME.rosetta.tab
   mfr2rosetta.com -mfr frag3.$PDB_NAME.mfr.tab -segLength 3 > frag3.$PDB_NAME.rosetta.tab

(4) Prepare ROSETTA running package

runCSRjob.com generates at the last step a new directory (defined by variable $OUTPUT_DIR) and prepares the following required inputs and script for running ROSETTA fragment assembly and structure generation:

  • 9-residue and 3-residue fragment files $TAG$PDB_NAME$CHAIN09_05.200_v1_3 and $TAG$PDB_NAME$CHAIN03_05.200_v1_3

  • FASTA sequence file $PDB_NAME$CHAIN.fasta

  • ROSETTA paths definition file paths.txt

  • starting script file runRosetta.com for ROSETTA structure generation

 

1.2 Hybrid method - runCSRjob_hybrid.com

For proteins with incomplete or imperfect chemical shift assignments, the hybrid fragment selection method is recommended for selecting fragment candidates (see Shen et al., J Biomol NMR, 2009, cited above). The script runCSRjob_hybrid.com is the master script to generate fragment candidates with the hybrid method. To use this script, an input chemical shift table file (in TALOS format) MUST be specified in a command line such as:

   runCSRjob_hybrid.com inCS.tab

The following variables are hard-coded in this script, analogous to those in the script runCSRjob.com:

   set OUTPUT_DIR = rosetta
   set PDB_NAME = t000
   set CHAIN = _
   set TAG = aa
   set N_MODEL = 5000 

The output of running this script will be the 9-residue and 3-residue fragments (with file names of $OUTPUT_DIR/$TAG$PDB_NAME$CHAIN09_05.200_v1_3 and $OUTPUT_DIR/$TAG$PDB_NAME$CHAIN03_05.200_v1_3, respectively), a FASTA sequence file ($OUTPUT_DIR/$PDB_NAME$CHAIN.fasta), a ROSETTA paths definition file ($OUTPUT_DIR/paths.txt) and a script file ($OUTPUT_DIR/runRosetta.com) for starting ROSETTA structure generation.

This script (1) checks input chemical shifts, (2) performs a conventional ROSETTA fragment search for the initial fragment candidates, (3) performs a "MFR" fragment search for the final fragment candidates and (4) prepares inputs and script for running ROSETTA structure generation. In detail:

(1) Pre-check chemical shifts

Similar to the script runCSRjob.com, a pre-check step will be performed first to check for the possible chemical shift referencing problems and chemical shift errors/outliers.  A FASTA sequence file (with filename of t000_.fasta by default) will be generated for the next step.

(2) Run initial ROSETTA fragment search

The script runCSRjob_hybrid.com then conducts a "standard" ROSETTA fragment search using a command line such as:

   make_fragments_2000.pl t000_.fasta >& make_fragments_2000.log

where make_fragments_2000.pl is a perl script to generate 2000 ROSETTA fragment candidates for each overlapping 3-residue and 9-residue target fragment. The output of running this command line will be two files: $TAG$PDB_NAME$CHAIN09_05.000_v1_3 and $TAG$PDB_NAME$CHAIN03_05.000_v1_3.

An option of "-nohoms" can be used in order to omit homologs (with a PSI-BLAST e-score < 0.05) from the search:

   make_fragments_2000.pl -nohoms t000_.fasta >& make_fragments_2000.log

Note that running the script make_fragments_2000.pl on a low-end computer may fail because of its large memory requirement.

(3) Run "MFR" fragment search

The program rosettaFrag2csFrag is then used to calculate the chemical shift scores for the above initial fragment candidates:

   rosettaFrag2csFrag -rosetta $FRAG9_FNAME.tab -cs inCS2.tab -segLength 9 -out $FRAG9_FNAME.csScore.tab -noverb

where $FRAG9_FNAME is the filename of the initial ROSETTA fragment candidates and inCS2.tab is the input secondary chemical shift table file prepared in step (1). The output of running this command line will be a new fragment candidate file $FRAG9_FNAME.csScore.tab containing a chemical shift score for each fragment candidate.

The final 200 fragment candidates with the best chemical shift scores are kept for each overlapped 3-residue and 9-residue target fragment.
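
The final filtering step can be pictured with the following Python sketch. It is not the code used by runCSRjob_hybrid.com; the function name keep_best_fragments and the tuple layout (position, fragment id, score) are hypothetical, and the sketch assumes that a lower chemical shift score means a better match, which is not stated in the text above.

```python
from collections import defaultdict


def keep_best_fragments(candidates, n_keep=200):
    """Keep the n_keep best-scoring candidates for each target position.

    candidates is an iterable of (position, fragment_id, cs_score) tuples.
    ASSUMPTION: a lower cs_score is a better match.
    """
    by_position = defaultdict(list)
    for position, fragment_id, cs_score in candidates:
        by_position[position].append((fragment_id, cs_score))
    return {
        position: sorted(fragments, key=lambda item: item[1])[:n_keep]
        for position, fragments in by_position.items()
    }


candidates = [(1, "a", 3.0), (1, "b", 1.0), (1, "c", 2.0), (2, "d", 5.0)]
print(keep_best_fragments(candidates, n_keep=2))
```

In the hybrid protocol the same selection would be applied separately to the 3-residue and the 9-residue candidate sets, reducing each position from 2000 to 200 candidates.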

(4) Prepare ROSETTA running package

Similar to the script runCSRjob.com, the script runCSRjob_hybrid.com generates at its last step a new directory (defined by variable $OUTPUT_DIR) and prepares the required inputs and script for running ROSETTA fragment assembly and structure generation.

 

2. Protein structure generation - runRosetta.com

The script runRosetta.com generated by running runCSRjob.com or runCSRjob_hybrid.com can be used to start the standard ROSETTA structure generation (fragment assembly and full-atom relaxation). This script consists solely of a standard ROSETTA command line defining the inputs and parameters for fragment assembly and full-atom relaxation:

   $rosetta $TAG $PDB_NAME $CHAIN -silent -output_silent -constant_seed -jran $rand -increase_cycles 10 \
   -new_centroid_packing -abrelax -output_chi_silent -stringent_relax -vary_omega -omega_weight 0.5 -farlx \
   -ex1 -ex2 -termini -short_range_hb_weight 0.50 -long_range_hb_weight 1.0 -no_filters -rg_reweight 0.5 \
   -rsd_wt_helix 0.5 -rsd_wt_loop 0.5 -output_all -accept_all -do_farlx_checkpointing -relax_score_filter \
   -record_irms_before_relax -acceptance_rate 1.0 -filter1a 10000 -filter1b 10000 -nstruct $N_MODEL

where $TAG, $PDB_NAME, $CHAIN (variables for protein identities) and $N_MODEL (total number of predicted models) are automatically replaced by their values defined in the script runCSRjob.com.

Typing runRosetta.com in the working directory (where the files of the selected fragment candidates, FASTA sequence and paths.txt are stored) will start the standard ROSETTA structure prediction. The output of the ROSETTA run is a so-called silent output file with name $TAG$PDB_NAME.out, which includes the scores and full-atom descriptions for all accepted models.

The ROSETTA fragment assembly and relaxation procedure is computationally demanding (on the order of 5-10 minutes per model). Therefore, use of computer clusters is highly recommended. In order to run multiple ROSETTA jobs in parallel for a given project, users can simply run the same script runRosetta.com for each 'parallel' job in the same working directory; ROSETTA will output the results from each job to a single silent output file.

 

3. Evaluation of CS-ROSETTA models - runCSrescore.com

A ROSETTA silent output file $TAG$PDB_NAME.out contains a header in its first two lines (first line: the sequence information; second line: definition of scores and values), and for each model, a score line (which starts with "SCORE:") and the residue-specific description of this model. An example ROSETTA silent output file is shown below:

SEQUENCE: MQYKLVINGKTLKGETTTKAVDAETAEKAFKQYANDNGVDGVWTYDDATKTFTVTE
SCORE: score env pair vdw hs ss sheet cb rsigma hb_srbb hb_lrbb rg co contact rama bk_tot fa_atr fa_rep fa_sol h2o_sol hbsc fa_dun fa_intra fa_pair fa_plane fa_prob fa_h2o h2o_hb gsolt sasa omega_sc description
SCORE: -116.07 -12.24 -1.62 0.31 -3.25 -72.74 0.34 15.99 -28.93 -18.75 -29.62 11.42 17.69 0.00 -13.19 -134.45 -165.57 7.34 84.55 0.00 -3.13 22.82 0.07 -3.11 0.00 -15.74 0.00 0.00 45.19 3958.98 7.55 S_0001_9019
1 E -81.616 95.109 -180.850 0.000 0.000 0.000 -60.979 -52.601 -70.896 0.000 S_0001_9019
2 E -103.775 140.428 181.156 0.662 1.438 3.457 -61.314 178.939 7.568 0.000 S_0001_9019
3 E -128.914 145.607 180.081 -1.325 0.472 6.551 -62.403 85.133 0.000 0.000 S_0001_9019
...

Before rescoring ROSETTA full-atom models using the experimental chemical shifts, it is recommended to extract the models with low ROSETTA energy and 'discard' the 'bad' models with high ROSETTA energy. This expedites further analysis and can be performed by a simple command line:

   extract_lowscore_decoys.py $TAG$PDB_NAME.out N > new_silent_file.out

where the script extract_lowscore_decoys.py (courtesy of whip.bakerlab.org) is used to extract the N lowest-energy models (the energy of a model is defined as the first number in the SCORE line of a silent output file) from a silent file. The output is a new silent output file containing only the selected N lowest-energy models.
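
The selection logic can be sketched in Python as shown below. This is an illustration of the idea and not the extract_lowscore_decoys.py script itself; the function name extract_lowscore_models is hypothetical, and the sketch assumes the silent-file layout shown in the example above (header lines first, then one "SCORE:" line followed by the residue lines for each model).

```python
def extract_lowscore_models(lines, n):
    """Return the header plus the n lowest-energy model blocks.

    A model block starts at a 'SCORE:' line whose first value is numeric;
    the 'SCORE:' line listing the column names belongs to the header.
    """
    header, models = [], []
    for line in lines:
        fields = line.split()
        is_model_score = (
            len(fields) > 1 and fields[0] == "SCORE:" and fields[1] != "score"
        )
        if is_model_score:
            models.append((float(fields[1]), [line]))
        elif models:
            models[-1][1].append(line)   # residue line of the current model
        else:
            header.append(line)
    best = sorted(models, key=lambda model: model[0])[:n]
    return header + [line for _score, block in best for line in block]


silent = [
    "SEQUENCE: MQ",
    "SCORE: score description",
    "SCORE: -10.0 S_0001",
    "1 E -81.6 S_0001",
    "SCORE: -30.0 S_0002",
    "1 E -90.0 S_0002",
]
print("\n".join(extract_lowscore_models(silent, 1)))
```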

If residues in flexible loops are positively identified (e.g., by RCI analysis), the energy terms for those residues should be excluded from the ROSETTA full-atom energy. For this purpose, the energy of all models in a ROSETTA silent output file can be rescored by a script runRosettaRescore.com and a command line of:

   runRosettaRescore.com $TAG$PDB_NAME.out ref.pdb

which basically runs a single standard ROSETTA command line:

   ${rosetta} -extract_segment -segments seg.txt -n ref.pdb -s $TAG$PDB_NAME.out -fullatom -fa_input -all -termini

where "ref.pdb" is a reference PDB coordinate file required by the standard ROSETTA "-extract_segment" module (the C-alpha RMSD values relative to this structure are also calculated during the re-scoring; for instance, the full-atom PDB coordinate file extracted from the lowest energy ROSETTA model can be used here; see "Full-atom PDB coordinate extraction"), and "seg.txt" is a text file defining the segments for which ROSETTA will calculate the new full-atom energy. Here is an example of "seg.txt":

   6 60 80 99 120 130

which tells ROSETTA that only residues 6-60, 80-99 and 120-130 are kept and used for calculating the new full-atom energy for all models; this file MUST be located in the current directory. The output is a new ROSETTA silent file with the file name $TAG$PDB_NAME_segment.out.
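
A quick way to sanity-check a "seg.txt" file before a rescoring run is to read its numbers as consecutive start/end pairs, as in the Python sketch below. The function names parse_segments and residues_kept are hypothetical, and the pairwise interpretation is taken from the description above rather than from the ROSETTA source.

```python
def parse_segments(text):
    """Read whitespace-separated residue numbers as (start, end) pairs."""
    numbers = [int(token) for token in text.split()]
    if len(numbers) % 2:
        raise ValueError("seg.txt must contain an even number of residue numbers")
    segments = list(zip(numbers[0::2], numbers[1::2]))
    for start, end in segments:
        if start > end:
            raise ValueError(f"segment start {start} is after its end {end}")
    return segments


def residues_kept(segments):
    """Expand (start, end) pairs into the full list of kept residue numbers."""
    return [resid for start, end in segments for resid in range(start, end + 1)]


segments = parse_segments("6 60 80 99 120 130")
print(segments, len(residues_kept(segments)), "residues kept")
```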

The script runCSrescore.com can be used to apply the chemical shift based "energy-rescoring" for all ROSETTA full-atom models, using the following command line:

   runCSrescore.com silent_file.out inCS.tab

where inCS.tab is the initial experimental chemical shift input file (the same input file used by the script runCSRjob.com for the fragment selection procedure). This script:

  1. extracts full-atom PDB models from the ROSETTA silent output file silent_file.out

  2. for each full-atom PDB model: runs SPARTA chemical shift predictions; calculates the chi2 values between the SPARTA-predicted chemical shifts and the input experimental chemical shifts (stored in inCS.tab); stores the chi2 values as a table file CS_chi2.txt

  3. collects the raw all-atom energies from silent_file.out and generates a table file with name  name.rawscore.txt, applies corrections to raw energies using the chemical shift scores collected in (2) and generates a table file with name name.rescore.txt containing the rescored energy information.

  4. calculates the RMSD relative to the model with the lowest re-scored full-atom energy and generates a final output table file with name name.rms.rescore.txt.

Details of each step include:

(1) Full-atom PDB coordinate extraction

To rescore the ROSETTA full-atom models, the full-atom PDB coordinates are required to be 'extracted' using their full-atom description encoded in the silent output file, which can be done by using a script extract_pdb.com with a command line such as:

   extract_pdb.com silent_file.out

This script actually runs the following ROSETTA command to generate full-atom PDB coordinates from the information in silent output file 'silent_file.out':

   rosetta -extract -s silent_file.out -fa_input -all -termini -write_atoms_only

The extracted PDB coordinate files are named according to the 'description' labels in the silent output file. Note that they are stored in the directory defined in the OUTPUT PATHS section of the paths.txt file, for example:

   OUTPUT PATHS:
   movie ./
   pdb ./

(2) Calculate predicted chemical shifts and "chemical shift scores" for ROSETTA full-atom models

The ROSETTA full-atom models are evaluated by comparing the initial experimental chemical shifts with the SPARTA-predicted chemical shifts for the models. SPARTA takes standard PDB coordinates of a protein and predicts the backbone 15N, 1HN, 1HA, 13CA, 13CB and 13C' chemical shifts using the following command line:

   sparta -in inPDB.pdb -ref inCS.tab

where inPDB.pdb is the input PDB coordinate file and inCS.tab contains the input experimental chemical shifts (the same input as used for the MFR fragment search). The output is a file with the predicted chemical shifts; the chi2 value between the predicted and experimental chemical shifts is also calculated.

The script runCSrescore.com runs the SPARTA chemical shift prediction for all full-atom PDB coordinates generated above, and all outputs will be stored in a default directory "./pred".

(3) Collect energy score and apply correction using chemical shift chi2 value

For each full-atom model, its name and raw ROSETTA full-atom energy score are extracted from the ROSETTA silent output file silent_file.out and stored in a file "name.rawscore.txt". Next, the energy score is 'corrected' using the corresponding chemical shift chi2 value (stored in the file "CS_chi2.txt"). The new file with the model names and re-scored energies is "name.rescore.txt"; an example of its contents:

   # name raw_energy chi2 rescore_energy
   S_0001_9019 -116.07 251.854 -53.1065
   S_0002_6198 -124.99 140.1 -89.965
   S_0003_5600 -123.80 115.019 -95.0452
   S_0004_2308 -118.78 148.107 -81.7533
   S_0005_3837 -111.47 75.3417 -92.6346
   S_0006_9327 -119.05 110.179 -91.5052
   ...
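
The rows of this example table are numerically consistent with a simple linear correction, rescore_energy = raw_energy + 0.25 * chi2. The Python sketch below reproduces them under that reading. Note that the weight of 0.25 is inferred here from the example numbers and is not stated in the text, so it should be treated as an assumption; the names CS_WEIGHT and rescore are hypothetical.

```python
# ASSUMPTION: weight inferred from the example rows, not stated in the text.
CS_WEIGHT = 0.25


def rescore(raw_energy, chi2, weight=CS_WEIGHT):
    """Add the weighted chemical shift chi2 to the raw ROSETTA energy."""
    return raw_energy + weight * chi2


rows = [
    ("S_0001_9019", -116.07, 251.854),
    ("S_0002_6198", -124.99, 140.1),
    ("S_0003_5600", -123.80, 115.019),
]
print("# name raw_energy chi2 rescore_energy")
for name, raw_energy, chi2 in rows:
    print(name, raw_energy, chi2, round(rescore(raw_energy, chi2), 4))
```

A model with a good raw energy but a poor agreement with the experimental shifts (large chi2) is penalized, which is why the ranking by re-scored energy can differ from the ranking by raw energy.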

(4) Calculate RMSD values to the lowest energy model

The C_alpha RMSD values between each model and the model with the lowest re-scored energy are calculated using the script pdbrms. pdbrms is a C++ program used to calculate the coordinate RMSD values between one PDB format structure and a set of protein PDB coordinates. For example, to calculate the C-alpha RMSD between ref.pdb and 1.pdb, 2.pdb, 3.pdb, the following command line can be used:

   pdbrms ref.pdb 1.pdb 2.pdb 3.pdb

The script runCSrescore.com will identify the model with the lowest rescored energy, and calculate the C_alpha RMSD values to this model for all models in the ./output directory. The final output file "name.rms.rescore.txt" will contain the model names, C-alpha RMSD values and re-scored energies:

   # name rmsd rescored_energy
   S_0001_9019 1.883 -53.1065
   S_0002_6198 1.384 -89.965
   S_0003_5600 0.984 -95.0452
   S_0004_2308 2.788 -81.7533
   S_0005_3837 2.718 -92.6346
   S_0006_9327 1.943 -91.5052
   ...
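
For a first look at the final table, the models can be sorted by re-scored energy, as in the Python sketch below. It is only an illustration; the function names read_rescore_table and lowest_energy_models are hypothetical, and the sketch assumes the three-column layout shown above (name, rmsd, rescored energy).

```python
def read_rescore_table(text):
    """Parse 'name rmsd rescored_energy' rows; skip comments and other lines."""
    models = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3 or fields[0].startswith("#"):
            continue
        name, rmsd, energy = fields
        models.append((name, float(rmsd), float(energy)))
    return models


def lowest_energy_models(models, n=10):
    """Return the n models with the lowest re-scored energy."""
    return sorted(models, key=lambda model: model[2])[:n]


table = """\
# name rmsd rescored_energy
S_0001_9019 1.883 -53.1065
S_0002_6198 1.384 -89.965
S_0003_5600 0.984 -95.0452
S_0004_2308 2.788 -81.7533
S_0005_3837 2.718 -92.6346
S_0006_9327 1.943 -91.5052
"""
for name, rmsd, energy in lowest_energy_models(read_rescore_table(table), n=3):
    print(f"{name}  rmsd={rmsd:.3f}  energy={energy:.4f}")
```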

By default, the ROSETTA running directory (./rosetta) contains the following contents upon finishing the above CS-ROSETTA protein structure generation:

aat000_09_05.200_v1_3 9-residue fragment input for ROSETTA
aat000_03_05.200_v1_3 3-residue fragment input for ROSETTA
t000_.fasta FASTA sequence
paths.txt ROSETTA paths definition file
runRosetta.com starting script to run ROSETTA
aat000.out ROSETTA output silent file
./output Directory for full-atom PDB coordinates
S_*.pdb Extracted full-atom PDB coordinates
pred Directory for SPARTA chemical shifts prediction summaries

name.rawscore.txt Output table file of model names and raw ROSETTA full-atom energies
name.rms.rawscore.txt Output table file of model names, C_alpha RMSD values (to the lowest raw-energy model) and raw ROSETTA full-atom energies
CS_chi2.txt Output table file of chi2 values calculated between the SPARTA-predicted and experimental chemical shifts for each model
name.rescore.txt Output table file of model names and re-scored ROSETTA full-atom energies
name.rms.rescore.txt Output table file of model names, C_alpha RMSD values (to the lowest rescored-energy model) and re-scored ROSETTA full-atom energies
rms2LowRawscore.txt Output table file for model names and RMSD values relative to the model with the lowest raw energy
rms2LowRescore.txt Output table file of model names and C_alpha RMSD values relative to the model with the lowest re-scored energy


 

 How to use CS-ROSETTA (with Rosetta3.x)

Using Rosetta 3.x versions requires a different input data format and different command-line options.

Installation of ROSETTA (3.0/3.1/3.2)

To install ROSETTA3.x, users need to (1) register for a license at http://www.rosettacommons.org/software/, (2) download at least the required packages (for example, rosetta3.2.1_Bundles.tgz and rosetta3.2.1_fragments.tgz for Rosetta3.2.1) to the installation directory (for example $ROSETTA_DIR), (3) uncompress rosetta3.2.1_Bundles.tgz, all four sub-packages it generates (named rosetta*.tgz) and rosetta3.2.1_fragments.tgz, and (4) go to the rosetta_source directory and type "./scons.py bin mode=release" (for a Linux installation) to compile the ROSETTA3.x source code. After successful compilation, numerous ROSETTA executables are generated; links to those files can be found in the rosetta_source/bin/ directory.

The documentation for Rosetta3.2.1 is available at: http://www.rosettacommons.org/manuals/archive/rosetta3.2.1_user_guide.

 

Setup of ROSETTA3.x in CS-ROSETTA

If ROSETTA3.x is installed in the default way (i.e., with a database directory of $ROSETTA_DIR/rosetta_database, a source directory of $ROSETTA_DIR/rosetta_source and a bin directory of $ROSETTA_DIR/rosetta_source/bin for the links to the compiled executables), the installation script install.com can be used to install the CS-ROSETTA program and automatically set up ROSETTA3.x in CS-ROSETTA.

Otherwise, the following ROSETTA3.x configuration lines in the initialization script $baseDir/com/csrosettaInit.com need to be set manually:

setenv rosetta3Dir /home/software/rosetta3.1/
setenv rosetta3 ${rosetta3Dir}/rosetta3_source/bin/AbinitioRelax.linuxgccrelease
setenv rosetta3_extpdbs ${rosetta3Dir}/rosetta3_source/bin/extract_pdbs.linuxgccrelease
setenv rosetta3DB ${rosetta3Dir}/rosetta_database
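Before starting a long run, it can be useful to confirm that these settings point to existing files. The Python sketch below is one way to do so, assuming it is started from a shell in which csrosettaInit.com has already been sourced (setenv exports the variables to child processes). The function name is hypothetical and the check is not part of the CS-ROSETTA package.

```python
import os

# Variable names are the ones set in csrosettaInit.com above.
REQUIRED_VARS = ("rosetta3", "rosetta3_extpdbs", "rosetta3DB")


def missing_rosetta3_settings(env=None):
    """Return the names of variables that are unset or point to a nonexistent path."""
    env = os.environ if env is None else env
    missing = []
    for name in REQUIRED_VARS:
        path = env.get(name)
        if not path or not os.path.exists(path):
            missing.append(name)
    return missing


print("missing or invalid:", missing_rosetta3_settings())
```

An empty list means that all three paths exist; it does not verify that the executables were compiled correctly.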

 

Fragment selection (runCSRjob3.com)

An additional script runCSRjob3.com is provided to generate MFR fragment inputs for ROSETTA3.x. This script is used in the same way as the runCSRjob.com script. The output contains:

  • a 9-residue fragment file $TAG$PDB_NAME$CHAIN09.200_R3 and a 3-residue fragment file $TAG$PDB_NAME$CHAIN03.200_R3

  • a FASTA sequence file $PDB_NAME$CHAIN.fasta

  • a starting script file runRosetta3.com for ROSETTA3.x structure generation

 

Structure generation (runRosetta3.com)

The script runRosetta3.com generated in the previous step can be used to run ROSETTA3.x and generate protein structures. Similar to the runRosetta.com script, this script contains only a standard ROSETTA3.x command line defining the inputs and parameters for fragment assembly and full-atom relaxation.

${rosetta3} -database ${rosetta3DB} -in::file::frag3 $FRAG3_NAME -in::file::frag9 $FRAG9_NAME -in::file::fasta $FASTA_NAME -abinitio::use_filters false -increase_cycles 10 -rsd_wt_helix 0.5 -rsd_wt_loop 0.5 -rg_reweight 0.5 -abinitio::fastrelax -score::weights score13_env_hb -out::nstruct $N_MODEL -user_tag j001

Note that ROSETTA3.x names the generated structures differently from ROSETTA2.x: all ROSETTA3.x structures generated by a given run are named S_TAG_AAAAAAAA, where "TAG" is a user-defined tag (set with the "-user_tag" option) and "AAAAAAAA" is an 8-digit index number. Running multiple ROSETTA jobs in parallel for a given project is also different for ROSETTA3.x. Users can still run the same script runRosetta3.com for each 'parallel' job in the same working directory, but the -user_tag option MUST be different for each job. ROSETTA3.x writes the structures from all jobs running in the same working directory to a single silent output file (default.out), so any two ROSETTA3.x jobs with an identical -user_tag option will generate structures with identical names.
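To keep the tags distinct when many parallel jobs are submitted, the tags can be generated rather than typed by hand. The Python sketch below is an illustration only: the tag pattern (prefix "j" plus a three-digit job number) simply follows the "j001" example in the command line above, and both function names are hypothetical.

```python
def make_user_tags(n_jobs, prefix="j"):
    """Return n_jobs distinct tags such as j001, j002, ... for the -user_tag option."""
    return [f"{prefix}{i:03d}" for i in range(1, n_jobs + 1)]


def structure_name(user_tag, index):
    """Format a model name as S_TAG_AAAAAAAA with an 8-digit index number."""
    return f"S_{user_tag}_{index:08d}"


tags = make_user_tags(4)
assert len(set(tags)) == len(tags)  # every job gets its own tag

for tag in tags:
    # Show the name that a model with index 1 would receive for each job.
    print(tag, structure_name(tag, 1))
```

Each job then runs the same command line with its own tag substituted after -user_tag, so that model names in the shared silent file default.out do not collide.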

 

Evaluation of CS-ROSETTA models (runCSrescore3.com)

A separate script runCSrescore3.com is provided to evaluate the protein structures generated by ROSETTA3.x. This script is used in the same manner as the script runCSrescore.com for evaluation of ROSETTA2.x generated structures [see details here]. Note that ROSETTA3.x currently cannot calculate the energy scores for only part of the protein, so the final chemical shift rescored ROSETTA(3.x) energy is calculated from the ROSETTA(3.x) raw energy of all residues.


 

How to select CS-ROSETTA models

Criteria for convergence and accepting models

After finishing CS-ROSETTA structure generation, users have to decide whether the ROSETTA models are acceptable. For this purpose, it is convenient to plot the "landscape" of (re-scored) ROSETTA full-atom energies of all models with respect to their C_alpha RMSD values relative to the lowest-energy model, using the data stored in a file "name.rms.rescore.txt".

  1. If the 10 lowest-energy models all differ by less than 2 angstrom C_alpha RMSD from the model with the lowest (re-scored) energy (see the following example plot from the structure prediction of protein GB3), the structure prediction is deemed successful and the 10 lowest-energy models are accepted. Results where the clustering around the lowest-energy structure is looser than 2 angstrom may still be useful for further analysis, but such results should not be over-interpreted and could be in error.
      [Plot: result for GB3]
  2. If no clustering around low-energy models is observed (see the following example plot generated for protein nsp1), the structure prediction has not converged and the low-energy models cannot be accepted at this stage.
    [Plot: result for nsp1]
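The first criterion above can also be checked numerically instead of by eye. The Python sketch below parses rows in the "name rmsd rescored_energy" layout of name.rms.rescore.txt and tests whether the lowest-energy models all lie within the RMSD cutoff. The function names are hypothetical, and the sketch assumes the RMSD column is already relative to the lowest-energy model, as described for that file.

```python
def parse_rms_rescore(lines):
    """Parse 'name rmsd rescored_energy' rows, skipping comment and '...' lines."""
    models = []
    for line in lines:
        fields = line.split()
        if len(fields) != 3 or fields[0].startswith("#"):
            continue
        models.append((fields[0], float(fields[1]), float(fields[2])))
    return models


def is_converged(models, n_lowest=10, rmsd_cutoff=2.0):
    """True if the n_lowest lowest-energy models all have a C_alpha RMSD
    below rmsd_cutoff (angstrom) relative to the lowest-energy model."""
    if len(models) < n_lowest:
        return False  # too few models to apply the criterion
    lowest = sorted(models, key=lambda m: m[2])[:n_lowest]
    return all(rmsd < rmsd_cutoff for _, rmsd, _ in lowest)


# Small in-memory example using three rows from the table shown earlier.
example = [
    "# name rmsd rescored_energy",
    "S_0003_5600 0.984 -95.0452",
    "S_0005_3837 2.718 -92.6346",
    "S_0006_9327 1.943 -91.5052",
]
print(is_converged(parse_rms_rescore(example), n_lowest=3))
```

For a real run, the lines would be read from name.rms.rescore.txt and the default of 10 models would be used. A plot of energy against RMSD remains useful, since a single number does not show the shape of the energy landscape.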

 

Number of models required

With the current method implemented in the CS-ROSETTA package, 5,000 to 20,000 predicted CS-ROSETTA models are generally required to obtain convergence. For small proteins (up to about 90-100 amino acids), 1,000 to 5,000 CS-ROSETTA models often suffice. ROSETTA takes about 5-10 minutes to calculate one all-atom model on a single 2.4GHz CPU.
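These figures translate directly into a compute budget: 5,000 models at 5 to 10 minutes each correspond to roughly 417 to 833 CPU hours. The Python sketch below performs this arithmetic. The number of CPUs is a hypothetical input, and the wall-clock estimate assumes ideal scaling over independent jobs, which a real cluster may not achieve.

```python
def cpu_hours(n_models, minutes_per_model):
    """Total CPU time in hours for n_models at the given per-model cost."""
    return n_models * minutes_per_model / 60.0


def wall_clock_hours(n_models, minutes_per_model, n_cpus):
    """Idealized wall-clock time when the models are spread evenly over n_cpus."""
    return cpu_hours(n_models, minutes_per_model) / n_cpus


for n_models in (1000, 5000, 20000):
    low = cpu_hours(n_models, 5)    # 5 minutes per model
    high = cpu_hours(n_models, 10)  # 10 minutes per model
    print(f"{n_models} models: {low:.0f} to {high:.0f} CPU hours")
```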


 

 How to use CS-ROSETTA for symmetric homomultimers

As described in the paper:
Simultaneous prediction of protein folding and docking at high resolution

Rhiju Das et al.,
Proc Natl Acad Sci USA, (2009) 106, 18978-18983

ROSETTA can now perform simultaneous prediction of protein folding and docking for symmetric homomultimers. CS-ROSETTA has been updated to provide an interface to this "fold-and-dock" feature (currently only available for CS-ROSETTA with ROSETTA 2.3.0; check here for how to run the "fold-and-dock" application in ROSETTA3.2.1) and to generate structures for (small) homodimers and homotetramers from their chemical shifts.

 

Fragment selection

The script runCSRjob.com has been updated to generate fragment inputs and prepare the ROSETTA running script for folding and docking a symmetric homomultimer. A newly added option $SYMM in the runCSRjob.com script must first be defined according to the symmetry of the homomultimer to be modeled:

set SYMM = "C2" # for a C2 homodimer
set SYMM = "C4" # for a C4 homotetramer
set SYMM = "D2" # for a D2 homotetramer

The output contains:

  • 9-residue and 3-residue fragment files $TAG$PDB_NAME$CHAIN09_05.200_v1_3 and $TAG$PDB_NAME$CHAIN03_05.200_v1_3

  • FASTA sequence file $PDB_NAME$CHAIN.fasta

  • ROSETTA paths definition file paths.txt

  • starting script file runRosetta.FnD.$SYMM.com for ROSETTA structure generation
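As a quick consistency check on the $SYMM setting, the Python sketch below validates a symmetry label against the three options listed above, reports the number of subunits it implies (a cyclic group Cn has n subunits, a dihedral group Dn has 2n), and builds the name of the starting script following the runRosetta.FnD.$SYMM.com pattern. The function names are hypothetical and the check is not part of runCSRjob.com.

```python
# Symmetry labels shown in the runCSRjob.com example above.
SUPPORTED_SYMM = ("C2", "C4", "D2")


def n_subunits(symm):
    """Number of identical chains implied by a supported symmetry label."""
    if symm not in SUPPORTED_SYMM:
        raise ValueError(f"unsupported SYMM value: {symm!r}")
    order = int(symm[1:])
    # Cyclic Cn: n subunits; dihedral Dn: 2n subunits.
    return order if symm[0] == "C" else 2 * order


def fnd_script_name(symm):
    """Name of the generated starting script for a supported symmetry label."""
    n_subunits(symm)  # raises ValueError for unsupported labels
    return f"runRosetta.FnD.{symm}.com"


for symm in SUPPORTED_SYMM:
    print(symm, n_subunits(symm), fnd_script_name(symm))
```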

 

Structure generation (runRosetta.FnD.$SYMM.com)

The script runRosetta.FnD.$SYMM.com generated in the previous step can be used to run ROSETTA 2.3.0 and generate structures for homomultimers.

The ROSETTA "folding-and-docking" procedure is more computationally demanding than structure generation for a monomeric protein (on the order of hours per model for small homomultimers).

 

The subsequent steps of chemical shift rescoring and model selection are similar to those of the CS-ROSETTA runs for a monomeric protein.


 

 How to use CS-ROSETTA with RDC data

RDC data can now be used together with the NMR chemical shifts to assist ROSETTA(3.x)/CS-ROSETTA structure generation. To run a chemical shift based CS-ROSETTA structure generation with additional RDC data, please refer to "How to use CS-ROSETTA (with Rosetta3.x)" and replace the ROSETTA3.x command line in the runRosetta3.com script with:

${rosetta3} -database ${rosetta3DB} -in::file::frag3 $FRAG3_NAME -in::file::frag9 
$FRAG9_NAME -in::file::fasta $FASTA_NAME -abinitio::use_filters false -increase_cycles 
10 -rsd_wt_helix 0.5 -rsd_wt_loop 0.5 -rg_reweight 0.5 -abinitio::fastrelax -score::weights score13_env_hb 
-in::file::rdc rdc_file 
-abinitio::stage2_patch rdc_patch
-abinitio::stage3a_patch rdc_patch
-abinitio::stage3b_patch rdc_patch
-abinitio::stage4_patch rdc_patch
-out::nstruct $N_MODEL -user_tag j001

(Note that the above command line must be written on a single line!)

where rdc_file is the raw RDC file (currently only backbone H-N RDCs can be used) in the following format:

   2  N    2  H  4.800
   3  N    3  H 10.220
   5  N    5  H 27.130
   6  N    6  H 21.608

rdc_patch is a one-line file that sets the RDC weight: "rdc = 1.0".
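When the RDCs are available as a list of (residue number, value) pairs, the two input files can be generated programmatically. The Python sketch below reproduces the column layout of the example rows above and the one-line patch content. The function names are hypothetical, and it is an assumption that each RDC refers to the N and H atoms of the same residue, as in the example.

```python
# Content of the one-line rdc_patch file, as given in the text.
RDC_PATCH_TEXT = "rdc = 1.0\n"


def format_rdc_line(residue, value):
    """Format one backbone H-N RDC in the layout of the example rdc_file."""
    return f"{residue:4d}  N {residue:4d}  H {value:6.3f}"


def rdc_file_text(rdcs):
    """Return the rdc_file content for an iterable of (residue, value) pairs."""
    return "".join(format_rdc_line(res, val) + "\n" for res, val in rdcs)


# Values copied from the example rdc_file above.
example_rdcs = [(2, 4.800), (3, 10.220), (5, 27.130), (6, 21.608)]
print(rdc_file_text(example_rdcs), end="")
```

The returned strings can then be written to files named rdc_file and rdc_patch in the working directory, matching the names used in the command line.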
 


 

 

FAQs
How to use known disulfide bond information in CS-ROSETTA?

Answer
To include known disulfide bond information in the CS-ROSETTA structure prediction, an additional option "-in:fix_disulf disulf.cst" needs to be added to the ROSETTA3.x command line in the runRosetta3.com script, where "disulf.cst" is a disulfide bond definition file (which must be located in the working directory unless its full path is provided). An example "disulf.cst" file defining two disulfide bonds (between residues Cys6 and Cys88 and between Cys38 and Cys68) is listed below:

6 88
38 68
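A mistyped residue number in this file is easy to overlook. The Python sketch below builds the file content from a list of residue pairs and, if a one-letter sequence is supplied, checks that both residues of each pair are cysteines. The function name is hypothetical, and it is an assumption that the residue numbers count from 1 along the FASTA sequence.

```python
def disulfide_file_text(pairs, sequence=None):
    """Return disulf.cst content, one 'i j' residue pair per line.

    If a one-letter sequence is given, each residue number (assumed to be
    1-based along that sequence) must point to a cysteine.
    """
    lines = []
    for i, j in pairs:
        if sequence is not None:
            for res in (i, j):
                if not 1 <= res <= len(sequence):
                    raise ValueError(f"residue {res} is outside the sequence")
                if sequence[res - 1] != "C":
                    raise ValueError(f"residue {res} is not a cysteine")
        lines.append(f"{i} {j}")
    return "\n".join(lines) + "\n"


# The two bonds from the example file above.
print(disulfide_file_text([(6, 88), (38, 68)]), end="")
```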
 

Why are no (or not enough) fragment candidates obtained after the MFR fragment search?

Answer

The current MFR program (implemented in NMRPipe) searches for fragment candidates with a default threshold of 3.0 for the chemical shift matching score (-csThresh 3.0), which works well in most cases. If the experimental chemical shifts in a given target fragment are highly unusual, no (or not enough) fragment candidates may be selected. When this happens, please first check the chemical shift referencing and make sure that the experimental chemical shifts in those "problematic" target fragments are correct. (The script runCSRjob.com lists the number of selected fragment candidates for each target fragment, and each "problematic" target fragment is listed with a warning message.)
Users can "force" the MFR program to generate fragment candidates by using a larger csThresh value, for example 5.0, as follows:

  1. copy script runCSRjob.com to current directory
  2. replace the line
    "mfr.com -cs ${CS_NAME} -ref ${REF_NAME} -out ${PDB_NAME}.mfr.tab -excl ${MFR_EXCL}>& ${PDB_NAME}.mfr.log"
    with
    "mfr.com -cs ${CS_NAME} -ref ${REF_NAME} -out ${PDB_NAME}.mfr.tab -excl ${MFR_EXCL} -csThresh 5.0>& ${PDB_NAME}.mfr.log"
  3. run runCSRjob.com inCS.tab to get fragment candidates

Note that fragment candidates that can only be obtained by an MFR fragment search with a larger csThresh value are often not accurate.

Using "H" instead of "HN" as the atom name for the amide proton in the experimental chemical shift table file is another cause of missing or insufficient fragment candidates (a bug in the current MFR module).

How to exclude protein(s) from the structural database during the fragment search?

Answer
To select fragment candidates using the MFR method:
  1. find the exact names of the proteins you would like to exclude in the file $CSROSETTA/PDBH/resolution.tab
  2. copy script runCSRjob.com to current directory
  3. find the line set MFR_EXCL = "NONE_", and replace NONE_ with the list of protein names
  4. run runCSRjob.com inCS.tab to get fragment candidates.

To select fragment candidates using the hybrid method:

  1. copy script runCSRjob_hybrid.com to current directory
  2. find the line make_fragments_2000.pl $SEQ_NAME >& make_fragments_2000.log, and replace it with make_fragments_2000.pl -nohoms $SEQ_NAME >& make_fragments_2000.log
  3. run runCSRjob_hybrid.com inCS.tab to get fragment candidates

 

How can identical results (ROSETTA models) caused by identical random seed numbers be avoided in parallel Rosetta2.x jobs?

Answer
If you run jobs in a cluster of parallel cpus, sometimes you will get exact same results. This is a general problem for Rosetta(2.x). The reason is that some cpus are starting from exactly the same random seed. One solution for this is to add -seed_offset followed by an int number to your command line, in this way, you can force each cpu to start from a different seed with that offset number. You can also try to run with -constant_seed -jran in multiple clusters, yes, the selection of the integers for the jran should be between 1 million and 4 million. The default value is 1111111. You can certainly choose different values if you want to submit jobs to different clusters.
- from "Rosetta Commons Forum" [link]
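Following the second suggestion in the quoted answer, each parallel job can be given its own -jran value within the stated range of 1 million to 4 million. The Python sketch below generates such values, starting from the quoted default of 1111111. The function names and the fixed step between seeds are assumptions made for illustration; the sketch only guarantees that the values are distinct and within range.

```python
# Range quoted in the forum answer above.
JRAN_MIN, JRAN_MAX = 1_000_000, 4_000_000


def jran_values(n_jobs, start=1_111_111, step=1):
    """Return n_jobs distinct seed values within [JRAN_MIN, JRAN_MAX]."""
    if step < 1:
        raise ValueError("step must be at least 1 so that seeds are distinct")
    values = [start + k * step for k in range(n_jobs)]
    if values and (values[0] < JRAN_MIN or values[-1] > JRAN_MAX):
        raise ValueError("seed values fall outside the allowed range")
    return values


def seed_options(jran):
    """Command-line fragment that fixes the random seed of one job."""
    return f"-constant_seed -jran {jran}"


for jran in jran_values(3):
    print(seed_options(jran))
```

The resulting fragment is appended to the ROSETTA2.x command line of the corresponding job.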

 




last update: Feb 11 2013 / Webmaster