CS-ROSETTA: System for chemical shifts based protein structure prediction using ROSETTA

How to Get CS-ROSETTA Package

A stable version of the CS-ROSETTA software package (version 2.x) can be downloaded below. This version is implemented differently from the previous version (see here) in order to apply recent patches and to include new developments. By downloading software from this website, you agree to our Terms of Use, including the terms that there is no right to privacy on this system and that software from this website is not to be redistributed without permission from the authors. The CS-ROSETTA package provides binaries for the linux9 and mac hardware and OS versions (see here for the definition of these versions by the NMRPipe system). It requires multiple Unix programs and other external software packages in order to use all of its features and to run efficiently (see details below).

The most common CS-ROSETTA installation procedure in a Unix environment (linux9 and mac) involves:

Create a directory for the CS-ROSETTA installation [for example, type mkdir /disk1/CSROSETTA in an "xterm" terminal window].

Go to the selected install directory [cd /disk1/CSROSETTA].

Download and store the CS-ROSETTA installation files (csrosetta.tZ, install.com) in the selected install directory:

Via a web browser: Right-click the download links below and select "Save Target As", "Save Link As" or "Download Linked File (As)" (depending on the browser type), and save the files into the selected install directory (Be sure to retain the exact file name shown below).

Or via the unix command "wget":
wget https://spin.niddk.nih.gov/bax-apps/software/CSROSETTA/csrosetta.tZ

Uncompress the package csrosetta.tZ:

tar -zxvf csrosetta.tZ

This will generate a "stand-alone" initialization script, csrosetta_init.com, which stores all required and optional environment variables for running the program. The script also recommends a common way to apply the initialization, i.e., adding the following lines to the ~/.cshrc file:
if (-e /disk1/CSROSETTA/csrosetta_init.com) then
   source /disk1/CSROSETTA/csrosetta_init.com
endif

Note: The environment variables in the csrosetta_init.com script must be entered and inspected manually by the user before running the program. Please check here for a full list of the environment variables defined in this initialization script.

There is also a web-based version of CS-ROSETTA, which can be used directly without installing CS-ROSETTA. However, due to our limited computing resources, the most time-consuming step, the Rosetta structure generation, is not provided by our server. The CS-ROSETTA server provides all required inputs and scripts for running the CS-ROSETTA structure generation; users have to run the ROSETTA structure generation on their own. You can access this web-based system, along with other facilities for manipulating chemical shifts, dipolar couplings, and molecular structures, at the Bax Group NMR Server site:

CS-ROSETTA Installation Files

CS-ROSETTA Web Server

(Version 2.01 Rev 2019.06)
csrosetta.tZ [size: 2MB]

(2020 NMRBox Summer Workshop Examples)
csRosetta_demos.tar.Z [size: 34MB]

Other Software Programs

In order to run CS-ROSETTA structure generation procedure, multiple external software programs are needed, which are listed below:

Software: TALOS-N (required to check and prepare inputs for CS-ROSETTA)
Note: TALOS-N provides its outputs as the required inputs for generating fragment candidates.
URL: https://spin.niddk.nih.gov/bax-apps/software/TALOS-N/

Software: ROSETTA (required to prepare and run CS-ROSETTA)
Note: Only Rosetta 3.5 and later versions are supported by the current CS-Rosetta package. Also see here for a tutorial on how to install and use the Rosetta program.
URL: https://www.rosettacommons.org/

Software: BLAST (required to prepare de novo fragment candidates for CS-ROSETTA)
Note: BLAST generates the amino acid sequence profile from sequence alignments.
Note 1: The C++ version, BLAST+, is not currently supported; please use the legacy versions.
Note 2: The nr database is recommended.
URL: https://ftp.ncbi.nlm.nih.gov/blast/executables/
URL: https://ftp.ncbi.nlm.nih.gov/blast/db/

Contents

What is CS-ROSETTA?
Components of the CS-ROSETTA System
How to Use CS-ROSETTA
Chemical Shift Input Format Used by CS-ROSETTA
How to Select Consistent Predictions in CS-ROSETTA

About CS-ROSETTA

To date, interpretation of isotropic chemical shifts in structural terms is largely based on empirical correlations gained from the mining of protein chemical shifts deposited in the BMRB, in conjunction with the known corresponding 3D structures. Chemical-Shift-ROSETTA (CS-ROSETTA) is a robust protocol to exploit this relation for de novo protein structure generation, using as input parameters the ¹³C^α, ¹³C^β, ¹³C', ¹⁵N, ¹H^α and ¹H^N NMR chemical shifts. These shifts are generally available at the early stage of the traditional NMR structure determination procedure, prior to the collection and analysis of structural restraints. The CS-ROSETTA approach, as shown below, utilizes SPARTA-based selection of protein fragments from the PDB, in conjunction with a regular ROSETTA Monte Carlo assembly and relaxation procedure. Evaluation of 16 proteins, varying in size from 56 to 129 residues, yielded full-atom models that deviate by 0.7-1.8 Å backbone rmsd from the experimentally determined X-ray or NMR structures. The strategy has also been successfully applied in a blind manner to a set of structural genomics targets with molecular weights up to 16 kDa, whose conventional NMR structure determination was conducted in parallel.


CS-ROSETTA Flowchart

Components of the CS-ROSETTA System

The CS-ROSETTA core system is implemented in the C++ language. In addition, multiple Unix shell scripts are provided in the CS-ROSETTA package to evaluate and prepare the inputs, analyze the output, and so on. The files, scripts, and directories of the CS-ROSETTA system include:

csrosetta

Master script to run CS-ROSETTA. The allowed options are:

# Input options
`-in`	[none]	Input Chemical Shift Table.
`-pdb`	[none]	Reference PDB Structure Input.
`-noe`	[none]	NOE Restraint Input File.
`-rdc`	[none]	RDC Restraint Input File.
`-rdc2`	[none]	RDC Restraint Input File (from medium #2).
`-rdc3`	[none]	RDC Restraint Input File (from medium #3).
`-keepTails`	[default]	To Keep Flexible Tails.
`-trimTails`		To Trim Flexible Tails and Use Trimmed As Input.
`-selRes2Trim`	[none]	List of Specified Tail Residues to be Trimmed.
# Chemical Shift (CS) Quality Check and Correction:
`-offset`	[default]	Apply CS Offset Correction If Needed.
`-iso`		Apply 2H Isotope Correction to CA/CB Shifts.
# Database and Homology Options:
`-db`	[PDB]	List of PDB Database (default) or .pdb Files to be Searched.
`-excl`	[none]	List of Proteins (PDB Identifiers) to be Excluded.
`-maxSeqID`	[100]	Max Sequence Identity to be Searched.
# Fragment Searching Options:
`-wCS`	[1.5]	Weight for Chemical Shift.
`-wAA`	[1.5]	Weight for Amino Acid Sequence.
`-wRama`	[1.0]	Weight for Amino Acid Ramachandran Distribution.
`-wSS`	[0.25]	Weight for Secondary Structure.
`-wPhiPsi`	[0.15]	Weight for Phi/Psi Torsion Angles.
`-count`	[200]	Number of Selected Fragment Candidates.
# ROSETTA Options:
`-standard`	[true]	Generate CS-Rosetta Scripts and Inputs (default).
`-fold_dock`	[false]	Generate CS-Rosetta Inputs for Fold_Dock Protein Assemblies.
`-symDock`	[false]	Generate CS-Rosetta Inputs for Symm_Dock Protein Assemblies.
`-symm`	[C2]	Symmetry for Fold_Dock or Symm_Dock.
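
As a rough illustration of how the options above combine, the following sketch assembles a csrosetta command line from a handful of them. Only option names from the table are used; the helper name `build_csrosetta_command`, the input file name `myshifts.tab`, and the assumption that value-taking options are followed by a space-separated value are illustrative and should be checked against the script's own help output.

```python
import shlex


def build_csrosetta_command(shift_table, noe=None, trim_tails=False,
                            count=None, max_seq_id=None):
    """Assemble an argument list for the csrosetta master script.

    Assumption: options that take a value are written as "-name value".
    """
    args = ["csrosetta", "-in", shift_table]
    if noe is not None:
        args += ["-noe", noe]            # NOE restraint input file
    if trim_tails:
        args.append("-trimTails")        # trim flexible tails
    if count is not None:
        args += ["-count", str(count)]   # number of fragment candidates
    if max_seq_id is not None:
        args += ["-maxSeqID", str(max_seq_id)]
    return args


# Hypothetical input file name, used only for illustration.
cmd = build_csrosetta_command("myshifts.tab", trim_tails=True, count=200)
print(shlex.join(cmd))
```

Building the command as a list keeps each option and its value together, which makes it easy to pass to a process launcher or to log exactly what was run.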

csrosetta_init.com

Initialization script that defines all required and optional environment variables for running the CS-ROSETTA master script csrosetta (must be filled in by the user after installation).

bin/: directory for all compiled binary files for Linux (*.linux9, *.static.linux9) and MacOS (*.mac).

scripts/: directory for all required utility scripts of CS-ROSETTA.

demo/: directory with example chemical shift input data and scripts for a demo of CS-ROSETTA.


NMR Input Data for CS-ROSETTA

The CS-ROSETTA system is designed to prepare and run CS-ROSETTA structure generation using mainly the backbone and ¹³C^β chemical shifts. To use these features, users need to follow the procedures below to properly prepare and inspect their NMR input data.

Chemical Shift Data Format and Requirements
CS-ROSETTA requires an input chemical shift table in the standard NMRPipe/TALOS format. An example portion of the required chemical shift table format is shown below (full example: ubiq.tab). Other examples can be found in the CSROSETTA/demo directory, or at the CS-ROSETTA Server site.

Requirements of the chemical shift table format:

The chemical shift input file for CS-ROSETTA uses the general-purpose NMRPipe table format.
All ¹³C chemical shifts (including ¹³C^α, ¹³C^β, and ¹³C') should be referenced relative to the methyl groups of 4,4-dimethyl-4-silapentane-1-sulfonic acid, or DSS. The ¹⁵N chemical shifts should be referenced relative to liquid ammonia at 25 degrees C.
Use the optional DATA FIRST_RESID line to specify the first residue ID number of the sequence. If it is not specified, residue numbering is assumed to begin at 1.
The protein sequence should be given as shown, using one or more DATA SEQUENCE lines. Space characters in the sequence will be ignored. Use c for oxidized CYS (C^β ~ 42.5 ppm) and C for reduced CYS (C^β ~ 28 ppm), h for protonated HIS and H for deprotonated HIS, in both the sequence header and the shift table. Use X for residues other than the usual 20 amino acids.
The data table must include columns for residue ID (listed as RESID in the VARS header), one-character residue name (RESNAME), atom name (ATOMNAME), and chemical shift (SHIFT).
The table must include a "VARS" line which labels the corresponding columns of the data table.
The table must include a "FORMAT" line which defines the data type of the corresponding columns of the table.

Atom names are always given exactly as:

HA	for Hα of all residues except glycine
HA2	for the first Hα of glycine residues
HA3	for the second Hα
C	for C' (CO)
CA	for Cα
CB	for Cβ
N	for N-amide
HN	for H-amide

As noted, there is an exception for naming Gly assignments, which should use HA2 and HA3 instead of HA. In the case of Gly HA2/HA3 assignments, POMONA will use the average value of the two, so it is not necessary to have these assigned stereospecifically; for use of POMONA, the assignment can be arbitrary. Note, however, that the assignment must be given exactly as either "HA2" or "HA3" rather than "HA2|HA3" etc.
Other types of assignments may be present in the chemical shift table; they will be ignored.
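
The averaging of the two glycine Hα shifts described above can be sketched as follows. This is an illustration only, not the package's own code: the function name `glycine_ha_shift` and the numeric values are hypothetical, and returning the single available value when only one of HA2/HA3 is assigned is an assumption.

```python
def glycine_ha_shift(shifts):
    """Return one H-alpha value for a glycine residue.

    shifts: dict mapping atom name to chemical shift (ppm) for one residue.
    The two protons are averaged, so their stereospecific assignment
    does not matter. Assumption: if only one is present, it is used as is.
    """
    values = [shifts[name] for name in ("HA2", "HA3") if name in shifts]
    if not values:
        return None
    return sum(values) / len(values)


# Hypothetical glycine assignments (ppm), for illustration only.
gly = {"HA2": 3.90, "HA3": 4.10, "CA": 45.2}
print(glycine_ha_shift(gly))
```

Swapping the HA2 and HA3 values leaves the result unchanged, which is why the assignment of the two protons can be arbitrary.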

Example NMRPipe/TALOS format chemical shift table (excerpt):

   REMARK Ubiquitin

   DATA FIRST_RESID 1

   DATA SEQUENCE MQIFVKTLTG KTITLEVEPS DTIENVKAKI QDKEGIPPDQ QRLIFAGKQL
   DATA SEQUENCE EDGRTLSDYN IQKESTLHLV LRLRGG

   VARS   RESID RESNAME ATOMNAME SHIFT
   FORMAT %4d   %1s     %4s      %8.3f

     1 M           HA                  4.23
     1 M           C                 170.54
     1 M           CA                 54.45
     1 M           CB                 33.27
     2 Q           HN                  8.90
     2 Q           N                 123.22
     2 Q           HA                  5.25
     2 Q           C                 175.92
     2 Q           CA                 55.08
     2 Q           CB                 30.76
   ...
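
To make the layout of this table concrete, here is a minimal Python sketch that reads such a file. It is not part of the CS-ROSETTA package; the function name `parse_shift_table` is hypothetical, and the sketch assumes whitespace-separated columns named exactly as in the VARS line above.

```python
EXAMPLE = """\
   REMARK Ubiquitin

   DATA FIRST_RESID 1

   DATA SEQUENCE MQIFVKTLTG KTITLEVEPS DTIENVKAKI QDKEGIPPDQ QRLIFAGKQL
   DATA SEQUENCE EDGRTLSDYN IQKESTLHLV LRLRGG

   VARS   RESID RESNAME ATOMNAME SHIFT
   FORMAT %4d   %1s     %4s      %8.3f

     1 M           HA                  4.23
     1 M           C                 170.54
     2 Q           HN                  8.90
   ...
"""


def parse_shift_table(text):
    """Return (first_resid, sequence, rows) from a TALOS-style shift table.

    rows is a list of (resid, resname, atomname, shift) tuples.
    """
    first_resid = 1      # numbering starts at 1 unless FIRST_RESID is given
    sequence = ""
    columns = []
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] in ("REMARK", "FORMAT"):
            continue
        if fields[0] == "DATA":
            if len(fields) >= 3 and fields[1] == "FIRST_RESID":
                first_resid = int(fields[2])
            elif len(fields) >= 3 and fields[1] == "SEQUENCE":
                # Spaces inside the sequence are ignored.
                sequence += "".join(fields[2:])
            continue
        if fields[0] == "VARS":
            columns = fields[1:]
            continue
        if not columns or len(fields) != len(columns):
            continue     # skips "..." and other non-data lines
        record = dict(zip(columns, fields))
        rows.append((int(record["RESID"]), record["RESNAME"],
                     record["ATOMNAME"], float(record["SHIFT"])))
    return first_resid, sequence, rows


first_resid, sequence, rows = parse_shift_table(EXAMPLE)
print(first_resid, len(sequence), rows[0])
```

Reading the column order from the VARS line, rather than hard-coding positions, keeps the sketch tolerant of tables that carry extra columns in a different order.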

CS-ROSETTA can also use chemical shift input in the BMRB NMR-Star format. Two Unix shell conversion scripts, bmrb2talos_v21.com and bmrb2talos_v31.com, are included with the POMONA package and can be used to convert an NMR-Star format (V2.1 and V3.1, respectively) chemical shift table to TALOS format. Example command lines for using these scripts are:

bmrb2talos_v21.com bmrb_v21.str > inCS.tab
bmrb2talos_v31.com bmrb_v31.str

See also here for more details regarding the NMRPipe/TALOS format and NMRStar format.

NOE Constraint Data Format

Use of NOE constraints in CS-Rosetta is limited to the Rosetta structure generation step; therefore, the NOE constraints must be prepared in a Rosetta-compatible format (see here for all formats allowed by Rosetta). In summary, a general format for NOE constraints can be defined by AtomPair, such as:

#AtomPair: Atom1_Name Atom1_ResNum Atom2_Name Atom2_ResNum Func_Type Func_Def
AtomPair     H      3     H   112 BOUNDED 1.500 2.910 0.300
AtomPair     H      7     H   108 BOUNDED 1.500 2.720 0.300
AtomPair     H      9     H   106 BOUNDED 1.500 3.070 0.300
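
A short sketch of how AtomPair lines like those above could be written from a list of distance bounds. The helper `atom_pair_line` is hypothetical, and the three numbers after BOUNDED are read here as lower bound, upper bound, and standard deviation, which is an assumption to verify against the Rosetta constraint file documentation.

```python
def atom_pair_line(atom1, res1, atom2, res2, lower, upper, sd):
    """Format one AtomPair constraint with a BOUNDED function.

    Assumption: the BOUNDED parameters are lower bound, upper bound
    and standard deviation, in that order (distances in Angstrom).
    """
    return (f"AtomPair {atom1:>5} {res1:>6} {atom2:>5} {res2:>5} "
            f"BOUNDED {lower:.3f} {upper:.3f} {sd:.3f}")


# Bounds taken from the example lines above.
pairs = [("H", 3, "H", 112, 1.5, 2.91, 0.3),
         ("H", 7, "H", 108, 1.5, 2.72, 0.3)]
for pair in pairs:
    print(atom_pair_line(*pair))
```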

Ambiguous NOE constraints can be defined by AmbiguousNMRDistance, such as:

#AmbiguousNMRDistance: Atom1_Name Atom1_ResNum Atom2_Name Atom2_ResNum Func_Type Func_Def
# Ambiguous Distance between Atom1 and Atom2. The difference from AtomPair Constraint is that 
# atom names are specially parsed to detect ambiguous hydrogens, which are either experimentally 
# ambiguous or rotationally identical (like methyl hydrogens). The constraint applies to any 
# hydrogens equivalent to the named hydrogen. The logic for determining which hydrogens are which 
# is in src/core/scoring/constraints/AmbiguousNMRDistanceConstraints.cc:parse_NMR_name

RDC Constraint Data Format

Use of RDCs in CS-Rosetta is also limited to the Rosetta structure generation step; therefore, the RDC restraints must be prepared in a Rosetta-compatible format, such as:

   2  N    2  H  4.800
   3  N    3  H 10.220
   5  N    5  H 27.130
   6  N    6  H 21.608
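
The five columns of the RDC lines above can be read with a few lines of Python. This sketch assumes the columns are residue number, atom name, residue number, atom name, and coupling value; the function name `parse_rdc_lines` is hypothetical.

```python
RDC_EXAMPLE = """\
   2  N    2  H  4.800
   3  N    3  H 10.220
   5  N    5  H 27.130
   6  N    6  H 21.608
"""


def parse_rdc_lines(text):
    """Return a list of (res1, atom1, res2, atom2, value) tuples.

    Assumption: five whitespace-separated columns per line; lines with a
    different number of fields are skipped.
    """
    couplings = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 5:
            continue
        res1, atom1, res2, atom2, value = fields
        couplings.append((int(res1), atom1, int(res2), atom2, float(value)))
    return couplings


rdcs = parse_rdc_lines(RDC_EXAMPLE)
print(len(rdcs), rdcs[0])
```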

Chemical Shift Data Inspection

Because chemical shifts are the major input to the CS-ROSETTA approach, their quality is critical to achieving the expected performance. The pre-check module from the TALOS-N/TALOS+ program can be used to inspect the quality of the chemical shift inputs:

TALOS-N/TALOS+ can identify possible referencing problems with the ¹³C^α, ¹³C^β, ¹³C' and ¹H^α chemical shift inputs, as well as possible chemical shift outliers, when a typical TALOS-N/TALOS+ command is run with the additional -check option, for example:

talosn -in inCS.tab -check

This module first converts the chemical shifts of each residue to secondary chemical shifts, and subsequently evaluates these by correlating ¹³C^α, ¹³C^β, ¹³C' and ¹H^α to the reference-free entity, ¹³C^α-¹³C^β. The estimated chemical shift referencing offsets, as well as their corresponding fitting errors, will be printed for ¹³C^α, ¹³C^β, ¹³C' and ¹H^α. The pre-check module will also identify residues with unusual chemical shifts, for which the secondary chemical shifts fall outside the expected range. An example of the output of this module is:

   Chemical shift outlier checking...
     ...
     64 E CB Secondary Shift: -3.800 Limit: -3.765
     76 G  C Secondary Shift:  4.250 Limit:  1.925 !

   Chemical shift referencing checking...
      Estimated Referencing Offset for CA/CB: 0.795 +/- 0.104 ppm (Size: 66)

Note that:

An offset correction is generally only needed when the estimated referencing offset exceeds the average fitting error by more than about five standard deviations. To apply the offset correction, the script applyOffsetCorrection.com included in the POMONA package can be used with the following syntax:

applyOffsetCorrection.com inCS.tab
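
As a worked example of this rule of thumb, the sketch below reads the criterion as "the absolute offset is more than about five times the reported fitting error". That reading, and the function name `needs_offset_correction`, are assumptions made for illustration.

```python
def needs_offset_correction(offset_ppm, error_ppm, factor=5.0):
    """Apply the rule of thumb: correct only if |offset| > factor * error."""
    return abs(offset_ppm) > factor * error_ppm


# Values from the example output above: 0.795 +/- 0.104 ppm for CA/CB.
# The threshold is 5 * 0.104 = 0.52 ppm, which 0.795 ppm exceeds.
print(needs_offset_correction(0.795, 0.104))  # prints True
```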

Chemical shift outliers, especially those with highly unusual chemical shifts, whose secondary chemical shifts deviate from the expected range by more than twice the normal range of secondary chemical shifts, may correspond to experimental errors and need to be inspected carefully before use. For example, in the output shown above, the identified chemical shift outlier for residue 76 corresponds to a C-terminal carboxylate rather than a backbone carbonyl.
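
The following sketch picks the more severe outliers out of a report in the format of the example output. It interprets "more than twice the normal range" as the absolute secondary shift exceeding twice the absolute printed limit; this interpretation, the regular expression, and the function name `severe_outliers` are assumptions rather than the TALOS-N implementation.

```python
import re

REPORT = """\
     64 E CB Secondary Shift: -3.800 Limit: -3.765
     76 G  C Secondary Shift:  4.250 Limit:  1.925 !
"""

# Matches lines such as "64 E CB Secondary Shift: -3.800 Limit: -3.765".
OUTLIER_LINE = re.compile(
    r"^\s*(\d+)\s+(\w)\s+(\w+)\s+Secondary Shift:\s*(-?\d+\.\d+)"
    r"\s+Limit:\s*(-?\d+\.\d+)"
)


def severe_outliers(report, ratio=2.0):
    """Return (resid, resname, atom, shift, limit) for severe outliers.

    Assumption: "severe" means |secondary shift| > ratio * |limit|.
    """
    hits = []
    for line in report.splitlines():
        match = OUTLIER_LINE.match(line)
        if match is None:
            continue
        resid, resname, atom, shift, limit = match.groups()
        shift, limit = float(shift), float(limit)
        if abs(shift) > ratio * abs(limit):
            hits.append((int(resid), resname, atom, shift, limit))
    return hits


print(severe_outliers(REPORT))
```

Under this reading, only the carbonyl shift of residue 76 is reported, which coincides with the line marked "!" in the example output.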

²H isotope correction for ¹³C^α/¹³C^β chemical shifts (Maltsev et al., J. Biomol. NMR, 2012, 54, 181-191) is also required for chemical shifts measured on perdeuterated protein samples. To do this, the script applyIsotopeCorrection2CACB.com included in the POMONA package can be used with the following syntax:

applyIsotopeCorrection2CACB.com inCS.tab

Note that the scripts applyOffsetCorrection.com and applyIsotopeCorrection2CACB.com apply the corrections to the original chemical shift input file, while the original input file is renamed with a .orig suffix.

CS-Rosetta has a default option (-offset) to check the referencing offset and apply any needed correction to the chemical shift input, as well as an option (-iso) to apply the ²H isotope correction on the fly. However, users are still advised to properly prepare and carefully inspect their chemical shifts before using them as input to CS-Rosetta.

Handling Flexible Tails/Loops


How to Use CS-ROSETTA

For a query protein with known backbone and ¹³C^β chemical shifts, CS-ROSETTA is designed for (1) searching the selected protein structural database for the best matched 3-residue and 9-residue protein fragments, (2) running a ROSETTA structure generation procedure, and (3) evaluating and selecting the generated Rosetta structures. These steps can be performed by using the master script csrosetta and the scripts generated by it. The most common procedure in a Unix environment involves:

Create a directory for the prediction session; all subsequent commands will be executed from this directory.
Prepare the input table of chemical shift assignments (for example "myshifts.tab") with a proper format (see the previous section); please also carefully inspect the chemical shifts for the possible referencing offset and outliers.
Run CS-Rosetta master script csrosetta. Most commonly, this will simply require a command line such as:
```
csrosetta -in myshifts.tab
```
This will perform fragment generation and prepare the inputs and scripts for running the Rosetta structure generation and the structure analysis; the details are listed below:

I. Protein Fragments Generation

Originally, the MFR program was used by CS-Rosetta to find matched short fragments in a selected structural database. This step is now performed in CS-Rosetta using a newer fragment picker integrated in the Rosetta Software Suite (Version 3.0 and newer) (see ref). To run the fragment search, the CS-Rosetta master script uses a script based on the Rosetta 3.0 fragment picker,

runFragPick_Rosetta3.com

which follows the procedure listed below:

TALOS+ or TALOS-N prediction is first performed to predict various structural factors, such as the backbone torsion angles (pred.tab) and the secondary structure (predSS.tab), which are used as additional inputs for the subsequent fragment picking step.
The script makeBlastCheckpoint.com is then executed; it runs the PSI-BLAST program to generate the amino acid sequence profile information required as input for the Rosetta fragment picking procedure. It uses a FASTA sequence file t000_.fasta as input and generates a sequence profile file t000_.checkpoint and a homology file t000_.homologs.

A Rosetta command file, flags, used to run chemical shift based fragment picking,

-database                 /Rosetta/database
-in:file:vall             /Rosetta/vall.apr24.2008.extended
-frags:n_frags            200
-frags:frag_sizes         3 9
-frags:scoring:config     scores.cfg
-in:file:checkpoint       t000_.checkpoint
-frags:denied_pdb         t000_.homologs
-in:file:talos_cs         ubiq.tab
-in:file:fasta            t000_.fasta
-frags:ss_pred            predSS.tab talos
-in:file:talos_phi_psi    pred.tab
-frags:sigmoid_cs_A       2
-frags:sigmoid_cs_B       4

-frags:describe_fragments frags.fsc.score
-out:file:frag_prefix     t000_

and a Rosetta scoring definition file, scores.cfg, which defines the weights and priority of the various inputs for chemical shift based fragment picking,

# score name       priority    wght   min_allowed  extras
CSScore             400        1.5            -          
ProfileScoreL1      300        1.5            -          
TalosSSSimilarity   200        0.25           -    talos 
RamaScore           100        1              -    talos 
PhiPsiSquareWell     50        0.15           -

are generated. Rosetta fragment picking is then performed to generate two sets of de novo fragments, t000_.200.3mers.gz and t000_.200.9mers.gz.
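
The default weights of the csrosetta fragment searching options (-wCS, -wAA, -wRama, -wSS, -wPhiPsi) match the weights in the scores.cfg file shown above. The sketch below writes such a file from those five weights. The mapping of each option to a score name is inferred from the matching default values and is an assumption, as is the function name `scores_cfg`; the script shipped with the package remains the reference.

```python
def scores_cfg(w_cs=1.5, w_aa=1.5, w_rama=1.0, w_ss=0.25, w_phipsi=0.15):
    """Return the text of a fragment picker scoring file.

    Assumed mapping: -wCS -> CSScore, -wAA -> ProfileScoreL1,
    -wSS -> TalosSSSimilarity, -wRama -> RamaScore,
    -wPhiPsi -> PhiPsiSquareWell. Priorities follow the example file.
    """
    rows = [
        ("CSScore", 400, w_cs, ""),
        ("ProfileScoreL1", 300, w_aa, ""),
        ("TalosSSSimilarity", 200, w_ss, "talos"),
        ("RamaScore", 100, w_rama, "talos"),
        ("PhiPsiSquareWell", 50, w_phipsi, ""),
    ]
    lines = ["# score name       priority    wght   min_allowed  extras"]
    for name, priority, weight, extras in rows:
        line = f"{name:<18} {priority:>4}   {weight:>8g}   {'-':>10}    {extras}"
        lines.append(line.rstrip())
    return "\n".join(lines) + "\n"


cfg = scores_cfg()
print(cfg)
```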

After fragment generation, the files and scripts required to run the subsequent CS-ROSETTA structure generation procedure are generated and stored in the directory csRosetta, which includes:

csRosetta/

More under construction ...

CS-ROSETTA Server

The CS-ROSETTA Server can be used to generate the fragments and all required inputs for running the CS-ROSETTA structure generation. Users submit their input chemical shift file to the server, and the server sends the results back via email.

II. CS-ROSETTA Structure Generation
To perform the Rosetta structure modeling, users need to run the generated script runCSRosetta in the runCSRosetta directory. Note that the script runCSRosetta generally requires manual modification of the Rosetta installation environment defined at the beginning of the script, or for running parallel jobs on computing clusters.

Script runCSRosetta

#!/bin/csh
#

# ====== ***PLEASE verify & modify below definitions*** ======
set rosettaBinDir = /Your_Rosetta_Directory/main/source/bin/
set rosettaBin    = $rosettaBinDir/minirosetta.default.linuxgccrelease
set rosettaMPIBin = $rosettaBinDir/minirosetta.mpi.linuxgccrelease
set rosettaDB     = /Your_Rosetta_Directory/main/database
# ===========================================================

$rosettaBin @flags -database $rosettaDB -out:file:silent default.out -out:file:scorefile default.sc

#MPI command
#mpirun -np 10 $rosettaMPIBin @flags -database $rosettaDB -out:file:silent default.out -out:file:scorefile default.sc

After the CS-ROSETTA structure generation job is done, users need to run the analysis script analyzeCSRosetta. A directory ExtractedPDBs will be generated, which includes:

ExtractedPDBs/

More under construction ...


How to Select Consistent CS-ROSETTA Predictions
Criteria for convergence and accepting models

After finishing CS-ROSETTA structure generation, users have to decide whether the ROSETTA models are acceptable. For this purpose, it is convenient to plot the "landscape" of (re-scored) ROSETTA full-atom energies of all models with respect to their C^α-RMSD values relative to the lowest-energy model, using the data stored in a file "name.rms.rescore.txt".

Converged:

If the 10 lowest energy models all differ by less than 2 Å C^α-RMSD from the model with the lowest (re-scored) energy (see the following example plot from the structure prediction of protein GB3), the structure prediction is deemed successful and the 10 lowest energy models are accepted. Although results where clustering around the lowest energy structure is less tight than 2 Å may still be useful for further analysis, such results should not be over-interpreted and could be in error.

Convergence Plot for Protein gb3

Not Converged:

If no clustering around low energy models is observed (see the following example plot generated for protein nsp1), the structure prediction has not converged and the low energy models cannot be accepted at this stage.

Convergence Plot for Protein nsp1
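
The convergence criterion described above can be sketched as a small check over (energy, C^α-RMSD) pairs. The function name `is_converged` and the synthetic numbers are illustrative only; the layout of the "name.rms.rescore.txt" file is not assumed here, so the values are passed in as a plain list.

```python
def is_converged(models, n_lowest=10, cutoff=2.0):
    """Check the 2 Angstrom convergence criterion.

    models: list of (energy, ca_rmsd) pairs, where ca_rmsd is the
    C-alpha RMSD to the lowest-energy model (0.0 for that model itself).
    Returns True if the n_lowest lowest-energy models all lie within
    cutoff Angstrom of the lowest-energy model.
    """
    if len(models) < n_lowest:
        return False
    lowest = sorted(models, key=lambda model: model[0])[:n_lowest]
    return all(rmsd < cutoff for _, rmsd in lowest)


# Synthetic landscapes, for illustration only.
funnel = [(-200.0 + i, 0.15 * i) for i in range(50)]
scattered = [(-200.0, 0.0)] + [(-199.0 + i, 5.0 + i) for i in range(20)]
print(is_converged(funnel), is_converged(scattered))
```

A result of False does not by itself mean the models are wrong; as noted above, looser clustering may still be useful for further analysis but should not be over-interpreted.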

Number of models required

Using the current method implemented in the CS-ROSETTA package, 5,000 to 20,000 predicted CS-ROSETTA models are generally required to obtain convergence. For small proteins (<= 90-100 amino acids), 1,000 to 5,000 CS-ROSETTA models often suffice. ROSETTA takes about 5-10 minutes to calculate one all-atom model on a single 2.4 GHz CPU. The number of Rosetta models to be generated can be specified by modifying the -nstruct option listed in the flags file.
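
These figures translate into a rough compute budget. The sketch below is simple arithmetic under the assumption of perfect parallel scaling; the function names are hypothetical, and the example of 10 processes only echoes the -np 10 value in the commented MPI line of the runCSRosetta script.

```python
def cpu_hours(n_models, minutes_per_model):
    """Total single-CPU time in hours for n_models models."""
    return n_models * minutes_per_model / 60.0


def wall_clock_hours(n_models, minutes_per_model, n_cpus):
    """Elapsed time in hours, assuming perfect scaling over n_cpus."""
    return cpu_hours(n_models, minutes_per_model) / n_cpus


# Lower and upper ends of the ranges quoted above.
low = cpu_hours(5000, 5)       # about 417 CPU hours
high = cpu_hours(20000, 10)    # about 3333 CPU hours
print(round(low), round(high))
print(wall_clock_hours(6000, 10, 10))
```

Even the low end corresponds to more than two weeks on a single CPU, which is why the structure generation step is normally run on a cluster.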

last update: Jun 6 2020 / sy

From the Bax Group at the National Institutes of Health
Web: https://spin.niddk.nih.gov/bax-apps/software/CSROSETTA
Server: https://spin.niddk.nih.gov/bax-apps/nmrserver/csrosetta
Tutorial: 2020 NMRbox Summer Workshop
Contact: ShenYang@niddk.nih.gov, Bax@nih.gov

References:
Yang Shen, Oliver Lange, Frank Delaglio, et al. Consistent blind protein structure generation from NMR chemical shift data. Proc Natl Acad Sci USA (2008) 105, 4685-4690.
Yang Shen, Robert Vernon, David Baker and Ad Bax. De novo protein structure generation from incomplete chemical shift assignments. J Biomol NMR (2009) 43, 63-78.