NMRPipe Table Format
Many facilities in NMRPipe use input or produce output in the form of a multi-column text table. Examples include peak tables, chemical shift tables, and tables of dipolar couplings. NMRPipe has its own general-purpose table format, along with several tools for manipulating such tables. In NMRPipe documentation, this format is sometimes referred to as the Generic Database format, or GDB.
The NMRPipe GDB Table format is a text file which contains header information and one or more columns of data:
Each table includes a VARS line and a FORMAT line. These can be at any location in the table, but are usually together near the top of the table, before any columns of data.
The VARS line labels the column names (variables). The names are case-sensitive, but are usually all-upper-case.
The FORMAT line uses format specifiers to define the type of data in a given column. The format specifiers also define the format and precision used if the table values are written by an application to a new table, or if the table values are manipulated in a script.
An example NMRPipe GDB format table follows, in this case a table of protein backbone random coil chemical shifts:
# Random Coil Chemical Shifts

VARS   RESNAME HA_PPM C_PPM CA_PPM CB_PPM N_PPM
FORMAT %s %7.1f %7.1f %7.1f %7.1f %7.1f

ALA    4.32  177.8   52.3   19.0  123.8
CYS    4.55  174.6   56.9   28.9  118.8
cys    4.71  174.6   55.4   43.7  118.6
ASP    4.64  176.3   54.0   40.8  120.4
GLU    4.35  176.6   56.4   29.7  120.2
PHE    4.62  175.8   58.0   39.0  120.3
GLY    3.96  174.9   45.1 9999.0  108.8
HIS    4.73  173.3   54.5   27.9  118.2
HIH    4.73  173.3   53.3   28.5  118.2
ILE    4.17  176.4   61.3   38.0  119.9
LYS    4.32  176.6   56.5   32.5  120.4
LEU    4.34  177.6   55.1   42.3  121.8
MET    4.48  176.3   55.3   32.6  119.6
ASN    4.74  175.2   52.8   37.9  118.7
PRO    4.42  177.3   63.1   31.7  135.8
GLN    4.34  176.0   56.1   28.4  119.8
ARG    4.34  176.3   56.1   30.3  120.5
SER    4.47  174.6   58.2   63.2  115.7
THR    4.35  174.7   62.1   69.2  113.6
VAL    4.12  176.3   62.3   32.1  119.2
TRP    4.66  176.1   57.7   30.3  121.3
TYR    4.55  175.9   58.1   38.8  120.3
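The layout described above can be read with a few lines of Python. The following is a minimal sketch, not part of NMRPipe: the function name parse_gdb_table is hypothetical, and it assumes that lines beginning with # are comments and that lines beginning with DATA carry header information rather than column values.

```python
# Map the last character of a format specifier to a Python type.
CONVERTERS = {"s": str, "d": int, "f": float, "e": float}

SAMPLE = """\
# Random Coil Chemical Shifts
VARS   RESNAME HA_PPM C_PPM CA_PPM CB_PPM N_PPM
FORMAT %s %7.1f %7.1f %7.1f %7.1f %7.1f
ALA 4.32 177.8 52.3 19.0 123.8
GLY 3.96 174.9 45.1 9999.0 108.8
"""


def parse_gdb_table(text):
    """Return (names, formats, rows); each row is a dict keyed by column name."""
    names, formats, raw_rows = [], [], []
    for line in text.splitlines():
        fields = line.split()
        # Assumption: blank lines, '#' comments and DATA lines hold no column values.
        if not fields or fields[0].startswith("#") or fields[0] == "DATA":
            continue
        if fields[0] == "VARS":
            names = fields[1:]
        elif fields[0] == "FORMAT":
            formats = fields[1:]
        else:
            raw_rows.append(fields)

    # VARS and FORMAT may appear anywhere, so convert only after reading everything.
    if len(names) != len(formats):
        raise ValueError("VARS and FORMAT must describe the same number of columns")
    converters = [CONVERTERS[fmt[-1]] for fmt in formats]

    rows = []
    for fields in raw_rows:
        if len(fields) != len(names):
            raise ValueError("row has %d values, expected %d" % (len(fields), len(names)))
        rows.append({n: conv(v) for n, conv, v in zip(names, converters, fields)})
    return names, formats, rows


names, formats, rows = parse_gdb_table(SAMPLE)
print(rows[0]["RESNAME"], rows[0]["CA_PPM"])
```

Deferring the type conversion until every line has been read is a deliberate choice here, since the VARS and FORMAT lines are not required to come first.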
As noted above, since columns are space-separated, no missing or all-blank values are allowed.
Instead, place-holder values are used to take the place of missing values,
such as "9999.0" for a missing chemical shift value in the table above.
How missing values are handled depends on the application which is using the table.
The GDB table format allows for the keyword
(Null) to be used in
place of a missing value, but not every application which uses the table
format will accept this.
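As an illustration of one way an application might treat missing values, the sketch below maps both the (Null) keyword and a placeholder number to Python's None. The helper name clean_value is hypothetical, and the placeholder 9999.0 is only the convention used in the example table above; other tables may use a different one.

```python
def clean_value(token, convert=float, placeholder=9999.0):
    """Convert one table token, returning None for a missing value."""
    if token == "(Null)":
        return None
    value = convert(token)
    # Assumption: this table marks missing numbers with the given placeholder.
    return None if value == placeholder else value


cb_tokens = ["19.0", "9999.0", "(Null)"]
cleaned = [clean_value(t) for t in cb_tokens]
print(cleaned)
```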
As mentioned above, the FORMAT line uses format specifiers to define the general
type of data in a column. The format specifiers are adapted from those used by the
UNIX/C formatted printing functions such as printf, for example
%e for floating-point values and
%s for text strings.
Importantly, since format specifiers can affect how table values are manipulated in a script, format specifiers for floating point values should always include sufficient precision for the given data.
Specifiers ending with s represent text string data.
Specifiers ending with d represent integer data.
Specifiers ending with f or e both represent floating point data. Specifiers ending with f will be written in decimal form, and specifiers ending in e will be written in scientific notation.
Some example format specifiers follow.
||Specifier||Type||Description||
||%d||Integer||Any integer, with leading minus sign if needed.||
||%4d||Integer||Any integer, with leading minus sign if needed, output as four characters or more.||
||%-4d||Integer||Any integer, output as four characters or more, left justified.||
||%04d||Integer||Any integer, zero-padded on the left to four characters.||
||%f||Float||Any floating point number, with a leading minus sign if needed.||
||%+f||Float||Any floating point number, always including a leading plus or minus sign.||
||%.2f||Float||Any floating point number, with a leading minus sign if needed, with two places after the decimal.||
||%6.2f||Float||Any floating point number, with two places after the decimal, output as six characters or more.||
||%e||Float||Any floating point number, in scientific notation.||
||%.2e||Float||Any floating point number, with two places after the decimal, in scientific notation.||
||%s||String||Any text string.||
||%4s||String||Any text string, output as four characters or more, right justified.||
||%-4s||String||Any text string, output as four characters or more, left-justified.||
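Because these specifiers follow the C printf conventions, Python's % operator can be used to preview how a value would be written. This is only a sketch for experimenting with specifiers; the helper name render is hypothetical and NMRPipe itself is not involved.

```python
def render(spec, value):
    """Format one value with a printf-style specifier."""
    return spec % value


examples = [
    ("%d", -42), ("%4d", 42), ("%-4d", 42), ("%04d", 42),
    ("%f", 3.14159), ("%+f", 3.14159), ("%.2f", 3.14159), ("%6.2f", 3.14159),
    ("%e", 12345.678), ("%.2e", 12345.678),
    ("%s", "ALA"), ("%4s", "HN"), ("%-4s", "HN"),
]

# Brackets make the padding visible.
for spec, value in examples:
    print("%-6s -> [%s]" % (spec, render(spec, value)))

# Too little precision silently discards digits: 4.32 is written as 4.3.
print("[%s]" % render("%7.1f", 4.32))
```

The last line shows why a floating point specifier should carry enough precision for the data: a value rewritten with %7.1f keeps only one decimal place.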
Specification of Atom Names
Many NMRPipe applications use information associated with one or more atoms.
For example, in a chemical shift table, each chemical
shift entry corresponds to a particular atom. In a dipolar coupling table,
each dipolar coupling corresponds to a pair of atoms.
And, in a J-coupling table, each
J-coupling is associated with a torsion, specified by four atoms.
In the NMRPipe table format, a given atom is identified according to a residue ID, residue name, and atom name.
These correspond to table variable names RESID, RESNAME, and ATOMNAME.
The values of RESNAME and ATOMNAME will generally be treated as case-sensitive,
but are usually all-upper-case. In case of systems with more than one chain or molecule,
an atom specification can include an optional chain ID or segment ID. These correspond to table variable names CHAINID and SEGID.
In the case of entries which identify two or more atoms, the variable names will have a postfix, such as _I or _J.
For example, each dipolar coupling entry specifies two atoms,
so each entry must include columns for
RESID_I, RESNAME_I, ATOMNAME_I, RESID_J, RESNAME_J, and ATOMNAME_J, and might also
include chain ID or segment ID columns with the same postfixes.
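The idea of postfixed variable names can be sketched in Python as follows. This assumes the two atoms of a dipolar coupling entry use the postfixes _I and _J (only RESNAME_J is named explicitly in the text above), that the coupling value is stored in a column called D, and that the row has already been read into a dictionary. The helper name atom_id and the row values are illustrative only, not measured data.

```python
def atom_id(row, postfix=""):
    """Build a (residue ID, residue name, atom name) key for one atom of an entry."""
    return (row["RESID" + postfix],
            row["RESNAME" + postfix],
            row["ATOMNAME" + postfix])


# Hypothetical dipolar coupling entry; the values are invented for illustration.
dc_row = {
    "RESID_I": 2, "RESNAME_I": "GLN", "ATOMNAME_I": "HN",
    "RESID_J": 2, "RESNAME_J": "GLN", "ATOMNAME_J": "N",
    "D": -5.31,
}

atom_i = atom_id(dc_row, "_I")
atom_j = atom_id(dc_row, "_J")
print(atom_i, atom_j, dc_row["D"])
```

With an empty postfix the same helper covers single-atom tables such as chemical shift tables, where the columns are plain RESID, RESNAME, and ATOMNAME.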
Specification of Amino Acid Sequence
Many NMRPipe applications, such as TALOS,
use input data specifically for proteins. Many of these applications require that complete amino
acid sequence information is included in an input table. This is commonly
done by including a
DATA FIRST_RESID line and one or more
DATA SEQUENCE lines, as shown in this chemical shift table.
The DATA FIRST_RESID line gives the starting residue
number of the sequence, which is assumed to be 1 if no
DATA FIRST_RESID line is given. The
DATA SEQUENCE lines give the amino acid
sequence in single-letter codes, with the code
X commonly used for non-standard
amino acids. In many examples, the amino acid codes are specified in groups of
10 for clarity. In practice, space characters in the amino acid sequence are ignored,
so the amino acid codes can be grouped in any way, and any number of codes
can be given in one
DATA SEQUENCE line.
For convenience, the NMRPipe table utility
getTabInfo.tcl can create
DATA SEQUENCE text from the sequence of a PDB structure. For example,
getTabInfo.tcl -in 1UBQ.pdb -seqText
will produce output like this:
DATA FIRST_RESID 1
DATA SEQUENCE MQIFVKTLTG KTITLEVEPS DTIENVKAKI QDKEGIPPDQ QRLIFAGKQL
DATA SEQUENCE EDGRTLSDYN IQKESTLHLV LRLRGG
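The sequence rules described above (a default starting residue of 1, spaces ignored, any number of DATA SEQUENCE lines) can be sketched in Python as follows. The function name read_sequence is hypothetical; the sample text is the output shown above.

```python
SEQ_TEXT = """\
DATA FIRST_RESID 1
DATA SEQUENCE MQIFVKTLTG KTITLEVEPS DTIENVKAKI QDKEGIPPDQ QRLIFAGKQL
DATA SEQUENCE EDGRTLSDYN IQKESTLHLV LRLRGG
"""


def read_sequence(text):
    """Return a dict mapping residue number to one-letter amino acid code."""
    first_resid = 1  # default when no DATA FIRST_RESID line is given
    parts = []
    for line in text.splitlines():
        fields = line.split()
        if fields[:2] == ["DATA", "FIRST_RESID"]:
            first_resid = int(fields[2])
        elif fields[:2] == ["DATA", "SEQUENCE"]:
            # Splitting on whitespace discards the grouping spaces.
            parts.extend(fields[2:])
    sequence = "".join(parts)
    return {first_resid + i: code for i, code in enumerate(sequence)}


residues = read_sequence(SEQ_TEXT)
print(len(residues), residues[1], residues[76])
```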
Applications for Manipulating Tables
NMRPipe includes several general-purpose applications for manipulating GDB-Format tables. There are applications to display tables, plot data from tables, sort and adjust table values, select a subset of entries according to a condition, and extract table values for use with other scripts.
||Application||Use||
||delTab.tcl||Delete randomly or systematically selected rows.||
||diffTab.tcl||Form the difference between values in two related tables.||
||plotTab.tcl||Draw an XY Plot of two columns in a table.||
||svdTab.tcl||Perform linear least squares on columns of a table.||
||addTabNoise.tcl||Add random noise to data in a table.||
||fitTab.tcl||Apply a fitting function to data from a table.||
||addTabVar.tcl||Add new columns (variables) to a table.||
||getTabCol.tcl||Get the values from a given table column.||
||selectTab.tcl||Select and save table entries according to a condition.||
||adjTab.tcl||Adjust or set values in a table.||
||getTabInfo.tcl||Get information about a table.||
||appendTab.tcl||Join two related tables.||
||getTabRow.tcl||Get the values from a given table row.||
||showTab.tcl||Display a table in an interactive viewer.||
For example, to extract and print all the chemical shift values in a chemical shift table:
selectTab.tcl -in csObs.tab -var SHIFT
To extract shifts as above, for residues 10 to 20, alone:
selectTab.tcl -in csObs.tab -var SHIFT -cond "RESID >= 10 && RESID <= 20"
To extract only HN shifts:
selectTab.tcl -in csObs.tab -var SHIFT -cond "strmatch( ATOMNAME, 'HN' )"
To list the variable names in a table:
getTabInfo.tcl -in dCalcA.tab -parm varNames
To extract the DATA SAUPE values in a dipolar coupling calculation output:
getTabInfo.tcl -in dCalcA.tab -key SAUPE -data 0
To add a new column of floating point data called W to an existing table, with an initial value of 100.0 in each row:
addTabVar.tcl -in dc.tab -out dcw.tab -var W -float -fmt %7.3f -val 100.0
To concatenate two PDB files and renumber their atom ID values:
cat a.pdb b.pdb > ab.pdb
adjTab.tcl -in ab.pdb -out new.pdb -pdb -renumber
To extract the first residue ID and the one-letter amino acid sequence from a protein PDB file:
getTabInfo.tcl -in ref.pdb -seqInfo
To display protein sequence information from a PDB file and print it in the DATA SEQUENCE format used in NMRPipe tables:
getTabInfo.tcl -in ref.pdb -seqText
As shown in the previous examples, many of the table manipulation applications above can accept input in the form of a PDB file. In this case, the variable names used to access the table are automatically set as if the PDB file had VARS and FORMAT lines. Since a PDB file is not space-delimited, data is extracted according to a range of character positions, where the first character in a line is position 1:
||Variable Name||Format||Character Range||
||ATOMID||%d||7 - 11||
||ATOMNAME||%s||13 - 16||
||LOCID||%s||17 - 17||
||RESNAME||%s||18 - 21||
||CHAINID||%s||22 - 22||
||RESID||%d||23 - 26||
||ICODE||%s||27 - 27||
||X||%.3f||31 - 38||
||Y||%.3f||39 - 46||
||Z||%.3f||47 - 54||
||OCCUPANCY||%.2f||55 - 60||
||TEMPFACTOR||%.2f||61 - 66||
||SEGID||%s||73 - 76||
||ELEMENT||%s||77 - 78||
||CHARGE||%s||79 - 80||
Given the above definitions, it is possible to manipulate a PDB file with NMRPipe's general-purpose table tools. For example, to find the average values of the X, Y, and Z coordinates of a PDB file, these commands:
set xc = `getTabCol.tcl -in ref.pdb -pdb -var X`
set yc = `getTabCol.tcl -in ref.pdb -pdb -var Y`
set zc = `getTabCol.tcl -in ref.pdb -pdb -var Z`
getStat.tcl -stat Avg -x $xc
getStat.tcl -stat Avg -x $yc
getStat.tcl -stat Avg -x $zc
would produce output like this:
Avg 30.2337878049
Avg 28.9899536585
Avg 15.3499943089
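For readers who want to see how the character ranges for PDB input translate into code, here is a minimal Python sketch that slices a few of the fields by position and averages the coordinates. It is not how NMRPipe is implemented: the names PDB_COLUMNS, parse_atom_line, and average are hypothetical, only a subset of the variables is included, and the two sample lines use illustrative values.

```python
# (first position, last position, converter); positions are 1-based and inclusive.
PDB_COLUMNS = {
    "ATOMID": (7, 11, int),
    "ATOMNAME": (13, 16, str),
    "RESNAME": (18, 21, str),
    "CHAINID": (22, 22, str),
    "RESID": (23, 26, int),
    "X": (31, 38, float),
    "Y": (39, 46, float),
    "Z": (47, 54, float),
}

SAMPLE_PDB = """\
ATOM      1  N   MET A   1      27.340  24.430   2.614
ATOM      2  CA  MET A   1      26.266  25.413   2.842
"""


def parse_atom_line(line):
    """Extract the fields of one ATOM/HETATM line by character position."""
    return {name: conv(line[start - 1:end].strip())
            for name, (start, end, conv) in PDB_COLUMNS.items()}


def average(values):
    return sum(values) / len(values)


atoms = [parse_atom_line(line) for line in SAMPLE_PDB.splitlines()
         if line.startswith(("ATOM", "HETATM"))]

for axis in ("X", "Y", "Z"):
    print("Avg", average([atom[axis] for atom in atoms]))
```

Slicing with start - 1 converts the 1-based, inclusive ranges of the table into Python's 0-based, end-exclusive slices.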