NMRPipe Table Format

Many facilities in NMRPipe use input or produce output in the form of a multi-column text table. Examples include peak tables, chemical shift tables, and tables of dipolar couplings. NMRPipe has its own general-purpose table format, along with several tools for manipulating such tables. In NMRPipe documentation, this format is sometimes referred to as the Generic Database format, or GDB.

The NMRPipe GDB Table format is a text file which contains header information and one or more columns of data:

The table is expected to be UNIX-style plain text, with newline characters at the end of each line.
The header consists of a VARS line and a FORMAT line. These can be at any location in the table, but are usually together near the top of the table, before any columns of data.
The columns in the table can be separated by any number of spaces.
Any number of blank lines are allowed.
The values in a given column should all be of the same general type, either integer, floating point, or text string values.
The VARS line labels the column names (variables). The names are case-sensitive, but are usually all-upper-case.
The FORMAT line uses format specifiers to define the type of data in a given column. The format specifiers also define the format and precision used if the table values are written by an application to a new table, or if the table values are manipulated in a script.
Since format specifiers can affect how table values are manipulated in a script, the details of the format specifiers in a given table must be suitable for the magnitude and precision of the data values. Format specifiers are discussed in more detail below.
Any floating point value in the table can be given in any of the usual decimal or scientific notation formats.
Optional comment lines can start with either the character # or the keyword REMARK.
The table may contain one or more DATA lines. This provides a way to include other types of information in a table. For example, DATA SEQUENCE lines are used to include protein amino acid sequence information in a chemical shift table.
Text values which contain spaces can be quoted by single quotes ' or double quotes ". However, not every application can accept text values which contain spaces, so quotes are not widely used.
Since columns are space-separated, no "all-blank" missing values are allowed; instead, a place-holder value must be used (see below).
Currently, text string values are limited to about 1,000 characters in length.

An example NMRPipe GDB format table follows, in this case a table of protein backbone random coil chemical shifts:


   # Random Coil Chemical Shifts

   VARS   RESNAME HA_PPM C_PPM CA_PPM CB_PPM N_PPM   
   FORMAT %s      %7.1f  %7.1f %7.1f  %7.1f  %7.1f
   
    ALA      4.32   177.8   52.3   19.0   123.8  
    CYS      4.55   174.6   56.9   28.9   118.8
    cys      4.71   174.6   55.4   43.7   118.6
    ASP      4.64   176.3   54.0   40.8   120.4
    GLU      4.35   176.6   56.4   29.7   120.2
    PHE      4.62   175.8   58.0   39.0   120.3
    GLY      3.96   174.9   45.1 9999.0   108.8
    HIS      4.73   173.3   54.5   27.9   118.2
    HIH      4.73   173.3   53.3   28.5   118.2
    ILE      4.17   176.4   61.3   38.0   119.9
    LYS      4.32   176.6   56.5   32.5   120.4
    LEU      4.34   177.6   55.1   42.3   121.8
    MET      4.48   176.3   55.3   32.6   119.6
    ASN      4.74   175.2   52.8   37.9   118.7
    PRO      4.42   177.3   63.1   31.7   135.8 
    GLN      4.34   176.0   56.1   28.4   119.8
    ARG      4.34   176.3   56.1   30.3   120.5
    SER      4.47   174.6   58.2   63.2   115.7
    THR      4.35   174.7   62.1   69.2   113.6
    VAL      4.12   176.3   62.3   32.1   119.2
    TRP      4.66   176.1   57.7   30.3   121.3
    TYR      4.55   175.9   58.1   38.8   120.3
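
The rules above are simple enough to read such a table with a few lines of general-purpose code. The following Python sketch is only an illustration and is not part of NMRPipe; the function names parse_gdb_table and converter_for are hypothetical. It assumes that no text value is quoted or contains spaces, that no data row begins with the words VARS, FORMAT, DATA, or REMARK, and that no (Null) place-holders appear in numeric columns.

```python
def converter_for(spec):
    """Map a format specifier such as %7.1f to a Python type by its last letter."""
    kind = spec[-1]
    if kind == "d":
        return int
    if kind in ("f", "e"):
        return float
    return str


def parse_gdb_table(text):
    """Return (names, formats, rows, data) from GDB-style table text."""
    names, formats, raw_rows, data = [], [], [], {}
    for line in text.splitlines():
        fields = line.split()  # columns are separated by any number of spaces
        if not fields or fields[0].startswith("#") or fields[0] == "REMARK":
            continue  # blank line or comment
        if fields[0] == "VARS":
            names = fields[1:]
        elif fields[0] == "FORMAT":
            formats = fields[1:]
        elif fields[0] == "DATA":
            if len(fields) > 1:
                data.setdefault(fields[1], []).append(" ".join(fields[2:]))
        else:
            raw_rows.append(fields)
    # Convert only after the whole table is read, since VARS and FORMAT
    # may appear at any location in the table.
    converters = [converter_for(spec) for spec in formats]
    rows = [
        {name: conv(value) for name, conv, value in zip(names, converters, row)}
        for row in raw_rows
    ]
    return names, formats, rows, data


sample = """
# Random Coil Chemical Shifts
VARS   RESNAME HA_PPM C_PPM CA_PPM CB_PPM N_PPM
FORMAT %s      %7.1f  %7.1f %7.1f  %7.1f  %7.1f

ALA      4.32   177.8   52.3   19.0   123.8
GLY      3.96   174.9   45.1 9999.0   108.8
"""

names, formats, rows, data = parse_gdb_table(sample)
print(names)
print(rows[0]["RESNAME"], rows[0]["CA_PPM"])
```

Each row is returned as a dictionary keyed by the case-sensitive variable names from the VARS line.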

Missing Values

As noted above, since columns are space-separated, no missing or all-blank values are allowed. Instead, place-holder values are used to take the place of missing values, such as "9999.0" for a missing chemical shift value in the table above. How missing values are handled depends on the application which is using the table. The GDB table format allows for the keyword (Null) to be used in place of a missing value, but not every application which uses the table format will accept this.
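
Because the handling of missing values is left to each application, a script that reads a table usually has to filter them explicitly. The Python sketch below shows one possible approach and is not NMRPipe code; the function names are hypothetical, and the choice of 9999.0 as the place-holder is an assumption taken from the example table above.

```python
def parse_shift(token, placeholder=9999.0):
    """Return the value as a float, or None if the token marks a missing value."""
    if token == "(Null)":
        return None
    value = float(token)
    # Assumption: 9999.0 is the place-holder used by this particular table.
    return None if value == placeholder else value


def mean_ignoring_missing(tokens):
    """Average the values that are present; return None if all are missing."""
    values = [v for v in map(parse_shift, tokens) if v is not None]
    return sum(values) / len(values) if values else None


cb_column = ["19.0", "9999.0", "(Null)", "21.0"]
print(mean_ignoring_missing(cb_column))
```

Without such a filter, a place-holder such as 9999.0 would be averaged along with the real chemical shifts.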

FORMAT Specifiers

As mentioned above, the FORMAT line uses format specifiers to define the general type of data in a column. The format specifiers are adapted from those used by the UNIX/C formatted printing functions such as printf: %d for integers, %f or %e for floating-point values, and %s for text strings.

Importantly, since format specifiers can affect how table values are manipulated in a script, format specifiers for floating point values should always include sufficient precision for the given data.

Format specifiers always begin with the % percent character.
Format specifiers which end in s represent text string data.
Format specifiers which end in d represent integer data.
Specifiers which end in f and e both represent floating point data. Specifiers ending with f will be written in decimal form, and specifiers ending in e will be written in scientific notation.
When written in an output, values are right-justified by default, and can be left-justified by using a leading minus sign - in the format specifier.

Note that values in an input table do not need to match the details of the format specifier; they only need to match the general type: text string, integer, or floating point value. Specifically:

Text string values can contain any number of characters (up to about 1,000 characters maximum length) regardless of any size given in the format specifier.
Integer values can contain any number of digits, subject to the limits of four-byte signed integers, regardless of any size given in the format specifier.
Any floating point value can be given in any common format, subject to the limits of four-byte single-precision data, in either decimal or scientific notation, regardless of the details given in the format specifier.

Some example format specifiers follow.

Format	Type	Example	Meaning
%d	Integer	`16`	Any integer, with leading minus sign if needed.
%4d	Integer	`16`	Any integer, with leading minus sign if needed, output as four characters or more.
%-4d	Integer	`16`	Any integer, output as four characters or more, left justified.
%04d	Integer	`0016`	Any integer, zero-padded on the left to four characters.
%f	Float	`3.276400`	Any Floating Point Number, with a leading minus sign if needed.
%+f	Float	`+3.276400`	Any Floating Point Number, always including a leading plus or minus sign.
%.2f	Float	`3.28`	Any Floating Point Number, with a leading minus sign if needed, with two places after the decimal.
%6.2f	Float	`3.28`	Any Floating Point Number, with two places after the decimal, output as six characters or more.
%e	Float	`1.200352e+02`	Any Floating Point Number, in scientific notation.
%.2e	Float	`1.20e+02`	Any Floating Point Number, with two places after the decimal, in scientific notation.
%s	String	`GLY`	Any Text String.
%4s	String	`GLY`	Any Text String, output as four characters or more, right justified.
%-4s	String	`GLY`	Any Text String, output as four characters or more, left-justified.
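
Python's % string operator follows the same printf conventions, so it offers a quick way to preview how a specifier renders a value. The sketch below uses the specifiers and values from the table above. Whether NMRPipe output matches Python's in every detail is an assumption. The last lines illustrate the precision warning given earlier: rewriting a value through a specifier with too few decimal places discards digits.

```python
# Specifier/value pairs taken from the examples table above.
examples = [
    ("%d", 16), ("%4d", 16), ("%-4d", 16), ("%04d", 16),
    ("%f", 3.2764), ("%+f", 3.2764), ("%.2f", 3.2764), ("%6.2f", 3.2764),
    ("%e", 120.0352), ("%.2e", 120.0352),
    ("%s", "GLY"), ("%4s", "GLY"), ("%-4s", "GLY"),
]

# Render each value; brackets make the padding and justification visible.
rendered = {spec: spec % value for spec, value in examples}
for spec, text in rendered.items():
    print("%-6s -> [%s]" % (spec, text))

# Writing 4.32 with %7.1f and reading it back keeps only one decimal place.
ha_ppm = 4.32
rewritten = float("%7.1f" % ha_ppm)
print(ha_ppm, "->", rewritten)
```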

Specification of Atom Names

Many NMRPipe applications use information associated with one or more atoms. For example, in a chemical shift table, each chemical shift entry corresponds to a particular atom. In a dipolar coupling table, each dipolar coupling corresponds to a pair of atoms I and J. In a J-coupling table, each J-coupling is associated with a torsion, specified by four atoms I, J, K, and L.

In the NMRPipe table format, a given atom is identified according to a residue ID, residue name, and atom name. These correspond to the table variable names RESID, RESNAME, and ATOMNAME. Values for RESNAME and ATOMNAME will generally be treated as case-sensitive, but are usually all-upper-case. For systems with more than one chain or molecule, an atom specification can include an optional chain ID or segment ID. These correspond to the table variable names CHAINNAME and SEGNAME.

In the case of entries which identify two or more atoms, the variable names will have a post-fix such as _I _J _K or _L. For example, each dipolar coupling entry specifies two atoms I and J, so each entry must include columns for RESID_I ATOMNAME_I RESNAME_I and RESID_J ATOMNAME_J RESNAME_J, and might also include columns for CHAINNAME_I and CHAINNAME_J or SEGNAME_I and SEGNAME_J.
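
The naming convention can be expressed compactly in code. The Python sketch below is illustrative only: the helper names atom_columns and atom_key are hypothetical, and the column D and the values in the sample row are made up for the example.

```python
ATOM_FIELDS = ("RESID", "RESNAME", "ATOMNAME")


def atom_columns(suffixes=("",), extra=()):
    """List the column names needed to identify one atom per suffix."""
    return [field + suffix
            for suffix in suffixes
            for field in ATOM_FIELDS + tuple(extra)]


def atom_key(row, suffix=""):
    """Collect (RESID, RESNAME, ATOMNAME) for the atom with the given post-fix."""
    return tuple(row[field + suffix] for field in ATOM_FIELDS)


# Columns for a two-atom entry such as a dipolar coupling, with chain names.
dc_columns = atom_columns(("_I", "_J"), extra=("CHAINNAME",))
print(dc_columns)

# Hypothetical row: the D column and all values are illustrative only.
row = {"RESID_I": 2, "RESNAME_I": "GLN", "ATOMNAME_I": "N",
       "RESID_J": 2, "RESNAME_J": "GLN", "ATOMNAME_J": "HN", "D": -3.2}
print(atom_key(row, "_I"), atom_key(row, "_J"))
```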

Specification of Amino Acid Sequence

Many NMRPipe applications, such as TALOS, use input data specifically for proteins. Many of these applications require that complete amino acid sequence information is included in an input table. This is commonly done by including a DATA FIRST_RESID and one or more DATA SEQUENCE lines, as shown in this chemical shift table.

The FIRST_RESID line gives the starting residue number of the sequence, which is assumed to be 1 if no FIRST_RESID line is given. The DATA SEQUENCE lines give the amino acid sequence in single-letter codes, with the code X commonly used for non-standard amino acids. In many examples, the amino acid codes are specified in groups of 10 for clarity. In practice, space characters in the amino acid sequence are ignored, so the amino acid codes can be grouped in any way, and any number of codes can be given in one DATA SEQUENCE line. For convenience, the NMRPipe table utility getTabInfo.tcl can create DATA SEQUENCE text from the sequence of a PDB structure. For example, the command:

   getTabInfo.tcl -in 1UBQ.pdb -seqText

will produce output like this:

   DATA FIRST_RESID 1

   DATA SEQUENCE MQIFVKTLTG KTITLEVEPS DTIENVKAKI QDKEGIPPDQ QRLIFAGKQL
   DATA SEQUENCE EDGRTLSDYN IQKESTLHLV LRLRGG
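
A script that needs the residue numbering can rebuild it from these lines. The Python sketch below follows the rules described above: the first residue defaults to 1, and spaces within the sequence are ignored. It is not NMRPipe code, and the function name read_sequence is hypothetical.

```python
def read_sequence(text):
    """Map residue number -> one-letter code from DATA lines in table text."""
    first_resid = 1  # assumed when no DATA FIRST_RESID line is present
    chunks = []
    for line in text.splitlines():
        fields = line.split()
        if fields[:2] == ["DATA", "FIRST_RESID"] and len(fields) > 2:
            first_resid = int(fields[2])
        elif fields[:2] == ["DATA", "SEQUENCE"]:
            # Splitting on whitespace discards the spaces between groups.
            chunks.append("".join(fields[2:]))
    sequence = "".join(chunks)
    return {first_resid + i: code for i, code in enumerate(sequence)}


table_text = """
DATA FIRST_RESID 1

DATA SEQUENCE MQIFVKTLTG KTITLEVEPS DTIENVKAKI QDKEGIPPDQ QRLIFAGKQL
DATA SEQUENCE EDGRTLSDYN IQKESTLHLV LRLRGG
"""

residues = read_sequence(table_text)
print(len(residues), residues[1], residues[48], residues[76])
```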

Applications for Manipulating Tables

NMRPipe includes several general-purpose applications for manipulating GDB-Format tables. There are applications to display tables, plot data from tables, sort and adjust table values, select a subset of entries according to a condition, and extract table values for use with other scripts.

Application	Use
delTab.tcl	Delete randomly or systematically selected rows.
diffTab.tcl	Form the difference between values in two related tables.
plotTab.tcl	Draw an XY Plot of two columns in a table.
svdTab.tcl	Perform linear least squares on columns of a table.
addTabNoise.tcl	Add random noise to data in a table.
fitTab.tcl	Apply a fitting function to data from a table.
addTabVar.tcl	Add new columns (variables) to a table.
getTabCol.tcl	Get the values from a given table column.
selectTab.tcl	Select and save table entries according to a condition.
adjTab.tcl	Adjust or set values in a table.
getTabInfo.tcl	Get information about a table.
appendTab.tcl	Join two related tables.
getTabRow.tcl	Get the values from a given table row.
showTab.tcl	Display a table in an interactive viewer.

For example, to extract and print all the chemical shift values in a chemical shift table:

   selectTab.tcl -in csObs.tab -var SHIFT

To extract shifts as above, but only for residues 10 to 20:

   selectTab.tcl -in csObs.tab -var SHIFT -cond "RESID >= 10 && RESID <= 20"

To extract only HN shifts:

   selectTab.tcl -in csObs.tab -var SHIFT -cond "strmatch( ATOMNAME, 'HN' )"

To list the variable names in a table:

   getTabInfo.tcl -in dCalcA.tab -parm varNames

To extract the DATA SAUPE values in a dipolar coupling calculation output:

   getTabInfo.tcl -in dCalcA.tab -key SAUPE -data 0

To add a new column of floating point data called W to an existing table, with an initial value of 100.0 in each row:

   addTabVar.tcl -in dc.tab -out dcw.tab -var W -float -fmt %7.3f -val 100.0

To concatenate two PDB files and renumber their atom ID values:

   cat a.pdb b.pdb > ab.pdb
   adjTab.tcl -in ab.pdb -out new.pdb -pdb -renumber

Extract the first residue ID and the one-letter amino acid sequence from a protein PDB file:

   getTabInfo.tcl -in ref.pdb -seqInfo

Display protein sequence information from a PDB file and print it in the DATA SEQUENCE format used in NMRPipe tables:

   getTabInfo.tcl -in ref.pdb -seqText

PDB Files

As shown in the previous examples, many of the table manipulation applications above can accept input in the form of a PDB file. In this case, the variable names used to access the table are automatically set as if the PDB file had VARS and FORMAT lines. Since a PDB file is not space-delimited, data is extracted according to a range of character positions, where the first character in a line is position 1:

Variable Name	Format	Character Range
ATOMID	%d	7 - 11
ATOMNAME	%s	13 - 16
LOCID	%s	17 - 17
RESNAME	%s	18 - 21
CHAINID	%s	22 - 22
RESID	%d	23 - 26
ICODE	%s	27 - 27
X	%.3f	31 - 38
Y	%.3f	39 - 46
Z	%.3f	47 - 54
OCCUPANCY	%.2f	55 - 60
TEMPFACTOR	%.2f	61 - 66
SEGID	%s	73 - 76
ELEMENT	%s	77 - 78
CHARGE	%s	79 - 80

Given the above definitions, it is possible to manipulate a PDB file with NMRPipe's general-purpose table tools. For example, to find the average values of the X Y and Z coordinates of a PDB file, these commands:

  set xc = `getTabCol.tcl -in ref.pdb -pdb -var X`
  set yc = `getTabCol.tcl -in ref.pdb -pdb -var Y`
  set zc = `getTabCol.tcl -in ref.pdb -pdb -var Z`

  getStat.tcl -stat Avg -x $xc
  getStat.tcl -stat Avg -x $yc
  getStat.tcl -stat Avg -x $zc

would produce output like this:

   Avg 30.2337878049
   Avg 28.9899536585
   Avg 15.3499943089
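
For comparison, the same character ranges can be applied directly in a general-purpose language. The Python sketch below slices fixed-width ATOM lines using the ranges in the table above and averages the coordinates. It is an illustration rather than NMRPipe's implementation: the helper names are hypothetical, the two sample atoms use made-up coordinates, and only ATOM lines are assumed to be passed in.

```python
# Field layout from the table above: name -> (converter, first, last),
# using 1-based inclusive character positions.
PDB_FIELDS = {
    "ATOMID": (int, 7, 11),
    "ATOMNAME": (str, 13, 16),
    "LOCID": (str, 17, 17),
    "RESNAME": (str, 18, 21),
    "CHAINID": (str, 22, 22),
    "RESID": (int, 23, 26),
    "ICODE": (str, 27, 27),
    "X": (float, 31, 38),
    "Y": (float, 39, 46),
    "Z": (float, 47, 54),
    "OCCUPANCY": (float, 55, 60),
    "TEMPFACTOR": (float, 61, 66),
    "SEGID": (str, 73, 76),
    "ELEMENT": (str, 77, 78),
    "CHARGE": (str, 79, 80),
}


def parse_atom_line(line):
    """Slice one ATOM line by character position; blank fields become None."""
    padded = line.rstrip("\n").ljust(80)  # short lines lack trailing fields
    record = {}
    for name, (convert, first, last) in PDB_FIELDS.items():
        text = padded[first - 1:last].strip()
        record[name] = convert(text) if text else None
    return record


def make_atom_line(atom_id, atom_name, res_name, chain, res_id, x, y, z):
    """Build an illustrative ATOM line with each field in its column range."""
    return "ATOM  %5d %-4s %-4s%1s%4d    %8.3f%8.3f%8.3f%6.2f%6.2f" % (
        atom_id, atom_name, res_name, chain, res_id, x, y, z, 1.0, 0.0)


# Two made-up atoms; the coordinates are not from a real structure.
lines = [
    make_atom_line(1, "N", "MET", "A", 1, 10.0, 1.5, -2.0),
    make_atom_line(2, "CA", "MET", "A", 1, 20.0, 2.5, -4.0),
]
atoms = [parse_atom_line(line) for line in lines]

averages = {axis: sum(a[axis] for a in atoms) / len(atoms) for axis in "XYZ"}
for axis in "XYZ":
    print("Avg", averages[axis])
```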

last updated: Dec 6 2011 / big fd