NMRPipe Table Format
Many facilities in NMRPipe use input or produce output in the form of a multi-column text table. Examples include peak tables, chemical shift tables, and tables of dipolar couplings. NMRPipe has its own general-purpose table format, along with several tools for manipulating such tables. In NMRPipe documentation, this format is sometimes referred to as the Generic Database format, or GDB.
The NMRPipe GDB Table format is a text file which contains header information and one or more columns of data:
- Columns are separated by spaces, so no missing or all-blank values are allowed.
- A table includes a VARS line and a FORMAT line. These can be at any location in the table, but are usually together near the top of the table, before any columns of data.
- The VARS line labels the column names (variables). The names are case-sensitive, but are usually all-upper-case.
- The FORMAT line uses format specifiers to define the type of data in a given column. The format specifiers also define the format and precision used if the table values are written by an application to a new table, or if the table values are manipulated in a script.
- Comment lines begin with # or with the keyword REMARK.
An example NMRPipe GDB format table follows, in this case a table of protein backbone random coil chemical shifts:
    # Random Coil Chemical Shifts

    VARS   RESNAME HA_PPM C_PPM CA_PPM CB_PPM N_PPM
    FORMAT %s %7.1f %7.1f %7.1f %7.1f %7.1f

    ALA 4.32 177.8 52.3 19.0 123.8
    CYS 4.55 174.6 56.9 28.9 118.8
    cys 4.71 174.6 55.4 43.7 118.6
    ASP 4.64 176.3 54.0 40.8 120.4
    GLU 4.35 176.6 56.4 29.7 120.2
    PHE 4.62 175.8 58.0 39.0 120.3
    GLY 3.96 174.9 45.1 9999.0 108.8
    HIS 4.73 173.3 54.5 27.9 118.2
    HIH 4.73 173.3 53.3 28.5 118.2
    ILE 4.17 176.4 61.3 38.0 119.9
    LYS 4.32 176.6 56.5 32.5 120.4
    LEU 4.34 177.6 55.1 42.3 121.8
    MET 4.48 176.3 55.3 32.6 119.6
    ASN 4.74 175.2 52.8 37.9 118.7
    PRO 4.42 177.3 63.1 31.7 135.8
    GLN 4.34 176.0 56.1 28.4 119.8
    ARG 4.34 176.3 56.1 30.3 120.5
    SER 4.47 174.6 58.2 63.2 115.7
    THR 4.35 174.7 62.1 69.2 113.6
    VAL 4.12 176.3 62.3 32.1 119.2
    TRP 4.66 176.1 57.7 30.3 121.3
    TYR 4.55 175.9 58.1 38.8 120.3
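To make the layout concrete, here is a minimal Python sketch of a reader for a table like the one above. It is not part of NMRPipe; the function names `parse_gdb_table` and `_converter` are hypothetical, and the sketch assumes that blank lines, lines starting with `#`, and lines starting with `REMARK` carry no column data, and that `DATA` lines can be skipped.

```python
def parse_gdb_table(text):
    """Parse GDB-style table text into (names, formats, rows)."""
    names, formats, raw_rows = [], [], []
    for line in text.splitlines():
        fields = line.split()
        # Assumption: blank, '#', and REMARK lines carry no column data.
        if not fields or fields[0].startswith("#") or fields[0] == "REMARK":
            continue
        if fields[0] == "VARS":
            names = fields[1:]
        elif fields[0] == "FORMAT":
            formats = fields[1:]
        elif fields[0] == "DATA":
            continue  # header data (e.g. DATA SEQUENCE) is ignored in this sketch
        else:
            raw_rows.append(fields)

    # VARS and FORMAT may appear anywhere, so convert only after the full scan.
    converters = [_converter(spec) for spec in formats]
    rows = [
        {name: conv(tok) for name, conv, tok in zip(names, converters, fields)}
        for fields in raw_rows
    ]
    return names, formats, rows


def _converter(spec):
    """Choose a Python type from the last character of a format specifier."""
    kind = spec[-1]
    if kind == "d":
        return int
    if kind in ("f", "e"):
        return float
    return str


SAMPLE = """\
# Random Coil Chemical Shifts
VARS   RESNAME HA_PPM C_PPM CA_PPM CB_PPM N_PPM
FORMAT %s %7.1f %7.1f %7.1f %7.1f %7.1f

ALA 4.32 177.8 52.3 19.0 123.8
GLY 3.96 174.9 45.1 9999.0 108.8
"""

names, formats, rows = parse_gdb_table(SAMPLE)
print(names)
print(rows[0])
```

Each row becomes a dictionary keyed by the case-sensitive variable names from the VARS line, which is one simple way to access values by column name.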
Missing Values
As noted above, since columns are space-separated, no missing or all-blank values are allowed.
Instead, place-holder values are used to take the place of missing values,
such as "9999.0" for a missing chemical shift value in the table above.
How missing values are handled depends on the application which is using the table.
The GDB table format allows for the keyword (Null) to be used in place of a missing value, but not every application which uses the table format will accept this.
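As an illustration of application-side handling, the following Python sketch maps place-holder values to `None` when reading a numeric column. The helper name `clean_value` is hypothetical, the place-holder 9999.0 is simply the one used in the example table above, and treating the literal text `(Null)` as the null keyword is an assumption based on the wording here; other tables and tools may use different conventions.

```python
MISSING_FLOAT = 9999.0  # assumption: place-holder from the example table


def clean_value(token, placeholder=MISSING_FLOAT):
    """Return None for a missing value, otherwise the token as a float."""
    if token == "(Null)":  # assumption: literal spelling of the null keyword
        return None
    value = float(token)
    return None if value == placeholder else value


cb_shifts = [clean_value(tok) for tok in ["19.0", "9999.0", "(Null)"]]
print(cb_shifts)
```

Converting place-holders at read time keeps values such as 9999.0 from being averaged or plotted as if they were real chemical shifts.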
FORMAT Specifiers
As mentioned above, the FORMAT line uses format specifiers to define the general type of data in a column. The format specifiers are adapted from those used by the UNIX/C formatted printing functions such as printf: %d for integers, %f or %e for floating-point values, and %s for text strings.
Importantly, since format specifiers can affect how table values are manipulated in a script, format specifiers for floating-point values should always include sufficient precision for the given data.
- Each format specifier begins with a % percent character.
- Specifiers ending with s represent text string data.
- Specifiers ending with d represent integer data.
- Specifiers ending with f and e both represent floating point data. Specifiers ending with f will be written in decimal form, and specifiers ending in e will be written in scientific notation.
Some example format specifiers follow.
Format | Type | Example | Meaning |
---|---|---|---|
%d | Integer | 16 | Any integer, with leading minus sign if needed. |
%4d | Integer | 16 | Any integer, with leading minus sign if needed, output as four characters or more. |
%-4d | Integer | 16 | Any integer, output as four characters or more, left-justified. |
%04d | Integer | 0016 | Any integer, zero-padded on the left to four characters. |
%f | Float | 3.276400 | Any floating point number, with a leading minus sign if needed. |
%+f | Float | +3.276400 | Any floating point number, always including a leading plus or minus sign. |
%.2f | Float | 3.28 | Any floating point number, with a leading minus sign if needed, with two places after the decimal. |
%6.2f | Float | 3.28 | Any floating point number, with two places after the decimal, output as six characters or more. |
%e | Float | 1.200352e+02 | Any floating point number, in scientific notation. |
%.2e | Float | 1.20e+02 | Any floating point number, with two places after the decimal, in scientific notation. |
%s | String | GLY | Any text string. |
%4s | String | GLY | Any text string, output as four characters or more, right-justified. |
%-4s | String | GLY | Any text string, output as four characters or more, left-justified. |
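Python's `%` string operator follows the same printf conventions, so it offers a quick way to preview how a specifier will render a value. The sketch below is only an illustration, and the helper `format_row` is a hypothetical name. It also shows why precision matters: writing 4.32 with `%7.1f` keeps only one decimal place.

```python
def format_row(formats, values):
    """Render one table row by applying each format specifier to its value."""
    return " ".join(spec % value for spec, value in zip(formats, values))


# Preview a few specifiers; repr() makes padding spaces visible.
for spec, value in [("%4d", 16), ("%-4d", 16), ("%04d", 16),
                    ("%+f", 3.2764), ("%6.2f", 3.2764),
                    ("%.2e", 120.0352), ("%4s", "GLY")]:
    print(spec, repr(spec % value))

# Precision loss: the second decimal of 4.32 is dropped by %7.1f.
print(repr(format_row(["%s", "%7.1f"], ["ALA", 4.32])))
```

If a column holds values with two meaningful decimals, a specifier such as `%7.2f` would preserve them when the table is rewritten.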
Specification of Atom Names
Many NMRPipe applications use information associated with one or more atoms. For example, in a chemical shift table, each chemical shift entry corresponds to a particular atom. In a dipolar coupling table, each dipolar coupling corresponds to a pair of atoms I and J. And, in a J-coupling table, each J-coupling is associated with a torsion, specified by four atoms I, J, K, and L.
In the NMRPipe table format, a given atom is identified according to a residue ID, residue name, and atom name. These correspond to table variable names RESID, RESNAME, and ATOMNAME. Values for RESNAME and ATOMNAME will generally be treated as case-sensitive, but are usually all-upper-case. In the case of systems with more than one chain or molecule, an atom specification can include an optional chain ID or segment ID. These correspond to table variable names CHAINNAME and SEGNAME.
In the case of entries which identify two or more atoms, the variable names will have a suffix such as _I, _J, _K, or _L. For example, each dipolar coupling entry specifies two atoms I and J, so each entry must include columns for RESID_I, ATOMNAME_I, RESNAME_I and RESID_J, ATOMNAME_J, RESNAME_J, and might also include columns for CHAINNAME_I and CHAINNAME_J or SEGNAME_I and SEGNAME_J.
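The naming scheme above lends itself to a small lookup helper. The Python sketch below is a hypothetical illustration, not NMRPipe code: `atom_key` builds an atom identifier from a row dictionary and an optional suffix, and the example row, including the column name `D` and all of its values, is made up for demonstration.

```python
def atom_key(row, suffix=""):
    """Build (chain_or_segment, resid, resname, atomname) for one atom.

    suffix is "" for single-atom tables, or "_I", "_J", "_K", "_L"
    for entries that identify several atoms.
    """
    # The chain or segment column is optional; fall back to None if absent.
    chain = row.get("CHAINNAME" + suffix, row.get("SEGNAME" + suffix))
    return (
        chain,
        int(row["RESID" + suffix]),
        row["RESNAME" + suffix],   # case-sensitive, used as-is
        row["ATOMNAME" + suffix],  # case-sensitive, used as-is
    )


# Hypothetical dipolar coupling entry; the values are illustrative only.
dc_row = {
    "RESID_I": 2, "RESNAME_I": "GLN", "ATOMNAME_I": "N",
    "RESID_J": 2, "RESNAME_J": "GLN", "ATOMNAME_J": "HN",
    "D": -3.2,
}
print(atom_key(dc_row, "_I"), atom_key(dc_row, "_J"))
```

Returning a tuple makes the key usable as a dictionary index, for example to match entries between two tables that refer to the same atoms.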
Specification of Amino Acid Sequence
Many NMRPipe applications, such as TALOS, use input data specifically for proteins. Many of these applications require that complete amino acid sequence information be included in an input table. This is commonly done by including a DATA FIRST_RESID line and one or more DATA SEQUENCE lines, as is typical for a chemical shift table.
The FIRST_RESID line gives the starting residue number of the sequence, which is assumed to be 1 if no FIRST_RESID line is given. The DATA SEQUENCE lines give the amino acid sequence in single-letter codes, with the code X commonly used for non-standard amino acids. In many examples, the amino acid codes are specified in groups of 10 for clarity. In practice, space characters in the amino acid sequence are ignored, so the amino acid codes can be grouped in any way, and any number of codes can be given in one DATA SEQUENCE line.
For convenience, the NMRPipe table utility getTabInfo.tcl can create DATA SEQUENCE text from the sequence of a PDB structure. For example, the command:

    getTabInfo.tcl -in 1UBQ.pdb -seqText

will produce output like this:

    DATA FIRST_RESID 1
    DATA SEQUENCE MQIFVKTLTG KTITLEVEPS DTIENVKAKI QDKEGIPPDQ QRLIFAGKQL
    DATA SEQUENCE EDGRTLSDYN IQKESTLHLV LRLRGG
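The rules for sequence lines (default first residue of 1, spaces ignored, any number of DATA SEQUENCE lines) can be sketched in a few lines of Python. This is an illustration only; `read_sequence` is a hypothetical helper, not an NMRPipe function.

```python
def read_sequence(text):
    """Map residue number -> one-letter code from DATA header lines."""
    first_resid = 1  # default when no DATA FIRST_RESID line is present
    codes = []
    for line in text.splitlines():
        fields = line.split()
        if fields[:2] == ["DATA", "FIRST_RESID"]:
            first_resid = int(fields[2])
        elif fields[:2] == ["DATA", "SEQUENCE"]:
            # Spaces between groups are ignored: join groups, then split per letter.
            codes.extend("".join(fields[2:]))
    return {first_resid + i: code for i, code in enumerate(codes)}


HEADER = """\
DATA FIRST_RESID 1
DATA SEQUENCE MQIFVKTLTG KTITLEVEPS DTIENVKAKI QDKEGIPPDQ QRLIFAGKQL
DATA SEQUENCE EDGRTLSDYN IQKESTLHLV LRLRGG
"""

seq = read_sequence(HEADER)
print(len(seq), seq[1], seq[48])
```

Keying the result by residue number makes it straightforward to look up the residue type for any RESID value in the table body.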
Applications for Manipulating Tables
NMRPipe includes several general-purpose applications for manipulating GDB-Format tables. There are applications to display tables, plot data from tables, sort and adjust table values, select a subset of entries according to a condition, and extract table values for use with other scripts.
Application | Use |
---|---|
delTab.tcl | Delete randomly or systematically selected rows. |
diffTab.tcl | Form the difference between values in two related tables. |
plotTab.tcl | Draw an XY Plot of two columns in a table. |
svdTab.tcl | Perform linear least squares on columns of a table. |
addTabNoise.tcl | Add random noise to data in a table. |
fitTab.tcl | Apply a fitting function to data from a table. |
addTabVar.tcl | Add new columns (variables) to a table. |
getTabCol.tcl | Get the values from a given table column. |
selectTab.tcl | Select and save table entries according to a condition. |
adjTab.tcl | Adjust or set values in a table. |
getTabInfo.tcl | Get information about a table. |
appendTab.tcl | Join two related tables. |
getTabRow.tcl | Get the values from a given table row. |
showTab.tcl | Display a table in an interactive viewer. |

For example, to extract and print all the chemical shift values in a chemical shift table:

    selectTab.tcl -in csObs.tab -var SHIFT

To extract shifts as above, but only for residues 10 to 20:

    selectTab.tcl -in csObs.tab -var SHIFT -cond "RESID >= 10 && RESID <= 20"

To extract only HN shifts:

    selectTab.tcl -in csObs.tab -var SHIFT -cond "strmatch( ATOMNAME, 'HN' )"

To list the variable names in a table:

    getTabInfo.tcl -in dCalcA.tab -parm varNames

To extract the DATA SAUPE values in a dipolar coupling calculation output:

    getTabInfo.tcl -in dCalcA.tab -key SAUPE -data 0

To add a new column of floating point data called W to an existing table, with an initial value of 100.0 in each row:

    addTabVar.tcl -in dc.tab -out dcw.tab -var W -float -fmt %7.3f -val 100.0

To catenate two PDB files and renumber their atom ID values:

    cat a.pdb b.pdb > ab.pdb
    adjTab.tcl -in ab.pdb -out new.pdb -pdb -renumber

To extract the first residue ID and the one-letter amino acid sequence from a protein PDB file:

    getTabInfo.tcl -in ref.pdb -seqInfo

To display protein sequence information from a PDB file and print it in the DATA SEQUENCE format used in NMRPipe tables:

    getTabInfo.tcl -in ref.pdb -seqText
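For readers who want to see the selection idea outside of NMRPipe, the Python sketch below mimics extracting one variable from rows that satisfy a condition. It does not reproduce how selectTab.tcl works internally; `select_values` is a hypothetical helper and the rows are made-up values, assumed to have been read into dictionaries already.

```python
def select_values(rows, var, cond=None):
    """Return row[var] for every row that satisfies cond (all rows if None)."""
    return [row[var] for row in rows if cond is None or cond(row)]


# Illustrative chemical shift entries; the values are not real data.
rows = [
    {"RESID": 9, "ATOMNAME": "HN", "SHIFT": 8.31},
    {"RESID": 10, "ATOMNAME": "HN", "SHIFT": 8.12},
    {"RESID": 10, "ATOMNAME": "CA", "SHIFT": 56.4},
    {"RESID": 21, "ATOMNAME": "HN", "SHIFT": 7.95},
]

# Analogous to: -cond "RESID >= 10 && RESID <= 20"
in_range = select_values(rows, "SHIFT", lambda r: 10 <= r["RESID"] <= 20)
# Analogous to selecting entries whose ATOMNAME is HN
hn_only = select_values(rows, "SHIFT", lambda r: r["ATOMNAME"] == "HN")
print(in_range, hn_only)
```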
PDB Files
As shown in the previous examples, many of the table manipulation applications above can accept input in the form of a PDB file. In this case, the variable names used to access the table are automatically set as if the PDB file had VARS and FORMAT lines. Since a PDB file is not space-delimited, data is extracted according to a range of character positions, where the first character in a line is position 1:
Variable Name | Format | Character Range |
---|---|---|
ATOMID | %d | 7 - 11 |
ATOMNAME | %s | 13 - 16 |
LOCID | %s | 17 - 17 |
RESNAME | %s | 18 - 21 |
CHAINID | %s | 22 - 22 |
RESID | %d | 23 - 26 |
ICODE | %s | 27 - 27 |
X | %.3f | 31 - 38 |
Y | %.3f | 39 - 46 |
Z | %.3f | 47 - 54 |
OCCUPANCY | %.2f | 55 - 60 |
TEMPFACTOR | %.2f | 61 - 66 |
SEGID | %s | 73 - 76 |
ELEMENT | %s | 77 - 78 |
CHARGE | %s | 79 - 80 |
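The character ranges above translate directly into string slices: a 1-based inclusive range `start - end` corresponds to the Python slice `[start - 1:end]`. The sketch below applies the ranges to one ATOM line. It is an illustration rather than NMRPipe's own code; `PDB_FIELDS` and `parse_pdb_atom` are hypothetical names, and the sample line is assembled piece by piece only to make the column positions explicit.

```python
# (converter, first column, last column), 1-based inclusive, from the table above.
PDB_FIELDS = {
    "ATOMID": (int, 7, 11), "ATOMNAME": (str, 13, 16), "LOCID": (str, 17, 17),
    "RESNAME": (str, 18, 21), "CHAINID": (str, 22, 22), "RESID": (int, 23, 26),
    "ICODE": (str, 27, 27), "X": (float, 31, 38), "Y": (float, 39, 46),
    "Z": (float, 47, 54), "OCCUPANCY": (float, 55, 60),
    "TEMPFACTOR": (float, 61, 66), "SEGID": (str, 73, 76),
    "ELEMENT": (str, 77, 78), "CHARGE": (str, 79, 80),
}


def parse_pdb_atom(line):
    """Extract the table variables from one fixed-width ATOM/HETATM line."""
    line = line.ljust(80)  # short lines are padded so every slice is valid
    record = {}
    for name, (conv, start, end) in PDB_FIELDS.items():
        text = line[start - 1:end].strip()
        record[name] = conv(text) if text else None  # blank field -> None
    return record


# Illustrative ATOM line, built from fixed-width pieces (80 characters total).
SAMPLE_LINE = (
    "ATOM  " "    1" " " " N  " " " "MET " "A" "   1" " " "   "
    "  27.340" "  24.430" "   2.614" "  1.00" "  9.67"
    "      " "    " " N" "  "
)

atom = parse_pdb_atom(SAMPLE_LINE)
print(atom["ATOMNAME"], atom["RESID"], atom["X"], atom["Y"], atom["Z"])
```

Averaging a coordinate over many such records then reduces to a sum over, for example, `atom["X"]` divided by the number of atoms.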
Given the above definitions, it is possible to manipulate a PDB file with NMRPipe's general-purpose table tools. For example, to find the average values of the X, Y, and Z coordinates of a PDB file, these commands:

    set xc = `getTabCol.tcl -in ref.pdb -pdb -var X`
    set yc = `getTabCol.tcl -in ref.pdb -pdb -var Y`
    set zc = `getTabCol.tcl -in ref.pdb -pdb -var Z`

    getStat.tcl -stat Avg -x $xc
    getStat.tcl -stat Avg -x $yc
    getStat.tcl -stat Avg -x $zc

would produce output like this:

    Avg 30.2337878049
    Avg 28.9899536585
    Avg 15.3499943089