Bio.Structure: Macromolecular Structures

The Bio.Structure module provides functionality to manipulate macromolecular structures, and in particular to read and write Protein Data Bank (PDB) files. It is designed to be used for standard structural analysis tasks, as well as acting as a platform on which others can build to create more specific tools.

Parsing PDB files

To download a PDB file:


To parse a PDB file into a Structure-Model-Chain-Residue-Atom framework:

julia> struc = read(filepath_1EN2, PDB)
Name                        -  1EN2.pdb
Number of models            -  1
Chain(s)                    -  A
Number of residues          -  85
Number of point mutations   -  5
Number of other molecules   -  5
Number of water molecules   -  76
Number of atoms             -  614
Number of hydrogens         -  0
Number of disordered atoms  -  27

The elements of struc can be accessed as follows:

Command Returns Return type
struc[1] Model 1 Model
struc[1]['A'] Model 1, chain A Chain
struc['A'] The lowest model (model 1), chain A Chain
struc['A']["50"] Model 1, chain A, residue 50 AbstractResidue
struc['A'][50] Shortcut to above if it is not a hetero residue and the insertion code is blank AbstractResidue
struc['A']["H_90"] Model 1, chain A, hetero residue 90 AbstractResidue
struc['A'][50]["CA"] Model 1, chain A, residue 50, atom name CA AbstractAtom
struc['A'][15]["CG"]['A'] For disordered atoms, access a specific location Atom

Disordered atoms are stored in a DisorderedAtom container but calls fall back to the default atom, so disorder can be ignored if you are not interested in it.

Disordered residues (i.e. point mutations with different residue names) are stored in a DisorderedResidue container.

The idea is that disorder will only bother you if you want it to. See the Biopython discussion for more.

Properties can be retrieved as follows:

Function Returns Return type
serial Serial number of an atom Int
atomname Name of an atom String
altlocid Alternative location ID of an atom Char
x x coordinate of an atom Float64
y y coordinate of an atom Float64
z z coordinate of an atom Float64
coords coordinates of an atom Array{Float64,1}
occupancy Occupancy of an atom (default is 1.0) Float64
tempfactor Temperature factor of an atom (default is 0.0) Float64
element Element of an atom (default is " ") String
charge Charge of an atom (default is " ") String
residue Residue an atom belongs to Residue
ishetero true if the residue or atom is a hetero residue/atom Bool
isdisorderedatom true if the atom is disordered Bool
pdbline PDB ATOM/HETATM record for an atom String
resname Residue name of a residue or atom String
resnumber Residue number of a residue or atom Int
inscode Insertion code of a residue or atom Char
resid Residue ID of an atom or residue (full=true includes chain) String
atomnames Atom names of the atoms in a residue, sorted by serial Array{String,1}
atoms Dictionary of atoms in a residue Dict{String, AbstractAtom}
isdisorderedres true if the residue has multiple residue names Bool
disorderedres Access a particular residue name in a DisorderedResidue Residue
chain Chain a residue or atom belongs to Chain
chainid Chain ID of a chain, residue or atom Char
resids Sorted residue IDs in a chain Array{String,1}
residues Dictionary of residues in a chain Dict{String, AbstractResidue}
model Model a chain, residue or atom belongs to Model
modelnumber Model number of a model, chain, residue or atom Int
chainids Sorted chain IDs in a model or structure Array{Char,1}
chains Dictionary of chains in a model or structure Dict{Char, Chain}
structure Structure a model, chain, residue or atom belongs to ProteinStructure
structurename Name of the structure an element belongs to String
modelnumbers Sorted model numbers in a structure Array{Int,1}
models Dictionary of models in a structure Dict{Int, Model}

The strip keyword argument determines whether surrounding whitespace is stripped for atomname, element, charge, resname and atomnames (default true).

The coordinates of an atom can be set using x!, y!, z! and coords!.

Manipulating structures

Elements can be looped over to reveal the sub-elements in the correct order:

for mod in struc
    for ch in mod
        for res in ch
            for at in res
                # Do something

Models are ordered numerically; chains are ordered by character, except the empty chain is last; residues are ordered by residue number and insertion code with hetero residues after standard residues; atoms are ordered by atom serial.

collect can be used to get arrays of sub-elements. collectatoms, collectresidues, collectchains and collectmodels return arrays of a particular type from a structural element or element array.

Selectors are functions passed as additional arguments to these functions. Only elements that return true when passed to the selector are retained.

Command Action Return type
collect(struc['A'][50]) Collect the sub-elements of an element, e.g. atoms from a residue Array{AbstractAtom,1}
collectresidues(struc) Collect the residues of an element Array{AbstractResidue,1}
collectatoms(struc) Collect the atoms of an element Array{AbstractAtom,1}
collectatoms(struc, calphaselector) Collect the C-alpha atoms of an element Array{AbstractAtom,1}
collectatoms(struc, calphaselector, disorderselector) Collect the disordered C-alpha atoms of an element Array{AbstractAtom,1}

It is easy to define your own atom, residue, chain or model selectors. The below will collect all atoms with x coordinate less than 0:

xselector(at::AbstractAtom) = x(at) < 0
collectatoms(struc, xselector)

Alternatively, you can use an anonymous function:

collectatoms(struc, at -> x(at) < 0)

countatoms, countresidues, countchains and countmodels can be used to count elements. For example:

julia> countatoms(struc)

julia> countatoms(struc, calphaselector)

julia> countresidues(struc, standardselector)

The sequence of a protein can be retrieved by passing a Chain or array of residues to AminoAcidSequence:

julia> AminoAcidSequence(struc['A'], standardselector)
85aa Amino Acid Sequence:

Writing PDB files

PDB format files can be written:

writepdb("1EN2_out.pdb", struc)

Any element type can be given as input to writepdb. Atom selectors can also be given as additional arguments:

writepdb("1EN2_out.pdb", struc, backboneselector)

Spatial calculations

Various functions are provided to calculate spatial quantities for proteins:

Command Returns
distance Minimum distance between two elements
sqdistance Minimum square distance between two elements
bondangle Angle between three atoms
dihedralangle Dihedral angle defined by four atoms
omegaangle Omega angle between a residue and the previous residue
phiangle Phi angle between a residue and the previous residue
psiangle Psi angle between a residue and the next residue
ramachandranangles Vectors of phi and psi angles of an element
contactmap Contact map of two element, or one element with itself
rmsd RMSD between two elements of the same size - assumes they are superimposed
displacements Vector of displacements between two elements of the same size - assumes they are superimposed

For example:

julia> distance(struc['A'][10], struc['A'][20])

julia> rad2deg(bondangle(struc['A'][50]["N"], struc['A'][50]["CA"], struc['A'][50]["C"]))

julia> rad2deg(dihedralangle(struc['A'][50]["N"], struc['A'][50]["CA"], struc['A'][50]["C"], struc['A'][51]["N"]))

julia> rad2deg(psiangle(struc['A'][50], struc['A'][51]))


A few further examples of Bio.Structure usage are given below.

A) To plot the temperature factors of a protein, if you have Gadfly installed:

using Gadfly
calphas = collectatoms(struc, calphaselector)
     Guide.xlabel("Residue number"),
     Guide.ylabel("Temperature factor"),

B) To print the PDB records for all C-alpha atoms within 5 Angstroms of residue 38:

for at in calphas
    if distance(struc['A'][38], at) < 5.0 && resnumber(at) != 38

D) To view the contact map of a structure:

cbetas = collectatoms(struc, cbetaselector)
contacts = contactmap(cbetas, 7.0)
for i in 1:length(cbetas)
    for j in 1:length(cbetas)
        if contacts[i,j]
            print(" ")

cbetaselector selects C-beta atoms, or C-alpha atoms for glycine residues. contactmap can also be given two structural elements as arguments, in which case a non-symmetrical 2D array is returned showing contacts between the elements.

E) To show the Ramachandran phi/psi angle plot of a structure, if you have Gadfly installed:

using Gadfly
phi_angles, psi_angles = ramachandranangles(struc, standardselector)
     Guide.xlabel("Phi / degrees"),
     Guide.ylabel("Psi / degrees"),

F) To calculate the RMSD and displacements between the heavy (non-hydrogen) atoms of two models in an NMR structure:

struc_nmr = read("1SSU.pdb", PDB)
rmsd(struc_nmr[5], struc_nmr[10], heavyatomselector)
displacements(struc_nmr[5], struc_nmr[10], heavyatomselector)