BioStructures documentation

The BioStructures.jl package provides functionality to manipulate macromolecular structures, and in particular to read and write Protein Data Bank (PDB), mmCIF and MMTF files. It is designed to be used for standard structural analysis tasks, as well as acting as a platform on which others can build to create more specific tools.

It compares favourably in terms of performance to other PDB parsers - see some benchmarks online and the benchmark suite. The PDB, mmCIF and MMTF parsers currently read in the whole PDB without explicit errors (with a handful of known exceptions). Help on individual functions can be found in the BioStructures API section or by using ?function_name from within Julia.


To download a PDB file:

using BioStructures

# Stored in the current working directory by default

To parse a PDB file into a Structure-Model-Chain-Residue-Atom framework:

julia> struc = read("/path/to/pdb/file.pdb", PDB)
ProteinStructure 1EN2.pdb with 1 models, 1 chains (A), 85 residues, 754 atoms

mmCIF files can be read into the same data structure with read("/path/to/cif/file.cif", MMCIF). The keyword argument gzip, default false, determines if the file is gzipped. If you want to read an mmCIF file into a dictionary to query yourself (e.g. to access metadata fields), use MMCIFDict:

julia> mmcif_dict = MMCIFDict("/path/to/cif/file.cif")
mmCIF dictionary with 716 fields

julia> mmcif_dict["_entity_src_nat.common_name"]
1-element Array{String,1}:
 "great nettle"

A MMCIFDict can be accessed in similar ways to a standard dictionary, and if necessary the underlying dictionary of MMCIFDict d can be accessed with d.dict. Note that the values of the dictionary are always an Array{String,1}, even if only one value was read in or the data is numerical.

MMTF files can be read into the same data structure with read("/path/to/mmtf/file.mmtf", MMTF). The keyword argument gzip, default false, determines if the file is gzipped. In a similar manner to mmCIF dictionaries, a MMTF file can be read into a dictionary with MMTFDict. The values of the dictionary are a variety of types depending on the MMTF specification. To convert a MMCIFDict or MMTFDict to the Structure-Model-Chain-Residue-Atom framework, use the ProteinStructure constructor, e.g. ProteinStructure(mmcif_dict).

The elements of struc can be accessed as follows:

CommandReturnsReturn type
struc[1]Model 1Model
struc[1]["A"]Model 1, chain AChain
struc[1]['A']Shortcut to above if the chain ID is a single characterChain
struc["A"]The lowest model (model 1), chain AChain
struc["A"]["50"]Model 1, chain A, residue 50AbstractResidue
struc["A"][50]Shortcut to above if it is not a hetero residue and the insertion code is blankAbstractResidue
struc["A"]["H_90"]Model 1, chain A, hetero residue 90AbstractResidue
struc["A"][50]["CA"]Model 1, chain A, residue 50, atom name CAAbstractAtom
struc["A"][15]["CG"]['A']For disordered atoms, access a specific locationAtom

Disordered atoms are stored in a DisorderedAtom container but calls fall back to the default atom, so disorder can be ignored if you are not interested in it. Disordered residues (i.e. point mutations with different residue names) are stored in a DisorderedResidue container. The idea is that disorder will only bother you if you want it to. See the Biopython discussion for more.

Properties can be retrieved as follows:

FunctionReturnsReturn type
serialSerial number of an atomInt
atomnameName of an atomString
altlocidAlternative location ID of an atomChar
altlocidsAll alternative location IDs in a DisorderedAtomArray{Char,1}
xx coordinate of an atomFloat64
yy coordinate of an atomFloat64
zz coordinate of an atomFloat64
coordscoordinates of an atomArray{Float64,1}
occupancyOccupancy of an atom (default is 1.0)Float64
tempfactorTemperature factor of an atom (default is 0.0)Float64
elementElement of an atom (default is " ")String
chargeCharge of an atom (default is " ")String
residueResidue an atom belongs toResidue
isheterotrue if the residue or atom is a hetero residue/atomBool
isdisorderedatomtrue if the atom is disorderedBool
pdblinePDB ATOM/HETATM record for an atomString
resnameResidue name of a residue or atomString
resnamesAll residue names in a DisorderedResidueArray{String,1}
resnumberResidue number of a residue or atomInt
sequentialresiduesDetermine if the second residue follows the first in sequenceBool
inscodeInsertion code of a residue or atomChar
residResidue ID of an atom or residue (full=true includes chain)String
atomnamesAtom names of the atoms in a residue, sorted by serialArray{String,1}
atomsDictionary of atoms in a residueDict{String,AbstractAtom}
isdisorderedrestrue if the residue has multiple residue namesBool
disorderedresAccess a particular residue name in a DisorderedResidueResidue
chainChain a residue or atom belongs toChain
chainidChain ID of a chain, residue or atomString
residsSorted residue IDs in a chainArray{String,1}
residuesDictionary of residues in a chainDict{String,AbstractResidue}
modelModel a chain, residue or atom belongs toModel
modelnumberModel number of a model, chain, residue or atomInt
chainidsSorted chain IDs in a model or structureArray{String,1}
chainsDictionary of chains in a model or structureDict{String,Chain}
structureStructure a model, chain, residue or atom belongs toProteinStructure
structurenameName of the structure an element belongs toString
modelnumbersSorted model numbers in a structureArray{Int,1}
modelsDictionary of models in a structureDict{Int,Model}

The strip keyword argument determines whether surrounding whitespace is stripped for atomname, element, charge, resname and atomnames (default true).

The coordinates of an atom can be set using x!, y!, z! and coords!.

Manipulating structures

Elements can be looped over to reveal the sub-elements in the correct order:

for mod in struc
    for ch in mod
        for res in ch
            for at in res
                # Do something

Models are ordered numerically; chains are ordered by chain ID character ordering, except the empty chain is last; residues are ordered by residue number and insertion code with hetero residues after standard residues; atoms are ordered by atom serial. If you want the first sub-element you can use first. For example first(struc[1]) gets the first chain in model 1. Since the ordering of elements is defined you can use the sort function. For example sort(res) sorts a list of residues as described above, or sort(res, by=resname) will sort them alphabetically by residue name.

collect can be used to get arrays of sub-elements. collectatoms, collectresidues, collectchains and collectmodels return arrays of a particular type from a structural element or element array. Since most operations should use a single version of an atom or residue, disordered entities are not expanded by default and only one entity is present in the array. This can be changed by setting expand_disordered to true in collectatoms or collectresidues.

Selectors are functions passed as additional arguments to these functions. Only elements that return true when passed to all the selector are retained. For example:

CommandActionReturn type
collect(struc['A'][50])Collect the sub-elements of an element, e.g. atoms from a residueArray{AbstractAtom,1}
collectresidues(struc)Collect the residues of an elementArray{AbstractResidue,1}
collectatoms(struc)Collect the atoms of an elementArray{AbstractAtom,1}
collectatoms(struc, calphaselector)Collect the Cα atoms of an elementArray{AbstractAtom,1}
collectatoms(struc, calphaselector, disorderselector)Collect the disordered Cα atoms of an elementArray{AbstractAtom,1}

The selectors available are:

FunctionActs onSelects for
standardselectorAbstractAtom or AbstractResidueAtoms/residues arising from standard (ATOM) records
heteroselectorAbstractAtom or AbstractResidueAtoms/residues arising from hetero (HETATM) records
atomnameselectorAbstractAtomAtoms with atom name in a given list
calphaselectorAbstractAtomCα atoms
cbetaselectorAbstractAtomCβ atoms, or Cα atoms for glycine residues
backboneselectorAbstractAtomAtoms in the protein backbone (CA, N, C and O)
heavyatomselectorAbstractAtomNon-hydrogen atoms
hydrogenselectorAbstractAtomHydrogen atoms
resnameselectorAbstractAtom or AbstractResidueAtoms/residues with residue name in a given list
waterselectorAbstractAtom or AbstractResidueAtoms/residues with residue name HOH
notwaterselectorAbstractAtom or AbstractResidueAtoms/residues with residue name not HOH
disorderselectorAbstractAtom or AbstractResidueAtoms/residues with alternative locations
allselectorAbstractAtom or AbstractResidueAll atoms/residues

To create a new atomnameselector or resnameselector:

cdeltaselector(at::AbstractAtom) = atomnameselector(at, ["CD"])

It is easy to define your own atom, residue, chain or model selectors. The below will collect all atoms with x coordinate less than 0:

xselector(at) = x(at) < 0
collectatoms(struc, xselector)

Alternatively, you can use an anonymous function:

collectatoms(struc, at -> x(at) < 0)

countatoms, countresidues, countchains and countmodels can be used to count elements with the same selector API. For example:

julia> countatoms(struc)

julia> countatoms(struc, calphaselector)

julia> countresidues(struc, standardselector)

julia> countatoms(struc, expand_disordered=true)

The amino acid sequence of a protein can be retrieved by passing an element to LongAminoAcidSeq with optional residue selectors:

julia> LongAminoAcidSeq(struc['A'], standardselector)
85aa Amino Acid Sequence:

The gaps keyword argument determines whether to add gaps to the sequence based on missing residue numbers (default true). threeletter_to_aa provides a lookup table of amino acid codes should that be required. See BioSequences.jl and BioAlignments.jl for more on how to deal with sequences. For example, to see the alignment of CDK1 and CDK2 (this example also makes use of Julia's broadcasting):

julia> struc1, struc2 = retrievepdb.(["4YC6", "1HCL"])
2-element Array{ProteinStructure,1}:
 ProteinStructure 4YC6.pdb with 1 models, 8 chains (A,B,C,D,E,F,G,H), 1420 residues, 12271 atoms
 ProteinStructure 1HCL.pdb with 1 models, 1 chains (A), 294 residues, 2546 atoms

julia> seq1, seq2 = LongAminoAcidSeq.([struc1["A"], struc2["A"]], standardselector, gaps=false)
2-element Array{BioSequences.LongSequence{BioSequences.AminoAcidAlphabet},1}:

julia> using BioAlignments

julia> scoremodel = AffineGapScoreModel(BLOSUM62, gap_open=-10, gap_extend=-1);

julia> alres = pairalign(GlobalAlignment(), seq1, seq2, scoremodel)
  score: 945
           ||   | ||||||||||||| | | || ||| |||| |    |||||||||||||||| |

           |||| | ||      ||| ||||  ||||  | |   |      | |||| | |||  ||

           || ||||||||||||||   | ||||||||||||||    ||||||||||| || |||

            ||| ||||| | ||||  |   || ||||||||||||| ||||   ||| | |  |||

            ||||        |  ||| |  ||| || ||| |||| | || || | |

In fact, pairalign is extended to carry out the above steps and return the alignment by calling pairalign(struc1["A"], struc2["A"], standardselector) in this case. scoremodel and aligntype are keyword arguments with the defaults shown above.

Spatial calculations

Various functions are provided to calculate spatial quantities for proteins:

distanceMinimum distance between two elements
sqdistanceMinimum square distance between two elements
coordarrayAtomic coordinates in Å of an element as a 2D Array with each column corresponding to one atom
bondangleAngle between three atoms
dihedralangleDihedral angle defined by four atoms
omegaangleOmega dihedral angle between a residue and the previous residue
phianglePhi dihedral angle between a residue and the previous residue
psianglePsi dihedral angle between a residue and the next residue
omegaanglesVector of omega dihedral angles of an element
phianglesVector of phi dihedral angles of an element
psianglesVector of psi dihedral angles of an element
ramachandrananglesVectors of phi and psi angles of an element
ContactMapContactMap of two elements, or one element with itself
DistanceMapDistanceMap of two elements, or one element with itself
showcontactmapPrint a representation of a ContactMap to stdout or a specified IO instance
TransformationThe 3D transformation to map one set of coordinates onto another
applytransform!Modify all coordinates in an element according to a transformation
applytransformModify coordinates according to a transformation
superimpose!Superimpose one element onto another
rmsdRMSD between two elements, with or without superimposition
displacementsVector of displacements between two elements, with or without superimposition
MetaGraphConstruct a MetaGraph of contacting elements

The omegaangle, phiangle and psiangle functions can take either a pair of residues or a chain and a position. The omegaangle and phiangle functions measure the angle between the residue at the given index and the one before. The psiangle function measures between the given index and the one after. For example:

julia> distance(struc['A'][10], struc['A'][20])

julia> rad2deg(bondangle(struc['A'][50]["N"], struc['A'][50]["CA"], struc['A'][50]["C"]))

julia> rad2deg(dihedralangle(struc['A'][50]["N"], struc['A'][50]["CA"], struc['A'][50]["C"], struc['A'][51]["N"]))

julia> rad2deg(psiangle(struc['A'][50], struc['A'][51]))

julia> rad2deg(psiangle(struc['A'], 50))

ContactMap takes in a structural element or a list, such as a Chain or Vector{Atom}, and returns a ContactMap object showing the contacts between the elements for a specified distance. ContactMap can also be given two structural elements as arguments, in which case a non-symmetrical 2D array is returned showing contacts between the elements. The underlying BitArray{2} for ContactMap contacts can be accessed with if required.

julia> contacts = ContactMap(collectatoms(struc['A'], cbetaselector), 8.0)
Contact map of size (85, 85)

A plot recipe is defined for this so it can shown with Plots.jl:

using Plots


For a quick, text-based representation of a ContactMap use showcontactmap.

DistanceMap works in an analogous way to ContactMap and gives a map of the distances. It can also be plotted:

dists = DistanceMap(collectatoms(struc['A'], cbetaselector))
using Plots


Structural elements can be superimposed, and superposition-dependent properties such as the RMSD can be calculated. To carry out superimposition, BioStructures.jl carries out a sequence alignment and superimposes aligned residues using the Kabsch algorithm. For example:

# Change the coordinates of element 1 to superimpose it onto element 2
# Do sequence alignment with standard residues and calculate the transformation with Cα atoms (the default)
superimpose!(el1, el2, standardselector)

# The transformation object for the above superimposition
Transformation(el1, el2, standardselector)

# Calculate the transformation with backbone atoms
superimpose!(el1, el2, standardselector, alignatoms=backboneselector)

# Calculate RMSD on Cα atoms (the default) after superimposition
rmsd(el1, el2, standardselector)

# Superimpose based on backbone atoms and calculate RMSD based on Cβ atoms
rmsd(el1, el2, standardselector, alignatoms=backboneselector, rmsdatoms=cbetaselector)

# Do not do a superimposition - assumes the elements are already superimposed
rmsd(el1, el2, standardselector, superimpose=false)

displacements is used in a similar way to rmsd but returns the vector of distances for each superimposed atom. dispatoms selects the atoms to calculate the displacements on.

These transformation functions may be useful beyond the context of protein structures. For example, Transformation(c1, c2) calculates the transformation to map one set of coordinates to another. The coordinate sets must be the same size and have the number of dimensions in the first axis and the number of points in the second axis.

The contacting elements in a molecular structure form a graph, and this can be retrieved using MetaGraph. This extends MetaGraph from MetaGraphs.jl, allowing you to use all the graph analysis tools in LightGraphs.jl. For example:

julia> mg = MetaGraph(collectatoms(struc["A"], cbetaselector), 8.0)
{85, 423} undirected Int64 metagraph with Float64 weights defined by :weight (default weight 1.0)

julia> using LightGraphs, MetaGraphs

julia> nv(mg)

julia> ne(mg)

julia> get_prop(mg, :contactdist)

julia> mg[10, :element]
Atom CB with serial 71, coordinates [-3.766, 4.031, 23.526]

See the LightGraphs docs for details on how to calculate properties such as shortest paths, centrality measures, community detection and more. Similar to ContactMap, contacts are found between any element type passed in. So if you wanted the graph of chain contacts in a protein complex you could give a Model as the first argument.

Downloading PDB files

To download a PDB file to a specified directory:

downloadpdb("1EN2", dir="path/to/pdb/directory")

To download multiple PDB files to a specified directory:

downloadpdb(["1EN2", "1ALW", "1AKE"], dir="path/to/pdb/directory")

To download a PDB file in PDB, XML, mmCIF or MMTF format use the format argument:

# To get mmCIF
downloadpdb("1ALW", dir="path/to/pdb/directory", format=MMCIF)

# To get XML
downloadpdb("1ALW", dir="path/to/pdb/directory", format=PDBXML)

To apply a function to a downloaded file and delete the file afterwards:

downloadpdb(f, "1ALW")

Or, using Julia's do syntax:

downloadpdb("1ALW") do fp
    s = read(fp, PDB)
    # Do something

Note that some PDB entries, e.g. large viral assemblies, are not available as PDB format files. In this case download the mmCIF file or MMTF file instead.

Reading PDB files

To parse an existing PDB file into a Structure-Model-Chain-Residue-Atom framework:

julia> struc = read("/path/to/pdb/file.pdb", PDB)
ProteinStructure 1EN2.pdb with 1 models, 1 chains (A), 85 residues, 754 atoms

Read a mmCIF/MMTF file instead by replacing PDB with MMCIF/MMTF. Various options can be set through optional keyword arguments when parsing PDB/mmCIF/MMTF files:

Keyword ArgumentDescription
structure_name::AbstractStringThe name given to the returned ProteinStructure; defaults to the file name
remove_disorder::Bool=falseWhether to remove atoms with alt loc ID not ' ' or 'A'
read_std_atoms::Bool=trueWhether to read standard ATOM records
read_het_atoms::Bool=trueWhether to read HETATOM records
gzip::Bool=falseWhether the file is gzipped (MMTF and mmCIF files only)

Use retrievepdb to download and parse a PDB file into a Structure-Model-Chain-Residue-Atom framework in a single line:

julia> struc = retrievepdb("1ALW", dir="path/to/pdb/directory")
INFO: Downloading PDB: 1ALW
ProteinStructure 1ALW.pdb with 1 models, 2 chains (A,B), 346 residues, 2928 atoms

If you prefer to work with data frames rather than the data structures in BioStructures, the DataFrame constructor from DataFrames.jl has been extended to construct relevant data frames from lists of atoms or residues:

julia> df = DataFrame(collectatoms(struc));

julia> first(df, 3)
3×17 DataFrame. Omitted printing of 5 columns
│ Row │ ishetero │ serial │ atomname │ altlocid │ resname │ chainid │ resnumber │ inscode │ x       │ y       │ z       │ occupancy │
│     │ Bool     │ Int64  │ String   │ Char     │ String  │ String  │ Int64     │ Char    │ Float64 │ Float64 │ Float64 │ Float64   │
│ 1   │ false    │ 1      │ N        │ ' '      │ GLU     │ A       │ 94        │ ' '     │ 15.637  │ -47.066 │ 18.179  │ 1.0       │
│ 2   │ false    │ 2      │ CA       │ ' '      │ GLU     │ A       │ 94        │ ' '     │ 14.439  │ -47.978 │ 18.304  │ 1.0       │
│ 3   │ false    │ 3      │ C        │ ' '      │ GLU     │ A       │ 94        │ ' '     │ 14.141  │ -48.183 │ 19.736  │ 1.0       │

julia> df = DataFrame(collectresidues(struc));

julia> first(df, 3)
3×8 DataFrame
│ Row │ ishetero │ resname │ chainid │ resnumber │ inscode │ countatoms │ modelnumber │ isdisorderedres │
│     │ Bool     │ String  │ String  │ Int64     │ Char    │ Int64      │ Int64       │ Bool            │
│ 1   │ false    │ GLU     │ A       │ 94        │ ' '     │ 9          │ 1           │ false           │
│ 2   │ false    │ GLU     │ A       │ 95        │ ' '     │ 9          │ 1           │ false           │
│ 3   │ false    │ VAL     │ A       │ 96        │ ' '     │ 7          │ 1           │ false           │

As with file writing disordered entities are expanded by default but this can be changed by setting expand_disordered to false.

Reading multiple mmCIF data blocks

You can read and write files containing multiple mmCIF data blocks (equivalent to a MMCIFDict in this package) with the readmultimmcif and writemultimmcif functions. An example of such a file is the PDB's chemical component dictionary.

julia> ccd = readmultimmcif("components.cif.gz"; gzip=true);

julia> ccd["2W4"]
mmCIF dictionary with 64 fields

Writing PDB files

PDB format files can be written:

writepdb("1EN2_out.pdb", struc)

Any element type can be given as input to writepdb. The first argument can also be a stream. Atom selectors can also be given as additional arguments:

# Only backbone atoms are written out
writepdb("1EN2_out.pdb", struc, backboneselector)

To write mmCIF format files, use the writemmcif function with similar arguments. The gzip keyword argument, default false, determines whether to gzip the written file. A MMCIFDict can also be written using writemmcif:

writemmcif("1EN2_out.dic", mmcif_dict)

To write out a MMTF file, use the writemmtf function with any element type or a MMTFDict as an argument. The gzip keyword argument, default false, determines whether to gzip the written file.

Unlike for the collection functions, expand_disordered is set to true when writing files as it is usually desirable to retain all entities. Set expand_disordered to false to not write out more than one atom or residue at each location.

Multi-character chain IDs can be written to mmCIF and MMTF files but will throw an error when written to a PDB file as the PDB file format only has one character allocated to the chain ID.

If you want the PDB record line for an AbstractAtom, use pdbline. For example:

julia> pdbline(at)
"HETATM  101  C  A  X B  20      10.500  20.123  -5.123  0.50 50.13           C1+"

If you want to generate a PDB record line from values directly, do so using an AtomRecord:

julia> pdbline(AtomRecord(false, 669, "CA", ' ', "ILE", "A", 90, ' ', [31.743, 33.11, 31.221], 1.00, 25.76, "C", ""))
"ATOM    669  CA  ILE A  90      31.743  33.110  31.221  1.00 25.76           C  "

This can be useful when writing PDB files from your own data structures.

RCSB PDB utility functions

To get the list of all PDB entries:

l = pdbentrylist()

To download the entire RCSB PDB database in your preferred file format:

downloadentirepdb(dir="path/to/pdb/directory", format=MMTF)

This operation takes a lot of disk space and time to complete, depending on internet connection.

To update your local PDB directory based on the weekly status list of new, modified and obsolete PDB files from the RCSB server:

updatelocalpdb(dir="path/to/pdb/directory", format=MMTF)

Obsolete PDB files are stored in the auto-generated obsolete directory inside the specified local PDB directory.

To maintain a local copy of the entire RCSB PDB database, run the downloadentirepdb function once to download all PDB files and set up a CRON job or similar to run updatelocalpdb function once a week to keep the local PDB directory up to date with the RCSB server.

There are a few more functions that may be useful:

FunctionReturnsReturn type
pdbstatuslistList of PDB entries from a specified RCSB weekly status list URLArray{String,1}
pdbrecentchangesAdded, modified and obsolete PDB lists from the recent RCSB weekly status filesTuple{Array{String,1},Array{String,1}, Array{String,1}}
pdbobsoletelistList of all obsolete PDB entriesArray{String,1}
downloadallobsoletepdbDownloads all obsolete PDB files from the RCSB PDB serverArray{String,1}

Visualising structures

The Bio3DView.jl package can be used to visualise molecular structures. For example:

using Bio3DView
using Blink

Use the mouse to move and zoom the images. You can view BioStructures data types:

struc = retrievepdb("1AKE")
viewstruc(struc['A'], surface=Surface(Dict("colorscheme"=> "greenCarbon")))

See the Bio3DView.jl tutorial for more information.

Other packages in the Julia ecosystem that deal with structural bioinformatics or related fields include: