Cookbook: Using PDB files with the OpenFF Toolkit

PDB files are a common way to represent biopolymer structures, but they don’t natively contain all of the chemical information required to construct Molecule objects. However, PDB files are used widely in the molecular simulation community, so OpenFF has implemented functionality to load them.

The recommended pathway for loading PDB files is Topology.from_pdb. Additionally, the PDB file must have the following characteristics:

For protein atoms (ATOM records):

  • Atom names and residue names must be consistent with the Chemical Components Dictonary

  • Only the 20 canonical amino acids are supported by default (including protonated/deprotonated variants)

  • All hydrogens must be explicit

For small molecules, waters, and ions (HETATM records):

  • The elemental symbol of each atom must be identified according to the PDB spec in text columns 77 and 78

  • Bonds must be identified in the CONECT records at the bottom of the file

  • All unique small molecules must be identified in the unique_molecules keyword argument

Information

OpenFF Topology and Molecule objects store much more information than is in the PDB files they might be generated from. Don’t be surprised if this Python code takes a few more seconds to load a PDB file than other tools.

PDB file with only a (single) protein

6hvi_prepared.pdb is protein from the Merck free energy perturbation study. It was prepared for simulation using commercial tools by Schrodinger, and doesn’t have any bulk water, ions, small molecules, or a box.

protein = Topology.from_pdb("6hvi_prepared.pdb")

protein.visualize()

From here, you might use Packmol or PDBFixer to add water or other solvents without leaving Python (or there are tools available for water/solvent packing in AMBER, GROMACS, and many commercial packages). An example of using PDBFixer can be found in our “Toolkit Showcase” example.

PDB file with a protein, waters, and ions

This PDB file comes from the AMBER tutorial on making solvated simulations. We have slightly modified the procedure to create a file suitable for loading into the OpenFF Toolkit:

  • TIP3P water is used instead of OPC (a four-site model)

  • a cubic solvent box is used instead of an octohedron

  • the N terminus of the protein is made neutral

ramp1 = Topology.from_pdb("RAMP1_solv_box_ions.pdb")

ramp1.visualize()

PDB file with one or more ligands

The following PDB is taken from the GROMACS protein-ligand tutorial (with manually-added CONECT records) and contains a protein, water, and a small molecule (ligand).

While the chemistry of water, ions, and proteins can be deduced from known templates (this is what happened under the hood in the previous two examples), small molecules are trickier. The number of amino acids and ions is relatively small, but the number of possible small molecules is tremendous so storing them in a database is not feasible. We instead ask the user to fill in the missing information that would be looked up from the CCD or hard-coded heuristics, specifically the bond order and stereochemistry. This information is stored in a Molecule object(s), which we ask the user to provide via the unique_molecules argument.

If you try to load a PDB file containing a small molecule, without providing the chemistry of the small molecule, you’ll get an error message identifying the atoms that couldn’t be loaded:

Topology.from_pdb("gromacs_solv_complex.pdb")
UnassignedChemistryInPDBError: Some bonds or atoms in the input could not be identified.

Hint: The following residue names with unassigned atoms were not found in the substructure library. While the OpenFF Toolkit identifies residues by matching chemical substructures rather than by residue name, it currently only supports the 20 'canonical' amino acids.
    JZ4


Hint: The following residues were assigned names that do not match the residue name in the input, or could not be assigned residue names at all. This may indicate that atoms are missing from the input or some other error. The OpenFF Toolkit requires all atoms, including hydrogens, to be explicit in the input to avoid ambiguities in protonation state or bond order:
    Input residue X:JZ4#0001 contains atoms matching substructures {'No match'}

Error: The following 22 atoms exist in the input but could not be assigned chemical information from the substructure library:
    Atom  2614 (C4) in residue X:JZ4#0001
    Atom  2615 (C7) in residue X:JZ4#0001
    Atom  2616 (C8) in residue X:JZ4#0001
    Atom  2617 (C9) in residue X:JZ4#0001
    Atom  2618 (C10) in residue X:JZ4#0001
    Atom  2619 (C11) in residue X:JZ4#0001
    Atom  2620 (C12) in residue X:JZ4#0001
    Atom  2621 (C13) in residue X:JZ4#0001
    Atom  2622 (C14) in residue X:JZ4#0001
    Atom  2623 (OAB) in residue X:JZ4#0001
    Atom  2624 (H1) in residue X:JZ4#0001
    Atom  2625 (H2) in residue X:JZ4#0001
    Atom  2626 (H3) in residue X:JZ4#0001
    Atom  2627 (H4) in residue X:JZ4#0001
    Atom  2628 (H5) in residue X:JZ4#0001
    Atom  2629 (H6) in residue X:JZ4#0001
    Atom  2630 (H7) in residue X:JZ4#0001
    Atom  2631 (H8) in residue X:JZ4#0001
    Atom  2632 (H9) in residue X:JZ4#0001
    Atom  2633 (H10) in residue X:JZ4#0001
    Atom  2634 (H11) in residue X:JZ4#0001
    Atom  2635 (H12) in residue X:JZ4#0001

Any OpenFF Molecule object with the appropriate atoms and connectivity can be used to identify the small molecule when loading the PDB file. See the Molecule Cookbook for some of the numerous ways to create them! Since the coordinates are specified in the PDB file, it’s not necessary that this Molecule has conformers.

In this case we know the identity of the ligand and can look up a SMILES string, which is all that’s needed to load this PDB file! Since there are many molecules in this file, it may take a few tens of seconds to load and visualize - this is expected and is a result of carefully checking and applying chemical information.

jz4 = Molecule.from_smiles("CCCC1=CC=CC=C1O")
complex = Topology.from_pdb("gromacs_solv_complex.pdb", unique_molecules=[jz4])

complex.visualize()