Investigate APO and HOLO malaria targets

Introduction

This guide will provide a brief overview of the APO (without ligand) and HOLO (with ligand) states of protein kinase G (PKG), a key regulatory enzyme found in the malaria parasite Plasmodium. In Plasmodium vivax, a species responsible for severe forms of malaria in humans, PKG plays a critical role in various stages of the parasite’s lifecycle, including its development within the mosquito vector and its replication within the human host’s red blood cells. PKG is involved in signalling pathways that regulate essential processes such as parasite invasion of host cells, egress from infected cells, and differentiation into sexual forms for transmission to mosquitoes. Additionally, PKG has been identified as a potential target for antimalarial drug development due to its essential role in parasite survival and its divergence from human homologs, making it an attractive candidate for selective inhibition.

Several experimental structures have been deposited for PKG with UNIPROT A5K0N4. The PDB IDs user in this tutorial are 5DYL for the APO form, 5DZC for the HOLO form bound to a non-hydrolyzable analog of adenosine tri-phosphate (here named ANP), 5F0A for the HOLO form bound to an inhibitor (here named 1FB), and 5FET for the HOLO form bound to another inhibitor (here named 1TR). The structures of the ligands are shown in Figure 1.

Figure 1: 2D representation of ANP, 1TR, and 1FB ligand molecules.

The exact role of these molecules when targeting PKG is not strictly relevant for this tutorial. If you are interested you can read the corresponding publication (El Bakkouri et al. 2019) from which the APO and HOLO + ANP structures were taken, while the other two come from results still under peer revision and, therefore, unpublished. Differently from GPCRs in the other exercise, PKG is not a membrane protein, and as such it can be simulated in a box of water, analogously to what done during the Lysozyme tutorial.

A look into the starting files

First of all, start by downloading the exercise directories from the github repository.

Download PKG
APO

Download PKG
HOLO - 1FB

Download PKG
HOLO - 1TR

Download PKG
HOLO - ANP

We provide four different systems. These are always the same protein (PKG) but in different states, i.e., ligand-less (APO) or ligand-bound (HOLO) with three different molecules. You should download and run simulations of at least two of them, one being the APO and one a HOLO so that you can compare the structures at the end of the molecular dynamics simulation. You can unzip the downloaded directories with the command

tar -xvf name_of_compressed_file.tar.gz

The individual directories of the different protein/ligand systems are organized as in the following

PKG_APO/HOLO_X_excercise
│   ionize.mdp
│   reference_topology_PKG_XXX.gro
│   reference_topology_PKG_XXX.top
│   sbatch_me.sh
│
└─── forcefield
│   │   [various forcefields files]
│
└─── step1_em
│   │   em.mdp
│
└─── step2_nvt
│   │   [various nvt.mdp files]
│
└─── step3_npt
│   │   [various npt.mdp files]
│
└─── step4_prod
    │   prod.mdp

Going by order, you can see

The ionize.mdp, reference_topology_PKG_XXX.gro, and reference_topology_PKG_XXX.top files. These are the building blocks to set up your starting configuration. These will be covered in depth later in the tutorial.
A sbatch_me.sh file. This is a text file that contains the set of instructions that will run your starting configuration through energy minimization, NVT, NPT, and eventually the production phase.
A forcefield directory. Inside this, you can find several text files that define ‘numerically’ the key components of your system. You might recognize some of them, like PKG.itp, the water model tip3.itp, and the ligands 1FB.itp, 1TR.itp, and ANP.itp. You can peek inside them and take a look at how a molecule is described numerically in MD simulations. The format of these files - that is, how these numbers are organized and reported - depends on the software you are using. Since in this course you use GROMACS, these files are written in GROMACS format. However, while these numbers and these files might be different for other software, the core idea of numerically representing with specific functions the intra- and inter-molecular interactions is paramount in molecular dynamics.
A solutions_files directory. Here there are the start.gro and topol.top files that you should obtain at the end of this tutorial. If you want to run the simulations instantly, then move the files from this directory directly to PKG_APO/HOLO_X_excercise and go at the last sections of this tutorial about running the simulation.
The step1_em, step2_nvt, and step3_npt directories. Inside each of these you can find one or more mdp (molecular dynamics parameters) files that will be used to compile and run the energy minimization, the NVT, and the NPT equilibrations. These will be run sequentially by the script sbatch_me.sh, taking your starting configuration through a set of equilibration phases to relax the starting configuration gradually and avoid the system exploding.
The step4_prod directory. Here there is the prod.mdp file, which is the final mdp file for the production run of the system. This is usually the longest part of the simulation, which can take 3 to 5 days on Baobab depending on the system. The final output files of the simulation will be inside this directory, and the analysis for the exam will mainly revolve around the prod.xtc trajectory file.

The idea of this tutorial is to give you the main ingredients to build your own solvated simulation box without going through the hassle of finding a good target to simulate, fix the experimental structure files, find a consistent force field to describe the system, and embedding the protein in the membrane. Considering this, inside the main directory you can find a copy of three files which will be the basic blocks to build the system, namely

ionize.mdp, the mdp file to add the ions in the system;
reference_topology_PKG_XXX.gro, the starting structure that contains the structure of PKG in its APO form or in its HOLO form alongside the corresponding docked ligand;
reference_topology_PKG_XXX.top, the starting topology of the system.

From these, and by using as reference the Lysozyme in water tutorial for solvation and ion addition, you should be able to obtain two files, the starting configuration of the solvated protein, which you will name start.gro, and the updated topology of the system, which you will name topol.top. These files will be then picked up by the sbatch_me.sh script and used as starting point to run the equilibration and the production runs of the system.

Note

This tutorial shows the set-up of the simulation box by using as example the HOLO+ANP structure. Nevertheless, the logic of the procedure is identical and can be applied directly to all the other configurations as well.

Preparing the HPC environment

Login into Baobab and send the directory with the exercises to your home in Baobab (with scp), or directly download it there (with wget). Then, request a node with and interactive job for a couple of hours with salloc in the following way

salloc --ntasks=1 --cpus-per-task=4 --partition=private-gervasio-cpu --time=120:00

Finally, source the GROMACS installation

module load GCC/11.3.0
module load OpenMPI/4.1.4
module load GROMACS/2023.1-CUDA-11.7.0

and verify that the sourcing was okay by typing

gmx --version

You are now ready to assemble the starting configuration of your system.

Hardware

Notice how here you are requesting 4 CPUs but no GPU, differently from the HPC tutorial (in fact, you are on --partition=private-gervasio-cpu). In this case, you are going to use the allocation on Baobab just to set up the box and not to run the simulation, so the GPU is not needed.

Read the manual

GROMACS has a lot (> 100) of tools that are accessible by typing gmx followed by a keyword. In this tutorial you will use solvate, grompp, genion, and make_ndx. If you have any doubts remember that you can look online for the explanation of the tool and which flags are needed (-f, -o, -s etc.). For example, this is the manual page of gmx solvate. All this information is also available on the spot if you type -h or --h (for help) after the tool’s name, e.g., gmx solvate --h.

Building the starting box

At the beginning, you can take a look at the starting topology (reference_topology_PKG_HOLO_ANP.top) and the starting configuration to solvate (reference_topology_PKG_HOLO_ANP.gro). The topology reads like this

#include "./forcefield/forcefield.itp"
#include "./forcefield/PKG.itp"
#include "./forcefield/ANP.itp"
#include "./forcefield/tip3.itp"
#include "./forcefield/ions.itp"

[ system ]
PKG HOLO with ANP and MG

[ molecules ]
; Compound         #mols
PKG                    1
MG                     1
ANP                    1
SOL                    2

First of all, notice that there are lines starting with a semi-column ;. These are comments, that is, lines that are not read by GROMACS. Comments are used to annotate the files and write down details that are helping you - the user - remember what you are doing, what is the meaning of some variables, etc. They are ignored by the software and should be informative for the person writing or for the person supposed to read the code. Feel free to write your own comments, if it helps you, just remember to put a ; at the beginning of each line that you do not want to be read by GROMACS, otherwise you will receive a fatal error.

Following up, the #include statements tell GROMACS where to find the elements of the force field. As can be seen, they are collected inside the forcefield directory and must appear with a specific order. First, the set of parameters defining the force field (forcefield.itp). Then, the definition of the individual molecules (PKG.itp, ANP.itp, etc.) that will populate you system. These do not have to appear in a specific order, however all the molecules that you intend to use must be defined here. Building the topology, that is, filling in this file, is roughly the equivalent of running gmx pdb2gmx on the pdb file of the protein, as you did in the Lysozyme tutorial. In this case the system is much more complex and the pdb needs further handling. After the #include statements, there is the [ system ] section, which is simply the name of the system. It is worth giving the system a meaningful name to help in recognising the systems in the future.

Finally, there is the [ molecules ] section, which must contain all the molecules of the system in the order in which they appear in the box. Notice that, for the time being, the topology contains one protein (PKG), one magnesium ion (MG), one ligand (ANP), and two water molecules (SOL). This is unique for the HOLO system with ANP, because the ANP ligand is an ADP analogue which is coordinated to a magnesium ion together with two water molecules. As such, the whole complex ANP + MG + 2 H₂O is the correct physiological state of the HOLO state of this protein when bound to ANP. The ligand doesn’t bind without the coordinating water molecules and magnesium ion.

GROMACS (and other modelling software) doesn’t know if a ligand necessitates some ions and coordinating water molecules. Thus, they are part of the initial box before the solvation as they have been modelled already in the correct state. This is where the biological and chemical knowledge of the user is critical for building a physiologically correct system. Among the systems presented in this tutorial, ANP is the only ligand that necessitates other molecules to be modelled properly in its binding site. The topologies of the other structures contain only the host protein PKG and a ligand (for the HOLO states) or the protein alone (for the APO case). The APO structure and the three ligands are shown in the panels of Figure 2.

Figure 2: APO and HOLO states of PKG with ANP, 1FB, and 1TR.

Now, take a look at the starting configuration reference_topology_PKG_HOLO_ANP.gro. The .gro file has a fixed format, and it is better to not modify it by hand if you are not completely sure about what you are doing. The top of the file looks like this

PKG HOLO with ANP ligand
12908
    1ACE     H1    1  12.255   9.683   1.920
    1ACE    CH3    2  12.282   9.770   1.861
    1ACE     H2    3  12.301   9.855   1.927
    1ACE     H3    4  12.202   9.795   1.792
    1ACE      C    5  12.405   9.741   1.784
    1ACE      O    6  12.461   9.632   1.791
    2GLY      N    7  12.452   9.840   1.705
    2GLY      H    8  12.403   9.929   1.702
[...]

The first line contains the title of the box (PKG HOLO with ANP ligand), the second line contains the number of the atoms in the box (12908), and the following ones report in order all the atoms of the system. These are organized usually as nine columns. The first (here 1ACE) is the specific number and name of the residue, which in this case is the capped N-terminal of PKG. The second column contains the specific name of the atom, and usually the first letter indicates the element (here you have a hydrogen followed by a carbon, two hydrogen atoms, another carbon, and so on). The third is simply the number of the entry. It always starts with 1 and goes up to the number of elements in the box. Then, columns four to six contain the x, y, and z coordinates of that atom, while the columns seven to nine contain its velocity, reported by axial component. Here you can see how the velocities are not reported, and so you have only six columns. All atoms must always have a position, but might have zero or undefined velocity. This is the case now, as you are building the box from a static experimental image. One of the main roles of the equilibrations phase is this - to relax the starting positions and assign reasonable starting velocities to all the atoms.

At the other end of the file, the last lines look like this

[...]
  800ANP    HC812901   9.954  11.756   3.126
  800ANP   HN3B12902  10.448  11.664   2.794
  801SOL     OW12903  10.681  11.334   3.167
  801SOL    HW112904  10.690  11.235   3.182
  801SOL    HW212905  10.712  11.356   3.073
  802SOL     OW12906  10.542  11.544   3.341
  802SOL    HW112907  10.543  11.515   3.439
  802SOL    HW212908  10.540  11.644   3.339
   0.00000   0.00000   0.00000   0.00000   0.00000   0.00000   0.00000   0.00000   0.00000

The meaning of the columns is the same as before. The last molecules to appear are the two water molecules (and in fact, in the topology file, they are reported last). The last line of a .gro file has, like the first two, a special meaning, as it reports the side length of the box along x, y, and z. Sometimes it can have more than three numbers for particular box shapes. You can see how this box is actually undefined - GROMACS keeps this line used, as it has a special role, but it is filled with zeros. You will need to define a box before solvating the system.

You can now set up the box by using gmx editconf

gmx editconf -f reference_topology_PKG_HOLO_ANP.gro -o reference_topology_PKG_HOLO_ANP_boxed.gro -bt dodecahedron -c -d 1

Here you are starting from the reference structure (-f reference_topology_PKG_HOLO_ANP.gro) and asking GROMACS to position your protein in the centre (-c), build a dodecahedron box (-bt dodecahedron), and make sure that the protein is at least 1 nm distant (-d 1) from the box sides to avoid periodic boundary conditions (PBCs) artifacts. Please note that, exactly because of PBCs, there is not a real centre of the box, and positioning the protein in it with -c is mostly for representation purposes rather than real physics.

Differently from the Lysozyme tutorial, here you are also specifying the geometry of the box. In terms of simulations results, the outcome should not depend on the dimension or shape of the simulation box, as long as it is large enough to accommodate the contents (the protein in this case) and avoid self-interaction between PBC images. So, why bother changing the shape or dimension? Why don’t just make a huge cube? A hint should come from one of the last lines of the output of gmx editconf, which will look something like this

new box volume  :1496.67               (nm^3)

Try now to build the same system (but do not use it) by specifying a cubic (-bt cubic) box, you will end up with something like this

new box volume  :2116.61               (nm^3)

This means that you will have a roughly ~40% larger box, and so more water to solvate, a higher number of molecules to simulate, and consequently much slower simulations.

Solvating

Before running gmx solvate, you have to know which water model you want to use. For this force field, as also reported in the reference_topology_PKG_HOLO_ANP.top file, the model is TIP3P (you are importing the parameters with this line #include "./forcefield/tip3.itp"), a three-point water model. This means that each water molecule in your simulation will simply be represented with three sites: one oxygen atom and two hydrogen atoms. To access a three-points water model, the flag name for gmx solvate is -cs spc216.gro, the same as for the Lysozyme tutorial.

Thus, you can solvate the system with the following

gmx solvate -cp reference_topology_PKG_HOLO_ANP_boxed.gro -cs spc216.gro -o reference_topology_PKG_HOLO_ANP_solvated.gro

where you are asking to add water to the structure with -cp reference_topology_PKG_HOLO_ANP_boxed.gro, use as reference a three-points water model with -cs spc216.gro, add call the resulting output structure -o reference_topology_PKG_HOLO_ANP_solvated.gro. Some of you may notice that in the Lysozyme tutorial you also have to pass the topology of the system with the -p flag. This is not mandatory, and has the advantage that the updated topology would have the same name as the input one, which makes harder to trace back after possible errors. However, if you don’t pass the topology then you will have to update it by hand to include the presence of the water molecules.

The (last lines of the) output of this command will look something like this

[...]
Volume                 :     1496.67 (nm^3)
Density                :     999.599 (g/l)
Number of solvent molecules:  44578

GROMACS tries to fill the box with water to reach the density of ca. 1 g/L. The most important part here is the number of water molecules inserted, in this case 44578 (this number can oscillate slightly, as molecules are randomly inserted). You now has just to add this number at the end of the topology to tell GROMACS that the content of the box has changed.

First, copy the reference topology

cp reference_topology_PKG_HOLO_ANP.top reference_topology_PKG_HOLO_ANP_solvated.top

Then, add the amount of water molecules to reference_topology_PKG_HOLO_ANP_solvated.top by correcting the [ molecules ] section in the following way

[...]
[ molecules ]
; Compound         #mols
PKG                    1
MG                     1
ANP                    1
SOL                44580

You have to add the name of the water in the box (SOL) and the number reported after tunning gmx solvate. For the HOLO + ANP structure, SOL was already reported as there are two water molecules coordinating the MG ion, and as such you just have to add two to the total solvation number. For the other structures, just report the plain number obtained from the output of gmx solvate.

At this point, the structure of the solvated system is contained in reference_topology_PKG_HOLO_ANP_solvated.gro, while its topology in reference_topology_PKG_HOLO_ANP_solvated.top.

Adding ions

You are now ready to add the ions in the system. First, you need to generate a .tpr file, which is a binary file that GROMACS can read to understand the charges of the different molecules in the system and select automatically how many ions should be added. You can do this by using gmx grompp and pointing to the parameters file ionize.mdp with the flag -f, to the starting solvated structure reference_topology_PKG_HOLO_ANP_solvated.gro with the flag -c, to the solvated topology reference_topology_PKG_HOLO_ANP_solvated.top with the flag -p, and finally name the output tpr file ionize.tpr with the flag -o

gmx grompp -f ionize.mdp -c reference_topology_PKG_HOLO_ANP_solvated.gro -p reference_topology_PKG_HOLO_ANP_solvated.top -o ionize.tpr

You might see (depending on GROMACS version) that for some systems this command fails with a fatal error containing the following information

NOTE 2 [file reference_topology_PKG_HOLO_ANP_solvated.top, line 15]:
  System has non-zero total charge: -3.000000
  Total charge should normally be an integer. See
  http://www.gromacs.org/Documentation/Floating_Point_Arithmetic
  for discussion on how close it should be to an integer.

WARNING 1 [file reference_topology_PKG_HOLO_ANP_solvated.top, line 15]:
  You are using Ewald electrostatics in a system with net charge. This can
  lead to severe artifacts, such as ions moving into regions with low
  dielectric, due to the uniform background charge. We suggest to
  neutralize your system with counter ions, possibly in combination with a
  physiological salt concentration.

Basically, a tpr file is used to run simulations. As such, when GROMACS tries to prepare one through gmx grompp, it check if the physics of the system is reasonable or not. In this case, it found that your system has a net charge of -3, and is telling you that this is very bad as the systems should always have zero total net charge. You can bypass this check by adding the flag --maxwarn 1 to the command, that is, ignore one (and only one) warning

gmx grompp -f ionize.mdp -c reference_topology_PKG_HOLO_ANP_solvated.gro -p reference_topology_PKG_HOLO_ANP_solvated.top -o ionize.tpr --maxwarn 1

You will see that now GROMACS still complains, but gets the job done. It is very important, however, to understand that the ionization step is nearly the only case in which it is okay to use the --maxwarn flag, as you know that the physics of the system is wrong and you actually need the tpr to fix it with gmx genion.

Silencing warnings

You may be temped to use --maxwarn to get through errors, but generally you should never use this flag without understanding the problem. If there is a major warning then you have to check what is happening. With --maxwarn you can compile the binary but the simulation will probably fail instantly the moment you try to run it, or, worse, it will run but you are likely to produce garbage results due to overlooking fundamental physics mistakes in the box preparation.

Now, you should have a ionize.tpr file in your directory, following the gmx grompp command. You are ready to insert the ions with the following command

gmx genion -s ionize.tpr -neutral -pname NA -nname CL -o start.gro

Here, you are asking GROMACS to make the system neutral (-neutral), call the positive ions NA, the negative ions CL, and call the resulting output start.gro. Trivially, within this force field the atoms NA and CL refer to sodium (Na, +1) and chlorine (Cl, -1) ions. Again, differently from the Lysozyme tutorial, you are not giving as input the topology and you will update it after the addition of ions.

With genion, GROMACS tries to substitute some molecules in the system with the necessary number of ions. You will be promted by GROMACS to choose which part of the system you are okay to substitute in favour of water molecules

Will try to add 3 NA ions and 0 CL ions.
Select a continuous group of solvent molecules
Group     0 (         System) has 146642 elements
Group     1 (        Protein) has 12857 elements
Group     2 (      Protein-H) has  6392 elements
Group     3 (        C-alpha) has   797 elements
Group     4 (       Backbone) has  2392 elements
Group     5 (      MainChain) has  3191 elements
Group     6 (   MainChain+Cb) has  3933 elements
Group     7 (    MainChain+H) has  3973 elements
Group     8 (      SideChain) has  8884 elements
Group     9 (    SideChain-H) has  3201 elements
Group    10 (    Prot-Masses) has 12857 elements
Group    11 (    non-Protein) has 133785 elements
Group    12 (            Ion) has     1 elements
Group    13 (          Other) has    44 elements
Group    14 (             MG) has     1 elements
Group    15 (            ANP) has    44 elements
Group    16 (          Water) has 133740 elements
Group    17 (            SOL) has 133740 elements
Group    18 (      non-Water) has 12902 elements
Group    19 ( Water_and_ions) has 133741 elements
Select a group:

From the first line you can see that GROMACS understood that the system has a total of -3 charge and will then need three positive ions to have a total of zero net charge. Since you do not want to substitute any part of the protein or the ligand, choose the SOL group, and GROMACS will randomly pick seven water molecules, remove them and place there a sodium ion.

Select a group: 17
Selected 17: 'SOL'
Number of (3-atomic) solvent molecules: 44580
Using random seed -17319946.
Replacing solvent molecule 3201 (atom 22505) with NA
Replacing solvent molecule 2493 (atom 20381) with NA
Replacing solvent molecule 18348 (atom 67946) with NA

As for the solvation, you need now to update the topology as some water molecules have been removed and some sodium ions have been added. Start by copying the solvated topology into a new topology called topol.top

cp reference_topology_PKG_HOLO_ANP_solvated.top topol.top

and change the [ molecules ] section by removing the three water molecules from the total and adding the three sodium ions (NA)

[...]
[ molecules ]
; Compound         #mols
PKG                    1
MG                     1
ANP                    1
SOL                44577
NA                     3

If the gmx grompp command didn’t fail, it is because the system has already total charge equal zero. Thus, when you run the gmx genion -s ionize.tpr -neutral -pname NA -nname CL -o start.gro, GROMACS will not prompt you to ask which part of the system you want to substitute in favour of ions, as there are no ions to add, and the output will look like this

No ions to add, will just copy input configuration.

Basically, you have copied the configuration stored in ionize.tpr, which is the one in reference_topology_PKG_HOLO_ANP_solvated.gro, and called it start.gro. You should now copy the topology and call it topol.top, but you do not have to update it as no water molecules have been removed and no ions have been added, and you are good to go.

Salt concentration

Life doesn’t happen in pure water. Generally, during the system preparation a certain amount of salt is added to reach a given physiological salt concentration. These ions are added on top of those needed to neutralize the system. If you want a more realistic set-up, you can ionize the system to 0.150 M with

gmx genion -s ionize.tpr -neutral -pname NA -nname CL -conc 0.150 -o start.gro

but remember to update accordingly the topology with both NA and CL atoms.

Summarizing, you now have the starting solvated and neutralized configuration stored in start.gro and the corresponding topology in topol.top. You can take a look at the final system with VMD. It should look similar to that shown in Figure 3.

Figure 3: Final solvated dodecahedron box for the PKG+ANP HOLO system. Water is represented as a transparent white surface. PKG is in a red cartoon representation. ANP is represented as sticks. Ions are represented with their van der Waals radii and are in blue (sodium) and pink (magnesium). Water is reported as a transparent white surface. The chlorine ions are shown as yellow spheres based on their van der Waals radii. Remember that all the atoms in the system are actually points without a radius described by a set of coordinates, they are not spheres. However, you can use their van der Waals radii to estimate how ‘large’ the atoms are, that is, how much space around them is physically precluded to other atoms due to the atomic repulsion of the inter-molecular van der Waals forces.

A look at the `sbatch_me.sh` file

The content of the exercise directory should now be similar to this

PKG_APO/HOLO_X_exercise
│   sbatch_me.sh
│   start.gro
│   topol.top
│
└─── forcefield
│
└─── step1_em
│   │   em.mdp
│
└─── step2_nvt
│   │   [various nvt.mdp files]
│
└─── step3_npt
│   │   [various npt.mdp files]
│
└─── step4_prod
    │   prod.mdp

plus some leftover files from the system generation. It is mandatory that sbatch_me.sh, start.gro, and topol.top are together in the same directory in which there are the energy minimization, NVT, NPT, and production directories, otherwise the script sbatch_me.sh won’t be able to run the simulations for you.

During both the Lysozyme tutorial and the preparation of the box for this exercise, you logged in a Baobab node by using the salloc command. In this way you can have access to a node and run an interactive job. This is nice because you can have real time answers from the node and you see clearly what is going on and what you are doing, which is pivotal for error-prone procedures like the generation of the starting box. However, when you close the connection, log out from Baobab, or simply turn off the computer, you lose the access to the computer and any running simulation will stop. Does this mean that you should be always connected? And that you should stay constantly in front of your computer, even for a few days at a time, while the simulations run, scared or losing the internet connection and see your runs fail?

Clearly this is not the case. As explained also in the Introduction to HPC, Baobab has a so-called queueing system, named slurm. With slurm, you can prepare a submission script that contains the main commands that you want to run alongside the resources that you need. This is exactly what the sbatch_me.sh script is.

Take a look at the content of the sbatch_me.sh. The first lines look like this

#!/bin/bash 

#SBATCH --account=gervasio_teach_19h330
#SBATCH --partition=private-gervasio-gpu
#SBATCH --time 144:00:00
#SBATCH --job-name PKG_ANP
#SBATCH --error jobname-error.e%j
#SBATCH --output jobname-out.o%j
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 16
#SBATCH --nodes 1
#SBATCH --gpus-per-node=nvidia_geforce_rtx_3080:1
#SBATCH --hint=nomultithread

[...]

You can recognize some of the flags that you were giving to the salloc command, such as --partition=private-gervasio-gpu and --cpus-per-task 16. Basically these lines are specifying which kind of hardware you need to run the simulations. If you are curious, you can look up their meaning in the slurm website.

Slurm reads this files and sends you in a queue while it waits to find some computer that is available and that has the requested hardware. When it finds it, it takes control of it for the amount of time specified in --time 144:00:00, that is, six days. This amount of time will be sufficient to both conclude the equilibration and the production runs.

Continuing along the sbatch_me.sh script, as in the interactive case, the first thing to do once the node is allocated is to source GROMACS, which is the first thing done by the script with the lines following the #SBATCH node requests, i.e., with the following

[...]
module load GCC/11.3.0
module load OpenMPI/4.1.4
module load CUDA/11.7.0
module load GROMACS/2023.1-CUDA-11.7.0
[...]

Then there are a few technical details about how GROMACS should use the available resources to run the simulations with gmx mdrun (e.g. -ntmpi, -ntomp, etc.).

Lastly, there are a series of commands that enter with cd the step1_em, step2_nvt, step3_npt, and step4_prod directories and systematically run gmx grompp to prepare the corresponding tpr files and then run them with gmx mdrun.

You can send the submission script sbatch_me.sh and let slurm handle the simulations by using the sbatch command. This should be run from Baobab’s head-node, and not from the node that you received by running salloc. Thus, from the PKG_APO or HOLO_X_excercise directory where the submission script sbatch_me.sh is located, you can submit the job slurm with the following command

sbatch sbatch_me.sh

If the command runs successfully, slurm returns the number that is assigned to your job, something like this

Submitted batch job 16568567

You can check the status of the job by running (and substituting username with yours)

squeue -u username

The output will look something like this

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          16568567 private-g    PKG_ANP username  R       2:11      1 gpu039

The R under the column ST (for status) stands for running. Usually, the job stays in the queue with some other status, like PD (for pending), before changing to R. Once it’s running, you will see that slurm directly writes the output to the exercise directories, e.g., you will see em.gro, em.trr, and other output files appearing in step1_em, nvt_1.log, nvt_1.gro etc. in step2_nvt, and so on. Now that the job is submitted and running, you can log out from Baobab without risking to stop it. For the malaria target exercises, the time needed to complete the whole equilibration phase (steps 1 to 3) is of about two hours, while to complete the production it will take roughly two days, so a total of two days overall.

A look at the process of equilibration

Differently from the Lysozyme tutorial, here you can see that the NVT and NPT simulations are not unique, but they consist of several steps each, which are run one after the other. For the sake of completeness, you can take a look at the various nvt_X.mdp and npt_X.mdp files. As for the other files, the lines starting with ; are comments, that is, they are ignored by GROMACS and are useful only for human readers to organize the code.

Let’s take a look at a few key entries in the mdp file.

[...]
;---------------------------------------------
; INTEGRATOR AND TIME
;---------------------------------------------
integrator               = md
dt                       = 0.002
nsteps                   = 50000
[...]

Here you are asking GROMACS to use a leapfrog integrator by setting integrator = md. The integration is performed with a time step dt of 0.002 ps, that is, 2 fs, which is the standard for this type of simulations. Lastly, the integration is performed for 50000 steps, as set by nsteps, that is, 50000 x 0.002 ps = 100ps. You can see that the number of steps is relatively short with respect to the production run, as reported in the prod.mdp parameters file.

The other most important section is the one where you set the thermostat and barostat settings. For the NVT runs, you have the thermostat that controls the temperature and the parameters look something like this (here taken from nvt_6.mdp)

[...]
;----------------------------------------------
; THERMOSTAT AND BAROSTAT
;----------------------------------------------
; >> Temperature
tcoupl                   = v-rescale
tc-grps                  = Protein non-Protein
tau_t                    = 0.5     0.5
ref_t                    = 300     300
; >>  Pressure
pcoupl                   = no
[...]

The thermostat is connected separately to the two phases of your system. As for the GPCRs systems, the components in the box are coupled separately to the thermostat to achieve better thermostatting (as specified by tc-grps). However, differently from the GPCR exercise, you didn’t have to produce an index file here because the names you are using for the groups are standard ones that are automatically recognized by GROMACS. Notice how there is no pressure coupling (pcouple = no). For the NPT runs, instead, the barostat is activated and the box can oscillate and change its volume. You can see the parameters of the barostat in the THERMOSTAT AND BAROSTAT sections for any of the npt_X.mdp files.

[...]
; >>  Pressure
pcoupl                   = C-rescale
pcoupltype               = isotropic
tau-p                    = 1.0
compressibility          = 4.5e-5
ref-p                    = 1.0
[...]

The reason why there are several NVT and NPT sub-phases is that the system should be relaxed gradually, otherwise it might distort the starting configuration, which would be detrimental - if not deadly - for maintaining the correct binding pose of the ligand. This can be achieved by putting so-called restraints on specific atoms of the system to restrain them in space, that is, to not let them move too much. This behavior is controlled by the parameters defined under the POSITION RESTRAINTS section, as in the following example taken from nvt_1.mdp

[...]
;---------------------------------------------
; POSITION RESTRAINTS
;---------------------------------------------
define                    = -DPOSRES_PRO -DPOSRES_FC_SC=500 -DPOSRES_FC_BB=1000 -DPOSRES_IONS -DPOSRES_MG=1000 -DPOSRES_LIG=1000
[...]

The technical implementation and the specific meaning of these parameters is beyond the scope of this tutorial. The take at home message is that, due to the restraints, at the beginning both the ligand and the protein (and the MG ion for the ANP system) can’t really move freely in space, but water can. In this way you can thermalize first the water molecules. Then, gradually, the restraints on the atoms of the system are decreased to let the whole system equilibrate. You can check the values of the parameters in the POSITION RESTRAINTS of the nvt_X.mdp and npt_X.mdp files. You will see that they gradually decrease up to disappearing in the last NPT equilibration. You can also see that the nvt_X.mdp files actually have an increasing temperature, so that the system is thermalized gradually to avoid distortions of the experimentally-resolved binding site.

The production run is usually simulated without restraints at all, and in fact there is no restraint defined in prod.mdp, as you would like to simulate the ‘natural’ behavior of the system. In this tutorial, the statistical ensemble of the production run is still the isothermal-isobaric ensemble (NPT), as you can see from the prod.mdp file since both the thermostat and the barostat are active. It is informally called production because it is the part of the simulation that is used most of the times for the analysis, as it comes after the equilibration and it is supposed to be well relaxed.

Tips and tricks for handling trajectories

Trajectories are fidgety things. The standard format of the GROMACS trajectory, an .xtc file, contains the positions of the atoms as a function of time, collected as a sequence of frames. The frequency of the output is set by the keyword nstxout-compressed in the mdp files, in units of time steps. For the protein/ligand productions, this is 10000, that is, 10000 x 0.002 ps = 20 ps. Considering that the run is long 250 ns, this means that you will collect 12500 frames. The content of the frames is defined by the keyword compressed-x-grps. Since this was not set in the mdp files, the default is the whole system, that is, the protein with water and, if present, ligands and ions.

Data volume

You can imagine how writing coordinates for hundreds of thousands of atoms for thousands of frames can be very memory intensive. Deciding in advance what to print and the frequency is very important, and can save you a lot of space. However, it is better to stay on the safer side as the only way to recover not printed data is to rerun the simulation.

This short section covers a couple of key points in handling trajectories and the representation of periodic boundary conditions. It is important to understand that the information of the trajectory is the same, and changing its representation doesn’t change the content. There are a few reasons why post-processing of the trajectory is fundamental. The most important is for analysis purposes, as some tools might be sensible to the representation of the molecules. For example, RMSD calculation might break down and give non-sensical jumps if the tool that calculates it doesn’t reconstruct the trajectory by fixing broken molecules across periodic boundaries. Building on this point, very large trajectories are also hard to handle from the hardware perspective, and it is better to reduce them to the strictly necessary so that tools analyzing them are faster. Handling trajectories is important also for representation purposes, e.g., rendering a fancy image for a publication. Moreover, this is connected to understanding via visualization, as watching the trajectory is very important to gain insights on the system and check possible criticalities, and human eyes work better with whole molecules.

The main tool for trajectory handling in GROMACS is gmx trjconv. Take a look at the manual for the optional flags and see if anything fits what you want to do (or check with gmx trjconv --h). This is a tool that notoriously necessitates some experience to achieve an efficient usage, but it is worth pointing out a few key commands. Generally, after running gmx trjconv GROMACS asks some further information (depending on the flags you are using) in terms of contents of the system. The most important thing here is to read properly what GROMACS asks and answer accordingly to your own purpose.

The panels of Figure 4 shows a few ways in which the trajectory can be post-processed. The videos were rendered with VMD with PKG shown as a red cartoon ribbon and the ligand as van der Waals spheres. Please consider also that the dodecahedron box is represented internally as a triclinic like box, so the PBC box represented as blue lines read by VMD is different from the animation in the representation. The panels are

The original trajectory. Molecules that are broken across the boundaries give rise to the typical spaghetti-like bonds. Notice how nothing leaves the box, as the PBCs put the atoms back in the box, but on the other side;
The trajectory corrected for PBCs broken molecules. This can be achieved with the flag -pbc mol. Notice how the protein is now whole, but GROMACS puts it back at the nearest interface with respect to its center, so the protein can ‘jump’ across boundaries. The ligand can do the same, and sometimes they can be in different position and may seem like the ligand left the pocked;
The trajectory re-centered around the protein and corrected for PBCs broken molecules. This can be achieved with the flag -pbc mol and -center and prompting GROMACS to center the protein in the box. Now everything stays in place, but it’s very hard to focus on the ligand, as the binding site wobbles constantly;
The trajectory refitted on the protein first frame of the energy minimization. This comes from two concatenated passages. The first usually is the same as the point before this one, that is, PBC removal and re-centering. Then, the output is further post-produced with trjconv by setting the flag -fit rot+trans, and prompting GROMACS to use the protein alpha carbons as reference. Briefly, GROMACS tries to match the position of the protein in each frame with the starting one (passed with the flag -s). You can see how the protein position is much more consistent and the ligand binding site - with the bound ligand - can be appreciated again.

Figure 4: The same HOLO with 1TR trajectory re-wrapped in four different ways. PKG is represented as a red cartoon ribbon, while 1TR is represented in van der Waals spheres. Water is removed for clarity. The first panel shows the original trajectory; the second shows molecules made whole across PBCs; the third centers the protein in the box; the fourth aligns the protein to the last frame of the energy minimization. You can see that the ligand was bound for the whole duration of the production.

For the exam, the main post-processing of the trajectory that you might need is the removal of the water molecules, as this will make the trajectory much smaller and easy to download and analyze. It is also worth to correct the PBCs and center the protein, which basically means obtaining what is shown in the third panel of Figure 4. You can do this in one command as in the following

gmx trjconv -f prod.xtc -s nvt_1.tpr -pbc mol -center -o center_traj.xtc

and, when prompted by GROMACS, by selecting the group Protein for centering and System for output. The output will be called center_traj.xtc and will contain the whole trajectory where all the molecules are made whole across the PBCs and the protein is centered in the box. You can then fit the protein to its starting position to better visualize the binding site (like panel 4 of Figure 4) as in the following

gmx trjconv -f center_traj.xtc -s nvt_1.tpr -fit rot+trans -o center_traj_fit.xtc

and selecting the protein C alphas for fit and the System for output. Now, center_traj_fit.xtc will contain the whole trajectory and after centering and fitting it on the starting configuration of the protein contained in nvt_1.tpr. This will probably still be a large file, so you can produce a more manageable file for visualization with VMD with the following two commands

gmx trjconv -f center_traj_fit.xtc -s em.tpr -o nowater.xtc
gmx trjconv -f center_traj_fit.xtc -s em.tpr -o nowater.gro -dump 0

In both cases, select non-Water as the output group. These will produce two files, nowater.gro and nowater.xtc, which are the first frame of the production run and the associated trajectory, respectively, but without the water. The files should be much smaller than the original prod.xtc. You can use these two files to visualize the trajectory in VMD. You can get rid of the intermediary files and remove them, that is, rm center_traj.xtc center_traj_fit.xtc.

Remove carefully

Remember that rm permanently deletes files. There is no Trash/Recycle bin in the terminal, and if by mistake you delete something that you need, like the production trajectory, you will have to re-run the simulation.

Two major points here. First, notice how the tpr passed to trjconv is nvt_1.tpr, the one compiled by the sbatch_me.sh script following the energy minimization. This is useful in cases where the refitting and re-centering is very complex due to the presence of more than one molecule which should stay consistent in terms of reference frame. For example, you can see in panel 2 of Figure 4 how simply applying a PBC correction with -pbc mol might not be enough, as in some frames the ligand is on a side of the box while the protein is on the other, giving the impression that the ligand left the binding site, while this is not the case. Such visualization artifacts can be solved by the usage of a good reference structure (like nvt_1.tpr, as in the beginning without dynamics the molecules were still at the centre of the box), some patience, and a good combination of flags (usually -pbc mol -center is a good starting point). Secondly, removing water might be a good idea for visualization purposes, but not for analysis ones. What about the interaction of water with the ligand/binding site? Are there any? Are they important? This is very relevant in systems like HOLO + ANP, where the two water molecules coordinating the magnesium ion are part of the physiological ligand-bound state. As such, use with care and keeping in mind what you want to do with the cleaned trajectory.

As a final statement, the process of reducing the trajectory shown in the latter is transferable to whatever you want to output that is contained in the box. You will just have to select the group you want to maintain in the output when prompted by trjconv. If the specific group you want is not among those automatically suggested, you can create a new one with gmx make_ndx and pass the index file to trjconv via the flag -n (the adhesion GPCR exercise has an introduction on index generation).

References

El Bakkouri, Majida, Imène Kouidmi, Amy K Wernimont, et al. 2019. “Structures of the cGMP-Dependent Protein Kinase in Malaria Parasites Reveal a Unique Structural Relay Mechanism for Activation.” Proceedings of the National Academy of Sciences 116 (28): 14164–73.

Introduction

A look into the starting files

Preparing the HPC environment

Building the starting box

Solvating

Adding ions

A look at the sbatch_me.sh file

A look at the process of equilibration

Tips and tricks for handling trajectories

References

A look at the `sbatch_me.sh` file