Protein-ligand binding is the process by which a small molecule (a drug, i.e. the ligand) binds to a receptor or other protein in the body. This binding event evokes a biological response, such as reduced inflammation or pain relief. Typically, the protein-ligand complex can assume only a limited number of poses or configurations (possibly only one). Identifying this bioactive pose is a tremendous challenge in drug discovery. It is frequently assumed to be the lowest-energy pose of either the protein or the ligand, but that is typically not the case: the interactions within the complex can stabilize, or compensate for, a higher-energy conformation of the ligand. Both the protein and the ligand are three-dimensional and flexible, and are therefore constantly changing shape. This makes it a multi-step problem. Starting with the ligand, one has to identify its bioactive 3D conformation. Moving on to the protein, identifying the bioactive conformation is an even bigger challenge, partly because the molecule is so much larger and has many more possible conformations. Lastly, even if one could identify the bioactive conformations of both the ligand and the protein, one is still challenged to place the ligand in the correct location and orientation within the protein to produce the desired activity.

There are many ways to generate these poses, as well as many ways to try to determine which ones are (or may be) correct. Some of these calculations are computationally inexpensive, while others may be extraordinarily expensive. One approach to this problem is to generate a large number of potential poses using a fairly inexpensive method and then follow up with a more expensive calculation that ranks them by their likelihood of being the bioactive pose. However, it is still easy to generate many more potential poses than one can afford to evaluate with an expensive method.
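
The two-stage idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `funnel_rank`, the `keep` cutoff, and the convention that a lower score means a more plausible pose are all hypothetical, and the two scoring callables stand in for whatever inexpensive and expensive methods are actually used.

```python
from typing import Callable, List, Sequence, TypeVar

Pose = TypeVar("Pose")  # placeholder for whatever object represents a pose


def funnel_rank(
    poses: Sequence[Pose],
    cheap_score: Callable[[Pose], float],
    expensive_score: Callable[[Pose], float],
    keep: int,
) -> List[Pose]:
    """Prune with an inexpensive score, then rank survivors with an expensive one.

    Assumption: for both scores, lower means a more plausible pose.
    """
    # Stage 1: the cheap score is evaluated on every pose; keep the best `keep`.
    shortlisted = sorted(poses, key=cheap_score)[:keep]
    # Stage 2: the expensive score is evaluated only on the shortlist.
    return sorted(shortlisted, key=expensive_score)


if __name__ == "__main__":
    # Toy data: a "pose" is a single number and the ideal value is 3.0.
    toy_poses = [0.0, 1.0, 2.0, 2.9, 3.2, 5.0]
    ranked = funnel_rank(
        toy_poses,
        cheap_score=lambda p: round(abs(p - 3.0)),  # coarse estimate
        expensive_score=lambda p: abs(p - 3.0),     # fine estimate
        keep=3,
    )
    print(ranked)
```

The `keep` cutoff is where the cost trade-off lives: it caps the number of expensive evaluations, at the risk of discarding the correct pose if the inexpensive score ranks it poorly.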

The Protein Data Bank (PDB) is a database of known crystal structures and Nuclear Magnetic Resonance (NMR) structures, many of which are protein-ligand complexes. By mining the information contained in these structures, we are generating a scoring function based on known protein-ligand interactions. To that end, we process the entire PDB to extract the interactions between small molecules and proteins. The output of the reduce code is the set of observed interactions after a distance-binning technique has been applied. Binning simply groups the observed distances into discrete ranges, which simplifies the comparison of a potential interaction with the interactions actually observed. To capture some information about each atom's chemical environment, atom types are used rather than bare atomic elements. This distinguishes, for example, aromatic carbons from aliphatic ones, and a nitrogen in an amide bond from one in a primary amine. Once the observation counts are tallied, they can be transformed into percentages that express how often a given interaction is observed.
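
The binning and tallying described above can be illustrated with a minimal Python sketch. It is not the project's reduce code. The bin width of 0.5 Å, the 6.0 Å cutoff, the atom-type labels, and the function names are assumptions made for the example. In particular, the text does not say what the percentages are taken relative to; here each bin's count is divided by the total number of observations for the same pair of atom types, which is one plausible choice among several.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# (protein atom type, ligand atom type, distance in angstroms)
Observation = Tuple[str, str, float]
# (protein atom type, ligand atom type, distance-bin index)
BinKey = Tuple[str, str, int]


def bin_index(distance: float, bin_width: float = 0.5) -> int:
    """Map a distance to the index of the fixed-width bin that contains it."""
    return int(distance // bin_width)


def count_interactions(
    observations: Iterable[Observation],
    bin_width: float = 0.5,      # assumed bin width
    max_distance: float = 6.0,   # assumed interaction cutoff
) -> Counter:
    """Tally observations per (protein type, ligand type, distance bin)."""
    counts: Counter = Counter()
    for protein_type, ligand_type, distance in observations:
        if 0.0 <= distance < max_distance:
            key = (protein_type, ligand_type, bin_index(distance, bin_width))
            counts[key] += 1
    return counts


def to_percentages(counts: Counter) -> Dict[BinKey, float]:
    """Convert counts to percentages within each atom-type pair (assumed normalization)."""
    pair_totals: Counter = Counter()
    for (protein_type, ligand_type, _), n in counts.items():
        pair_totals[(protein_type, ligand_type)] += n
    return {key: 100.0 * n / pair_totals[key[:2]] for key, n in counts.items()}


if __name__ == "__main__":
    # Made-up observations; the atom-type labels are placeholders.
    observed = [
        ("C.ar", "C.ar", 3.6),
        ("C.ar", "C.ar", 3.7),
        ("C.ar", "C.ar", 4.2),
        ("N.am", "O.2", 2.9),
    ]
    for key, pct in sorted(to_percentages(count_interactions(observed)).items()):
        print(key, f"{pct:.1f}%")
```

Scoring a candidate interaction then reduces to a dictionary lookup: compute the bin index of its distance and read off how often that atom-type pair was observed in that bin.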