There was a time when an essential part of most chemistry experiments involved glass blowing, but no longer. Stockrooms full of standardized flasks, condensers, connectors, and tubing have allowed chemists to focus a little more on chemistry and a little less on plumbing. While it's wonderful when scientific breakthroughs allow us to take great strides, we also move forward, inch by inch, when we remove small burdens or simplify some routine task.
When preparing for a computational experiment on a macromolecular complex, one common task is to segregate the components of the complex by their functional roles, keeping those we require (say, protein, cofactor, and substrate) and discarding the rest (detergent or water). It's not particularly difficult: a few minutes with a molecular visualizer/editor and we're done. But while it may be quick and easy, it becomes painful when it needs to be done 50 times. Or 1,000.
How hard can this be to automate? Well, it'd be trivial if a component's functional roles were unambiguously specified. In some companies, for example, the substrate in an internally produced structure is always given the name LIG, or something along those lines. On the other hand, in structures from the Protein Data Bank, all small molecules are classified as ligands, and nothing distinguishes the substrate from the detergent or the cofactor. There are a myriad of cases where the trivial becomes a head scratcher for anyone trying to automate this seemingly simple task. It's a real shame that all that valuable information about functional roles was typically not captured by the authors of the structures we use, but we have to deal with the world as we find it.
Because OpenEye felt it was worth the effort, I have recently spent a good bit of time chewing over the problem to come up with OESplitMolComplex() and related functions to automate this task (available in the June 2015 OEChem toolkit). The API is flexible enough to address a wide range of use cases while remaining simple. The prototype has been part of SiteHopper for more than a year. At the core of this work is the recognition that the number of substrates will continue to grow as long as there are medicinal chemists, but the number of cofactors, buffers, and solvents will grow much more slowly. Until the day each substrate is clearly marked, a good guess can be made by considering the molecules that aren't proteins, cofactors, buffers, or solvents.
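To make the "guess by elimination" idea concrete, here is a minimal Python sketch. It is not OESplitMolComplex() and does not use OEChem at all; it only illustrates the reasoning on PDB-format text. The function names (classify_residue, split_pdb_lines), the role labels, and the tiny residue-name lists are all hypothetical stand-ins chosen for illustration, and the sketch assumes every ATOM record belongs to the protein.

```python
from collections import defaultdict

# Hypothetical, deliberately tiny lookup tables. Real lists would be far
# longer and would need ongoing curation.
WATER = {"HOH", "WAT", "DOD"}
COFACTORS = {"NAD", "NAP", "FAD", "FMN", "HEM", "ATP", "ADP", "SAM", "COA", "PLP"}
BUFFERS_AND_SOLVENTS = {"GOL", "EDO", "SO4", "PO4", "DMS", "PEG", "TRS", "MES", "EPE", "ACT"}


def classify_residue(resname, is_hetatm):
    """Assign a role to one residue, falling back to 'candidate_substrate'."""
    name = resname.strip().upper()
    if name in WATER:
        return "water"
    if not is_hetatm:
        # Assumption of this sketch: ATOM records are always protein.
        return "protein"
    if name in COFACTORS:
        return "cofactor"
    if name in BUFFERS_AND_SOLVENTS:
        return "buffer_or_solvent"
    # Whatever is left after eliminating the known categories is our guess.
    return "candidate_substrate"


def split_pdb_lines(lines):
    """Group residues from PDB-format lines by their guessed role.

    Returns a dict mapping role -> set of (resname, chain, resseq) tuples.
    """
    groups = defaultdict(set)
    for line in lines:
        record = line[:6].strip()
        if record not in ("ATOM", "HETATM"):
            continue
        # Fixed-column fields from the PDB format (0-based slices).
        resname = line[17:20].strip()
        chain = line[21:22]
        resseq = line[22:26].strip()
        role = classify_residue(resname, record == "HETATM")
        groups[role].add((resname, chain, resseq))
    return groups
```

Calling split_pdb_lines() on the lines of a PDB file and reading the "candidate_substrate" entry gives the residues that survived elimination. The lookup tables carry all of the knowledge here, which is why the slow growth of the cofactor, buffer, and solvent lists matters so much to this approach.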
Of course, there is no single "correct" solution to problems of this sort; other people might go about things differently. Our approach is primarily geared towards substrates of interest to pharmaceutical research. A side benefit is that it also allows us to count binding sites, distinguish monomeric and multimeric binding sites, identify apo-proteins, and extract covalent ligands from a protein. We've just begun to work with this API; our lists of cofactors, buffers, etc. will need to be expanded, and there will always be troublesome cases to work on. But this approach has already simplified a growing list of tasks, allowing us to focus more on the chemistry than on the plumbing.