Let’s not forget tautomers.
A compound exhibits tautomerism if it can be represented by two structures that are related by an intramolecular movement of hydrogen from one atom to another. The different tautomers of a molecule usually have different molecular fingerprints, hydrophobicities and pKa’s as well as different 3D shape and electrostatic properties; additionally, proteins frequently preferentially bind a tautomer that is present in low abundance in water. As a result, the proper treatment of molecules that can tautomerize, roughly 25% of a database, is a challenge for every aspect of computer-aided molecular design. Library design that focuses on molecular similarity or diversity might inadvertently include similar molecules that happen to be encoded as different tautomers. Physical property measurements might not establish the properties of individual tautomers, with the result that algorithms based on these measurements may be less accurate for molecules that can tautomerize; this problem influences the accuracy of filtering for library design and also traditional QSAR. Any 2D or 3D QSAR analysis must involve the decision of whether and how to adjust the observed Ki or IC50 for the tautomerization equilibria. QSARs and recursive partitioning methods also involve the decision as to which tautomer(s) to use to calculate the molecular descriptors. Docking-based virtual screening must involve the decision as to which tautomers to include in the docking and how to account for tautomerization in the scoring. All of these decisions are more difficult because there is no extensive database of measured tautomeric ratios in both water and non-aqueous solvents and there is no consensus as to the best computational method to calculate tautomeric ratios in different environments.
Molecules that can exist as different tautomers are chameleons. By virtue of a proton hopping from one polar atom to another and the rearrangement of double bonds or ring opening or closing, a particular atom changes from a hydrogen-bond donor to an acceptor while another atom in the molecule changes from a hydrogen-bond acceptor to a hydrogen-bond donor. Tautomeric reactions in which a heterocyclic ring is opened and closed also change the shape of the molecule.
Small changes in molecular structure or solvent environment can dramatically change the ratio of tautomers: Such changes complicate the assignment of a physical property measurement to a specific chemical structure, the identification of the bioactive species from a tautomeric mixture, and the probability that a “minor” species is the one recognized by a macromolecule.
Although there are many reasons for not carefully considering tautomers in computer assisted drug design, the time has come to take up the challenge. This perspective is not a comprehensive review, but rather a sampling of the experimental information available on tautomers, the implications of these observations, and possible approaches to a more reliable consideration of tautomers in drug design. Although others have also highlighted the issue of tautomers [1–6], the full impact of tautomerism has not received comprehensive attention from the computer-aided drug design community.
Experimental observations of tautomers.
Rate of tautomerization.
In general, if the tautomerism involves moving a proton from one heteroatom to another, the reaction is fast, particularly in aqueous solutions. In these cases, NMR studies see both tautomers and experimental measurements of log P, log D, or pKa contain contributions from all tautomers unless the analytical detection method has been specifically designed to detect only one. On the other hand, tautomerization may be slow if it involves a ring-chain equilibrium or if it involves moving a proton from a heteroatom to a carbon atom.
Examples of the relationship between structure, solvent, and the tautomer ratio.
The ratio of tautomers of any compound is highly dependent on the structure of the solute as well as the solvent [7, 9]. For example, crystallization conditions may induce different tautomers of the same molecule or the two forms might co-exist in a single crystal [10–13].
Figure 1 shows examples of tautomeric equilibria in water. Note that the equilibrium between 4-hydroxypyridine and 4-pyridone is affected by the solvent, by intramolecular hydrogen bonding, and by the electronic effects of substituents. In water the thione form of 4-mercaptopyridine predominates, but the equilibrium switches to the thiol form for 2-mercaptothiophene. The absence of numbers for some of the equilibrium constants in Fig. 1 indicates that although it was possible to establish the predominant tautomer, it was not possible to quantitate the concentration of the minor form.
Figure 2 shows an example of the change in tautomer ratio as a function of solvent and of structure. The replacement of one of a pair of enolizable hydrogens by a methyl group increases the proportion of the NH form in all solvents and increases the proportion of the OH form in both non-polar solvents. Note that tautomerization would also racemize the chiral carbon of Structure 2.04.
Ring-chain tautomerism is well established in carbohydrates, but it also occurs in other molecules such as warfarin, Fig. 3. An example of the substituent effect on this type of equilibrium is shown in Fig. 4. Substitution of an ortho hydrogen with a nitro group favors the open form, whereas substitution with an amino or hydroxy group favors the cyclic form. The equilibrium constant for ring closure follows a Hammett relationship.
Clearly, if one were comparing the biological properties of the compounds in Figs. 1, 2, 3 and 4, it would be important to be alert to the possibility that tautomerism might complicate the structure-activity relationships.
Examples of ligand tautomer preferences of macromolecules.
Often the resolution of a protein crystal structure cannot clearly establish the tautomer of the bound ligand. However, there are several documented cases where the bound tautomer has been unambiguously established. Figure 5 illustrates the contrast between the solution structure of a barbiturate analogue and that in a 1.8 Å crystal structure as bound to matrix metalloproteinase 8. Others have shown with SCRF-HF/6-31G** calculations that the tautomer of unsubstituted barbituric acid that corresponds to the bound tautomer is 20.05 kcal/mol less stable in polar medium. Figure 6 shows the tautomer of pterin bound to the 2.3 Å structure of ricin toxin A-chain. It is 3 kcal/mol higher in energy (AMSOL in AM1-SM2 Hamiltonians) in solution than the favored tautomer.
In some cases more than one tautomer is bound to the protein. For example, Fig. 7 shows the two tautomers that are bound with equal occupancy in a 1.53 Å structure of CDK. This result contrasts with crystal structures of similar compounds in KDR and PDGF, two other kinases, in which only the 2,4-dihydroindeno tautomer, the left structure, is observed. Macrophage migration inhibitory factor (MIF) catalyzes phenylpyruvate tautomerization, Fig. 8. It catalyzes the reaction in both directions, and hence binds both tautomers, although the enol-keto direction is preferred.
Enzymes can also select one species from a ring-chain equilibrium. For example, Fig. 9 shows the tautomers of chlorthalidone, a carbonic anhydrase inhibitor. The crystal structure of the carbonic anhydrase II-chlorthalidone complex shows that it is not bound as the amide form, but rather as an unusual lactim tautomer.
Proteins can bind different tautomers of related compounds. For example, glucose is a substrate for xylose isomerase and xylitol is an inhibitor, Fig. 10. Interestingly, the 0.95 Å crystal structures show that glucose is bound as a ring tautomer, not the chain form as expected from the structure of xylitol [25, 26].
A slightly more complex process is involved with the anti-tuberculosis drug isoniazid. It first forms an adduct with NAD(P); this adduct then inhibits a long-chain enoyl-acyl carrier protein reductase (InhA). Figure 11 summarizes the structures involved. Measurements on model compounds show that in contrast to the bound structure, in water the ring tautomer is favored by a factor of 2 [28, 29].
Complementary hydrogen bonds of bases in DNA lead to the formation of the characteristic double helix of DNA. When the base-pair mimics shown in Fig. 12 form a double helix with complementary DNA, the analogue that positions the tautomerizable group in the major groove is in the keto-amino tautomer. However, the analogue that binds in the minor groove is in the syn-enol tautomer. The differences in tautomer preferences reflect the differences in the character of the major and minor grooves.
Frequency of molecules that can tautomerize.
A summary of one program’s enumeration of tautomers of marketed drugs is shown in Fig. 13. Of the 1,791 compounds, 1,334 or 74% exist as only one tautomer; put another way, 26% exist as an average of three tautomers. For this dataset and enumeration program 2,949 tautomers are found; this increases the size of the dataset by 1.64-fold. Using a different tautomer-generating program, others have found similar or slightly larger increases in the size of a database. Hence, although consideration of tautomers will increase the number of structures considered for virtual screening, the increase should be manageable.
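The counts quoted above can be checked with a few lines of arithmetic. The sketch below uses only the three numbers given in the text and assumes that the 2,949 total includes the single-tautomer compounds (which the quoted fold increase implies). The variable names are illustrative. Recomputed this way, the mean for the multi-tautomer compounds falls between three and four, and small differences from the quoted figures come down to rounding.

```python
# Counts quoted in the text for one enumeration program on marketed drugs.
n_compounds = 1791        # compounds in the dataset
n_single = 1334           # compounds with exactly one tautomer
n_tautomers_total = 2949  # all enumerated tautomers (assumed to include singles)

n_multi = n_compounds - n_single                 # compounds with more than one tautomer
fraction_single = n_single / n_compounds
fraction_multi = n_multi / n_compounds

# Tautomers that belong to the multi-tautomer compounds only.
tautomers_in_multi = n_tautomers_total - n_single
mean_tautomers_per_multi = tautomers_in_multi / n_multi

# Growth of the dataset when every tautomer is stored as its own structure.
fold_increase = n_tautomers_total / n_compounds

print(f"single-tautomer compounds: {fraction_single:.1%}")
print(f"multi-tautomer compounds:  {fraction_multi:.1%} ({n_multi} compounds)")
print(f"mean tautomers per multi-tautomer compound: {mean_tautomers_per_multi:.2f}")
print(f"fold increase in dataset size: {fold_increase:.2f}")
```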
Calculated properties of tautomers.
pKa differences between tautomers.
Because the tautomers of a molecule have different structures, they differ in their ability to gain or lose a proton, that is, in their pKa values. In the simple case of an ionizable molecule that has two tautomeric forms, the tautomeric ratio is a function of the pKa’s of the tautomers. For example, consider the tautomeric and ionic equilibria of 6-chloro-2-pyridone in water, Fig. 14. Algebraically, $K_t = K_a^{\mathrm{OX}} / K_a^{\mathrm{OH}}$. Hence, one can calculate the value of any one of these equilibrium constants from values of the other two.
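As a minimal sketch of this algebra, the Python below converts two tautomer pKa values into a tautomeric ratio. It assumes that both constants are acid dissociation constants leading to a common anion, so that the ratio Ka(OX)/Ka(OH) equals [hydroxy form]/[oxo form]; if the figure instead refers to a common cation, the direction of the ratio is reversed. The function names and the pKa values in the usage lines are hypothetical, not measurements for 6-chloro-2-pyridone.

```python
def log_kt_from_pkas(pka_oxo, pka_hydroxy):
    """Return log10 K_t, with K_t = Ka(OX) / Ka(OH).

    Assumption: both tautomers lose a proton to give the same anion, so
    K_t = [hydroxy form] / [oxo form].
    """
    # log10(Ka_OX / Ka_OH) = -pKa_OX - (-pKa_OH)
    return pka_hydroxy - pka_oxo


def hydroxy_fraction(log_kt):
    """Fraction of the neutral molecule present as the hydroxy tautomer."""
    kt = 10.0 ** log_kt
    return kt / (1.0 + kt)


# Hypothetical microscopic pKa values, for illustration only.
log_kt = log_kt_from_pkas(pka_oxo=8.0, pka_hydroxy=7.0)
print(f"log Kt = {log_kt:+.2f}")
print(f"hydroxy tautomer fraction = {hydroxy_fraction(log_kt):.3f}")
```

Under this assumption the tautomer with the lower pKa (the stronger acid) is the less populated one, which is a convenient sanity check on the sign convention.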
The observed pKa of a tautomerizable molecule is a composite of several individual microscopic ionization constants and the tautomeric equilibrium constant(s). For example, the protonated form of tetracycline (Structure 10) can be present as any one of nine tautomers, and the neutral form as any one of ten. Each of these 19 species could contribute to the observed pKa as well as the biological properties and octanol-water log D of the molecule. Similarly, 8-oxoguanine (Structure 11) can exist as one or more of 100 neutral or anionic tautomers. This complicates investigations into its mechanism of mutagenicity.
Calculation of the tautomer ratio in solution.
Although many workers have investigated the relative stabilities of tautomers in different liquid phases, because of the difficulty of measuring the equilibrium constants there is no publicly available comprehensive database of these data. This lack hinders the development of empirical methods to predict the ratios of tautomers of a molecule. The implications of the lack of experimental data are described in detail in an article on predicting pKa, a less complex equilibrium constant.
If the tautomerization involves only the movement of a proton between sites, the tautomer equilibrium constant can be calculated from the pKa of each tautomer. This relationship holds because deprotonation of the tautomers leads to resonance structures of a common structure. Hammett-type or empirical charge relationships can be used to calculate the pKa’s of the tautomers and hence the tautomeric ratio. However, even these calculations have errors in the range of 0.8 log units.
More elaborate, but not necessarily more accurate, calculations involve free-energy perturbation or quantum chemical calculations [18, 19, 28, 33, 38–48]. To date there appears to be no consensus as to the most appropriate method.
Calculated octanol-water log P of tautomers.
Usually the tautomers of a molecule have different hydrophobicities. Because small changes in structure or solvent can dramatically change the tautomeric ratio, ignoring the possibility of tautomerism leads to complications in assigning the specific molecular structure of a substance for which octanol-water log P has been measured. Indeed, usually the tautomer ratio in each phase has not been established. This ambiguity in turn results in inaccuracies of computational models to predict log P. For example, we and others showed empirically that programs that calculate octanol-water log P are less accurate for molecules that can tautomerize.
Calculated log P values are often used to filter compounds for virtual screening, presumably because of their inverse correlation with water solubility [51–53] or permeability. Such relationships have not been investigated to see if they also apply to molecules that can tautomerize.
In addition, calculated log P values might be used to predict the brain to blood ratio using a simple equation that includes terms for log P and polar surface area, PSA. Although PSA is quite similar for tautomers, the figures in this report show that tautomers of a molecule usually have different hydrophobicities. The question then becomes which log P value should be used in the brain penetration calculation: should we assume that blood is like water and use the log P of the dominant form in water, or do we recognize that tautomerization is fast and use the log P of the more hydrophobic form to simulate brain tissue?
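To illustrate how much the choice of tautomer can matter, the sketch below evaluates a linear model of the form log BB = a·PSA + b·log P + c for two tautomers that share a PSA but differ in log P. The text does not state which equation it has in mind; the default coefficients here are those commonly attributed to Clark's 1999 model and are an assumption, as are the function name and the input values, which are hypothetical.

```python
def log_bb(log_p, psa, c_psa=-0.0148, c_logp=0.152, intercept=0.139):
    """Linear brain/blood model: log BB = c_psa * PSA + c_logp * log P + intercept.

    The default coefficients are an assumption (commonly attributed to
    Clark's 1999 model); pass others to test a different equation.
    """
    return c_psa * psa + c_logp * log_p + intercept


# Hypothetical tautomers: the same PSA, different calculated log P.
psa = 80.0
log_p_water_dominant = 1.0   # tautomer that predominates in water
log_p_more_hydrophobic = 2.5 # more hydrophobic tautomer

bb_a = log_bb(log_p_water_dominant, psa)
bb_b = log_bb(log_p_more_hydrophobic, psa)
print(f"log BB using water-dominant tautomer:   {bb_a:.3f}")
print(f"log BB using more hydrophobic tautomer: {bb_b:.3f}")
print(f"difference: {bb_b - bb_a:.3f} log units")
```

With these inputs the prediction shifts by the log P coefficient times the log P difference, so a 1.5 unit disagreement between tautomers moves log BB by roughly a quarter of a log unit.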
Figures 5, 6, 7 and 8 contain values of octanol-water log P calculated by two popular programs. Note that not only do the values calculated from the different programs seldom agree, but often they do not even agree as to which tautomer is more hydrophobic. As another example, Table 1 lists the calculated octanol-water log P of the tautomers of sildenafil (Viagra) and phenobarbital. Although the programs suggest little difference between Tautomers 1 and 3 of sildenafil, KowWin predicts that the enol form, Tautomer 2, is the least hydrophobic, whereas CLOGP and ALOGP suggest that it is the most hydrophobic of the three. As a consequence, CLOGP and ALOGP predict that Tautomer 2 is the predominant form in the water-saturated octanol phase, whereas KowWin predicts that it is the minor form in this phase. Similar contradictions are seen with the calculated log P of phenobarbital tautomers: CLOGP predicts that Tautomer 1, the tautomer most highly populated in water, is also the most hydrophobic tautomer, whereas ALOGP predicts that it is the least hydrophobic tautomer.
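The link between tautomer-specific log P values and the predominant form in octanol follows from a thermodynamic cycle: for tautomers A and B with K_t = [B]/[A] in each phase, K_t(octanol) = K_t(water) × P_B / P_A. The sketch below is a minimal illustration that assumes each log P refers to a single tautomer and that both phases are at equilibrium; the function name and numbers are hypothetical and are not the values in Table 1.

```python
def log_kt_in_octanol(log_kt_water, log_p_a, log_p_b):
    """Return log10 K_t in octanol, with K_t = [B]/[A] in each phase.

    Closing the cycle A(water) -> B(water) -> B(octanol) -> A(octanol)
    gives K_t(octanol) = K_t(water) * P_B / P_A.
    """
    return log_kt_water + log_p_b - log_p_a


# Hypothetical case: B is the minor form in water but the more hydrophobic one.
log_kt_oct = log_kt_in_octanol(log_kt_water=-1.0, log_p_a=1.0, log_p_b=2.5)
print(f"log Kt (water)   = -1.00  -> A predominates")
print(f"log Kt (octanol) = {log_kt_oct:+.2f}  -> "
      f"{'B' if log_kt_oct > 0 else 'A'} predominates")
```

If two programs disagree on the sign of log P_B − log P_A, they can also disagree on which tautomer predominates in octanol, which is the contradiction described above.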
Cheminformatics issues with tautomers.
Identifying if a molecule is in a database.
This problem has been discussed by others [3, 55, 56]. Because the tautomers of a molecule do not have the same molecular structure, they will usually be encoded differently in the bitmaps or fingerprints that are used to discover if a particular molecule is in a database. An example of different tautomers registered in different databases is seen with sildenafil: Although Tautomer 3 (Table 1) has been reported to be more stable than Tautomer 1 and it is the one associated with a Chemical Abstracts Number, Tautomer 1 is listed as the structure in PubChem and ChemSpider.
The usual solution to this problem is to use a special algorithm to generate a unique tautomer, usually one assumed to predominate in water [3, 55]. Unfortunately, different software vendors use slightly different algorithms with the result that the same compound can be represented differently in different databases.
Substructure searching and identification.
Substructure search queries that will identify tautomers need to be constructed with this possibility in mind. For example, if one uses Structure 1.03 as a search query, if the ring is specified to be aromatic, then molecules that contain Substructure 1.04, perhaps as the N-methyl derivative, would not be found.
Many cheminformatic investigations involve an analysis of the substructures present in the molecules under consideration. For example, QSARs or recursive partitioning may be based on the relative frequency of certain substructures in active versus inactive compounds: Clearly, such investigations are compromised if they do not include the substructures that are present in any (or most abundant?) tautomer of the molecule. The examples in Figs. 5, 6, 7, 8, 9, 10, 11 and 12 show that one cannot focus exclusively on the “major” tautomer.
Table 2 shows Tanimoto similarities calculated with ECFP4 fingerprints and the probability, based on the similarity, that the two compounds will have potency within 10-fold of each other. The columns on the left list the similarities and probabilities between tautomers; the columns to the right list these values for the most similar molecule in this small dataset. Note that in most cases the most similar molecule is not a tautomer of the query molecule. Only if the query structure is rather complex is the tautomer similar. Note the low similarity between Structures 5.01 and 5.02. This result shows that even simple similarity searching can be misleading if one ignores tautomerization.
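For readers unfamiliar with the measure, the Tanimoto similarity of two fingerprints is the number of shared "on" bits divided by the number of bits set in either. The sketch below works on plain Python sets of bit indices; it does not generate ECFP4 fingerprints, and the bit sets shown are hypothetical stand-ins for two tautomers of a small molecule.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two fingerprints given as sets of 'on' bit indices."""
    union = bits_a | bits_b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(bits_a & bits_b) / len(union)


# Hypothetical fingerprints: moving one hydrogen changes the atom environments
# around both the donor and the acceptor, so few bits are shared.
tautomer_1 = {3, 17, 42, 88, 105, 230}
tautomer_2 = {3, 17, 51, 97, 140, 230, 311}

print(f"Tanimoto similarity: {tanimoto(tautomer_1, tautomer_2):.2f}")
```

In a small molecule, the atom environments affected by the proton shift make up a large share of all environments, which is consistent with the observation that only rather complex query structures remain similar to their tautomers.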
Because similarity calculations form the basis for clustering and diversity selection, incorrect handling of tautomers can result in erratic results.
Tautomer enumeration programs.
Cheminformatics software vendors recognize the problems that tautomers cause. As a result, most supply a tautomer enumeration program, generally one that handles only heterocyclic tautomers. To date, there has been no comparison of the different programs, probably because there is no recognized database. Users interested in using a database for virtual screening must then decide if they will enumerate all possible tautomers or just a few that are likely to be the most abundant in water.
Implications of tautomerization for QSAR.
Figures 1, 2 and 3 remind us that within a series the ratio of tautomers in either the water or a non-aqueous phase is not constant. Because QSARs correlate the total concentration of a molecule with some biological effect, tautomerization has the effect of adding equilibria in addition to those for drug-target and drug-distribution. For example, correcting the observed concentration to that of the “bioactive” tautomer in the aqueous phase does not account for the differential partitioning of tautomers of the various analogues to inert nonaqueous and receptor phases or for the possibility that the target biomolecule may recognize a minor tautomer.
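The aqueous-phase correction mentioned above can be sketched as follows. For a two-tautomer system with K_t = [bioactive]/[other] in water, the bioactive fraction is f = K_t / (1 + K_t), and a potency expressed as log(1/C) is adjusted by subtracting log f. This is a simplified illustration that, as the text warns, ignores differential partitioning and any other equilibria; the function names and values are hypothetical.

```python
import math


def bioactive_fraction(log_kt):
    """Fraction present as the bioactive tautomer; K_t = [bioactive]/[other]."""
    kt = 10.0 ** log_kt
    return kt / (1.0 + kt)


def corrected_potency(p_observed, log_kt):
    """Convert log(1/C_total) to log(1/C_bioactive) for a two-tautomer system."""
    return p_observed - math.log10(bioactive_fraction(log_kt))


# Hypothetical analogue: observed pIC50 of 6.0, bioactive tautomer at about 1%.
print(f"bioactive fraction: {bioactive_fraction(-2.0):.4f}")
print(f"corrected potency:  {corrected_potency(6.0, -2.0):.2f}")
```

Because log K_t varies across a series, the size of this correction differs from analogue to analogue, which is why it can change the apparent structure-activity relationship.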
As noted above, for substructure-based QSARs, the first issue is to decide which tautomers should be included in the analysis. The second issue is how the algorithm allows the model to ignore some of the tautomers of a molecule.
Tautomerization complicates the calculation of molecular descriptors for traditional 2D QSAR. For example, it may be ambiguous which calculated log P values to use as a molecular descriptor. Hence, the reliability of QSAR analyses that use hydrophobicity as a descriptor may suffer. In addition, because tautomers of a molecule have different pKa’s, assigning a physical property to a specific molecular structure is especially challenging if the molecule can also ionize at pHs of interest. On the other hand, for 3D-QSARs, one must decide which tautomer as well as which conformer to use for the analysis.
Implications of tautomerization for docking molecules.
High throughput docking programs are generally imprecise enough that one can attempt to dock all reasonable tautomers of a molecule. If the objective of the study is to identify compounds for experimental testing and any tautomer of a molecule has a high score, validation is provided by the experimental testing itself.
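One simple way to act on this is to dock every enumerated tautomer and carry forward the best score per molecule. The sketch below assumes that scores are already available as a nested dictionary and that lower scores are better; both the data layout and the identifiers are hypothetical, and no particular docking program is implied.

```python
def best_tautomer_scores(scores):
    """Pick the best-scoring tautomer for each molecule.

    scores: {molecule_id: {tautomer_id: score}}, lower score = better (assumed).
    Returns {molecule_id: (tautomer_id, score)}.
    """
    best = {}
    for molecule_id, by_tautomer in scores.items():
        tautomer_id = min(by_tautomer, key=by_tautomer.get)
        best[molecule_id] = (tautomer_id, by_tautomer[tautomer_id])
    return best


# Hypothetical docking scores for two molecules.
docking_scores = {
    "mol_A": {"taut_1": -6.2, "taut_2": -8.9},
    "mol_B": {"taut_1": -7.4},
}

# Rank molecules for testing by their best tautomer score.
ranked = sorted(best_tautomer_scores(docking_scores).items(), key=lambda kv: kv[1][1])
for molecule_id, (tautomer_id, score) in ranked:
    print(f"{molecule_id}: best tautomer {tautomer_id}, score {score}")
```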
On the other hand, if the objective of the docking is to propose the structure of the protein-ligand complex, the preliminary docked structures would then be refined to optimize the fit and provide a prediction of affinity. This optimization would involve exploring the conformation of the ligand and the protein active site as well as the protonation and tautomeric state of both.
One strategy is to optimize and calculate the energy of every possible tautomeric and protonation state of the system, both in water and in the active site. This can be done with molecular mechanics force-fields [6, 19, 63, 64], with quantum mechanics [25, 29, 64–66], or a combination of the two. At the current time, no method is particularly accurate; errors of 0.7–1.0 log units for each of the components are not uncommon [35, 67–69]. A quantum mechanical or QM/MM structure optimization would reveal the bound tautomer of both the ligand and the protein [67, 70]. For such calculations one would have to decide the level of theory necessary and whether the whole complex will be treated quantum mechanically or, if not, how the boundary between the quantum and molecular mechanics will be handled. Because a thermodynamic cycle is involved, the use of any method requires that it can reliably predict the ratio of tautomers in aqueous systems.
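The role of the aqueous tautomer ratio in the thermodynamic cycle can be made concrete with a two-state sketch. If only one tautomer binds and it makes up a fraction f of the ligand in water, the observed binding free energy is the intrinsic value for that tautomer plus a penalty of −RT ln f. The code below assumes a two-tautomer ligand at 298.15 K and also converts an error in log units to kcal/mol; the function names and inputs are hypothetical.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol K)


def log_units_to_kcal(log_units, temperature=298.15):
    """Free energy equivalent of a given number of log10 units."""
    return R_KCAL * temperature * math.log(10.0) * log_units


def tautomer_penalty(log_kt_water, temperature=298.15):
    """Cost (kcal/mol) of populating the bound tautomer in water.

    log_kt_water = log10([bound tautomer]/[other tautomer]) in water.
    """
    kt = 10.0 ** log_kt_water
    fraction = kt / (1.0 + kt)
    return -R_KCAL * temperature * math.log(fraction)


def observed_binding_free_energy(dg_intrinsic, log_kt_water, temperature=298.15):
    """Observed free energy when only one tautomer of the ligand binds."""
    return dg_intrinsic + tautomer_penalty(log_kt_water, temperature)


# Hypothetical ligand: bound tautomer is about 1% of the aqueous population.
print(f"penalty: {tautomer_penalty(-2.0):.2f} kcal/mol")
print(f"observed dG: {observed_binding_free_energy(-11.0, -2.0):.2f} kcal/mol")
print(f"0.7-1.0 log units = {log_units_to_kcal(0.7):.2f}-"
      f"{log_units_to_kcal(1.0):.2f} kcal/mol")
```

An error of about one log unit in the aqueous tautomer ratio therefore carries straight through to an error of more than 1 kcal/mol in the predicted affinity.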
Directions for the future.
The need for more experimental data.
This review emphasizes the need for more experimental information on the tautomeric ratio of diverse molecules in water and various solvents. Such observations would form the basis for methods to predict the tautomeric ratio and a test bed to compare the accuracy of the various empirical and quantum chemical methods. Unfortunately, these measurements are difficult to design and often require synthesis of model compounds in hopes that they accurately mimic the properties of the corresponding tautomer.
Careful measurement of the impact of tautomerization on p K a and water solubility would provide information that would improve the predictions of these properties.
The need for cheminformatic databases that can maintain information about tautomers.
Once a body of information is available, it might be discovered that enhancements must be made to the current architecture for storing chemical structures and information. For example, consider the problem of a database that would store all of the tautomers of Structures 10 or 11. Such a database would need to store not only the canonical tautomer but also structures and available properties of each individual tautomer, the measured or calculated equilibrium constants between the tautomers, and the properties of the compound itself.
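A record with the fields listed above might look like the following sketch. It is one possible layout under stated assumptions, not a description of any existing registration system: the class and field names are hypothetical, structures are held as opaque strings, and equilibrium constants are keyed by tautomer pair and solvent.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class Tautomer:
    tautomer_id: str
    structure: str  # any line notation or file reference; opaque here
    properties: Dict[str, float] = field(default_factory=dict)


@dataclass
class CompoundRecord:
    compound_id: str
    canonical_tautomer_id: str
    tautomers: Dict[str, Tautomer] = field(default_factory=dict)
    # Properties measured on the equilibrium mixture (the compound itself).
    compound_properties: Dict[str, float] = field(default_factory=dict)
    # (from_id, to_id, solvent) -> (log10 K with K = [to]/[from], "measured"/"calculated")
    equilibria: Dict[Tuple[str, str, str], Tuple[float, str]] = field(default_factory=dict)

    def add_equilibrium(self, from_id, to_id, solvent, log_k, source):
        self.equilibria[(from_id, to_id, solvent)] = (log_k, source)

    def log_k(self, from_id, to_id, solvent) -> Optional[float]:
        """Look up log K in either direction; None if nothing is stored."""
        if (from_id, to_id, solvent) in self.equilibria:
            return self.equilibria[(from_id, to_id, solvent)][0]
        if (to_id, from_id, solvent) in self.equilibria:
            return -self.equilibria[(to_id, from_id, solvent)][0]
        return None


# Hypothetical usage with placeholder structures and values.
record = CompoundRecord(compound_id="cmpd-1", canonical_tautomer_id="t1")
record.tautomers["t1"] = Tautomer("t1", "<structure of tautomer 1>", {"clogp": 1.0})
record.tautomers["t2"] = Tautomer("t2", "<structure of tautomer 2>", {"clogp": 2.5})
record.compound_properties["observed_pka"] = 7.4
record.add_equilibrium("t1", "t2", "water", log_k=-1.0, source="calculated")
print(record.log_k("t2", "t1", "water"))
```

Keeping compound-level and tautomer-level properties in separate fields makes it explicit whether a value was measured on the mixture or assigned to one structure.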
The need for computer programs that predict ring-chain tautomerization.
Rules for ring formation in organic synthesis have been formulated by Baldwin. These would provide a starting point for a program that would enumerate ring-chain tautomers, a capability absent from the current tautomer generation programs.
The need for validation of the various computational methods.
Although the various methods to explore the structure and energetics of enzyme-ligand complexes are interesting, for such methods to be useful they must be validated. For example, is the continuum solvent assumption sufficient, or is it important to include explicit water molecules in the calculation? Before QM/MM calculations can be used in routine investigations of protein-ligand complexes, they will need to run faster and with less human interaction. A point to be examined would be whether a semi-empirical method [72, 73] might be sufficient for the quantum mechanical portion of the calculation and, indeed, whether the whole system can be accurately calculated with semi-empirical methods.
Tautomerization equilibria present a continuing challenge to computer-aided molecular design, affecting everything from library design to SAR to docking and scoring protein-ligand interactions. The absence of experimental data and validated computational methods makes tautomerization easy to ignore but overwhelming to consider.
Let’s not forget tautomers.