QSAR Principles and Methods

Quantitative Structure Activity Relationship (QSAR) is concerned with building mathematical models that correlate molecular structures with molecular properties. In this section we introduce the notion of molecular descriptors and present the QSAR model and its validation.

Author(s): Hanoch Senderowitz (Predix Pharmaceutical), Claude Cohen (Synergix)
Prerequisites: None
Number of pages: 191
Last update: April 200[illegible]

F1.1 Introduction to QSAR
The topic Introduction to QSAR contains 20 pages.

F1.1.1 Molecular Structure and Molecular Properties
One of the most pervasive postulates in the life sciences is that all molecular properties are coded by, and consequently result from, molecular structure. Some examples of structure-property relationships are illustrated on the following pages.

[Figure: a molecular structure encodes biological, chemical, physical and electronic properties]

F1.1.2 Structure-Property Relationships: Example 1
Paracetamol selectively inhibits the cyclooxygenase enzyme COX-3 found in the brain and spinal cord and consequently relieves pain and reduces fever.
[Figure: Structure -> Property]

F1.1.3 Structure-Property Relationships: Example 2
Cyanide exerts its toxicity by inhibiting cytochrome-c oxidase, the terminal enzyme of the respiratory chain, leading to insufficient utilization of oxygen and suffocation. Inhibition occurs through binding to the ferric ion of the cytochrome.

[Figure: Structure -> Property (cyanide)]

F1.1.4 Structure-Property Relationships: Example 3
Saccharin (usually sold as sodium saccharin) binds to the sweet taste receptor T1R3, located in the plasma membrane of the sweet-taste sensory cells in the taste buds. Binding of saccharin to T1R3 initiates a cascade of events in the taste-sensory cell that eventually releases a signaling molecule to an adjoining sensory neuron, causing the neuron to send impulses to the brain. In the brain, these signals cause the actual sensation of sweetness.

[Figure: Structure -> Property (saccharin)]

F1.1.5 What is QSAR?
Molecules exert their biological effect by binding to their respective receptors, a phenomenon that in turn is governed by their molecular structures (and the molecular structure of the receptor). QSAR (Quantitative Structure Activity Relationship) attempts to formulate the relationship between structure and activity as a mathematical model.

Biological effect = f(Molecular Structure)

F1.1.6 What is QSPR?
The biological effect is just one example of a molecular property. QSPR (Quantitative Structure Property Relationship) is an extension of QSAR designed to formulate the relationship between structure and any molecular property as a mathematical model. Other properties include, for example, solubility, oral bioavailability, metabolic stability and cell permeability.

Property X = f(Molecular Structure)

F1.1.7 Focus on a Single Property at a Time
No single QSPR model can capture the direct connection between all the properties of a compound and its molecular structure; only a single property is handled at a time.
[Figure: one QSPR model per property]

F1.1.8 Molecular Descriptors
Even for one single property, deriving a direct relation with the molecular structure is extremely challenging. However, structural factors that influence the molecular property, known as molecular descriptors, can be identified. For this reason, the QSAR model correlates the property with molecular descriptors.

F1.1.9 Examples of Molecular Descriptors
Examples of molecular properties with their associated descriptors are listed in the following table. Later in this chapter the nature and meaning of some QSAR descriptors are presented.

Molecular Property        Descriptors
Lipophilicity             π, log P, RM, f
Steric properties         Es, MR, MV, parachor
Electronic properties     σ, R, F

F1.1.10 The QSAR Equations
All QSAR equations express a molecular property as a function of specific descriptors. They differ in the property they attempt to correlate, the descriptors they use and the mathematical form of the model.

Oral bioavailability = f1(descriptor set 1)
Cell permeability = f2(descriptor set 2)
Toxicity = f3(descriptor set 3)
Metabolic stability = f4(descriptor set 4)
Receptor binding = f5(descriptor set 5)

F1.1.11 Types of Molecular Descriptors
Molecular descriptors can be classified according to the dimensionality of the molecular representation from which they are derived. 1D descriptors are derived from the chemical formula, 2D descriptors from a 2D (ChemDraw-like) structure and 3D descriptors from the 3-dimensional structure.

F1.1.12 Molecular Descriptors: 1D
The chemical formula constitutes a 1-dimensional representation of the molecular structure from which 1D descriptors can be derived. Such descriptors are based exclusively on the types of atoms that make up the molecule.
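As a minimal sketch of a 1D descriptor, the Python snippet below derives the molecular weight from a chemical formula alone. The function name `molecular_weight`, the short table of atomic masses and the simple parser (flat formulas only, no parentheses or charges) are illustrative assumptions rather than part of the course material.

```python
import re

# Standard atomic masses in g/mol (rounded); only a few elements are listed.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999, "S": 32.06}

def molecular_weight(formula: str) -> float:
    """Sum atomic masses over a flat formula such as 'C7H5NO3S'."""
    total = 0.0
    # Each match is (element symbol, optional count); a missing count means 1.
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_MASS[symbol] * (int(count) if count else 1)
    return total

print(round(molecular_weight("C7H5NO3S"), 1))  # 183.2
```

Only the atom types and their counts enter the calculation, which is exactly what makes molecular weight a 1D descriptor: two isomers with the same formula receive the same value.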
Example: C7H5NO3S, molecular weight 183.2 g/mol

F1.1.13 Molecular Descriptors: 2D
A ChemDraw-like structure constitutes a 2-dimensional representation of the molecular structure from which 2D descriptors can be calculated. In addition to the types of atoms, 2D descriptors also incorporate the bonding pattern of the molecule.

Example: rotatable bonds: 0; H-bond donors: 1

F1.1.14 Molecular Descriptors: 3D
3D descriptors are derived from a 3D molecular structure and take the spatial arrangement of the atoms in the molecule into account.

Example: dipole moment: 2.2 Debye

F1.1.15 A Multitude of Molecular Descriptors
The number of descriptors that can be derived from a molecular structure is virtually unlimited. Currently available software packages can calculate thousands of descriptors. For example, the DRAGON program calculates 1612 descriptors distributed into 20 categories:

■ constitutional descriptors
■ topological descriptors
■ walk and path counts
■ connectivity indices
■ information indices
■ 2D autocorrelations
■ edge adjacency indices
■ BCUT descriptors
■ topological charge indices
■ eigenvalue-based indices
■ Randic molecular profiles
■ geometrical descriptors
■ RDF descriptors
■ 3D-MoRSE descriptors
■ WHIM descriptors
■ GETAWAY descriptors
■ functional group counts
■ atom-centred fragments
■ charge descriptors
■ molecular properties

F1.1.16 Biologically Relevant Descriptors
When constructing a QSAR model, the key is to use descriptors that are relevant to the specific property of interest. These "biologically relevant descriptors" help generate a model that differentiates between molecules that possess the property of interest and those that do not.

[Figure: active and inactive molecules plotted against a relevant and a non-relevant descriptor]

F1.1.17 Application of QSAR
QSAR models are built for three main reasons: to understand the relationship between structure and activity, to design compounds with improved activity, and to predict the activities of compounds prior to their synthesis.
These reasons follow the rational sequence of a QSAR analysis project, in which the first step is to understand the phenomenon and the next is to use this understanding to design new compounds.

[Figure: Understanding -> Design]

F1.1.18 Understanding Structure-Activity Relationships
A good model can reveal information about the receptor's binding site. For example, a correlation with electronic descriptors may indicate that the biological activities are due to the chemical reactivity of the compounds, whereas a correlation with hydrophobic descriptors may reveal the existence of a hydrophobic pocket in the receptor.

F1.1.19 Designing Compounds with Improved Activities
Once a QSAR model has been obtained and reproduces the known data satisfactorily, it can be exploited to predict the biological activity of analogs that have not yet been synthesized. This is of paramount importance in lead optimization and represents one of the most popular uses of the QSAR approach.

[Figure: compounds not yet synthesized -> QSAR model -> prediction of the biological activities]

F1.1.20 Reducing a Virtual Library to a Practical Size
The recent explosion in combinatorial chemistry has added a new dimension to the QSAR approach: reducing a huge virtual library to a manageable size for combinatorial synthesis and high-throughput screening.
[Figure: virtual library generator -> biological activity prediction]

F1.2 The Foundations of QSAR
The topic The Foundations of QSAR contains 28 pages.

F1.2.1 Birth of QSAR
QSAR dates back to the 19th century with the work of Cros (1863), who first observed an inverse correlation between the toxicity of alcohols and their water solubility. Other important milestones include the work of Crum-Brown and Fraser, who related physiological action to chemical constitution (1868). A few years later Horst, Overton and Richet independently observed that the toxicity of organic compounds depends on their lipophilicity/solubility. This discovery was followed by the research of Meyer and Overton, who showed that anesthetic potency correlates well with partition coefficients (1899).

• 1863 Cros: inverse correlation between toxicity and water solubility of alcohols
• 1868 Crum-Brown & Fraser: "physiological action" is a function of "chemical constitution"
• 1890s Horst & Overton: toxicity of organic compounds depends on their lipophilicity
• 1893 Richet: "the more they are soluble, the less they are toxic"
• 1899 Meyer-Overton: partition coefficients correlate with anesthetic potency

F1.2.2 The Foundations of QSAR
During the first half of the 20th century, Louis Hammett laid the foundation for modern QSAR by correlating electronic properties of organic acids and bases with their equilibrium constants and reactivity.
An important landmark in the development of QSAR took place in 1964 with the introduction of the Free-Wilson method and Hansch analysis. This section covers these three seminal contributions to QSAR in some detail.

• Louis Hammett
• Free-Wilson
• Corwin Hansch

F1.2.3 The Hammett Contribution
The dissociation of an organic acid HA is a process by which a proton (H+) is removed from the neutral compound, leaving behind a negatively charged species (A-). The extent of the reaction is measured by the dissociation constant K. Louis Hammett observed that the dissociation constants of aromatic acids are influenced by the electronic properties of the substituents on the phenyl ring.

$\mathrm{HA} \rightleftharpoons \mathrm{H^+} + \mathrm{A^-}, \qquad K = \frac{[\mathrm{H^+}][\mathrm{A^-}]}{[\mathrm{HA}]}$

F1.2.4 Dissociation Constants of Substituted Benzoic Acids
The dissociation constants of substituted benzoic acids indicate that electron withdrawing groups increase dissociation while electron donating groups decrease it.

[Figure: dissociation constants (K x 10^5) of p-Et benzoic acid, benzoic acid (K0 = 6.2) and p-NO2 benzoic acid, in order of increasing electron withdrawal]

F1.2.5 Dissociation of Substituted Phenylacetic Acids
A similar effect exists for other equilibria, such as the dissociation of substituted phenylacetic acids.

[Figure: dissociation constants (K x 10^5) of p-Et phenylacetic acid, phenylacetic acid (K0 = 5.2) and p-NO2 phenylacetic acid, in order of increasing electron withdrawal]

F1.2.6 Linear Free Energy Relationship
When plotting the quantity log(K/K0) for benzoic acids on the X axis, where K and K0 refer to the substituted and unsubstituted compounds, respectively, and the corresponding values measured for the same set of substituents in phenylacetic acids on the Y axis, Hammett obtained a straight line. Because of the association between dissociation constants and free energies ($\Delta G = -RT \ln K$), this phenomenon is known as the linear free energy relationship.

[Figure: log(K/K0) of substituted phenylacetic acids (Y axis) plotted against log(K/K0) of the corresponding benzoic acids (X axis)]

F1.2.7 The Hammett Equation
The straight line described on the previous page can be written as a linear equation, the Hammett equation:

$\log(K/K_0) = \rho\,\sigma$
Note that ρ is related to a given scaffold (e.g. phenylacetic acids), whereas σ is a descriptor of a substituent and describes its influence on the dissociation constant. σ is positive for electron withdrawing substituents and negative for electron donating substituents. In short, ρ pertains to a given equilibrium as compared with the benzoic acid equilibrium, and σ is a descriptor of a substituent.

F1.2.8 The Meaning of ρ
ρ describes the magnitude of the effect a substituent can exert on the dissociation reaction of a given scaffold. As the distance between the substituent and the dissociated proton increases, its influence on the dissociation reaction decreases and so does the value of ρ.

[Figure: ρ for benzoic acid, phenylacetic acid and phenylpropionic acid]

F1.2.9 The Meaning of σ
σ describes the effect of substituents on the dissociation reaction. Substituents on the phenyl ring can increase or decrease the equilibrium constant by stabilizing or destabilizing the anionic form via the formation of a positive or negative partial charge at C1.

[Figure: a methyl substituent places a partial negative charge at C1 of the benzoic acid, which destabilizes the anionic form and decreases dissociation]

F1.2.10 Examples of σ Constants
Electron donating substituents have negative σ values, whereas positive σ values correspond to electron withdrawing groups. Note that σ values differ depending on whether the substituent is in the meta or para position.

F1.2.11 Predicting the pKa of Benzoic Acid Compounds
The Hammett equation is an example of a QSPR equation. It correlates a molecular property, the dissociation constant, with a set of molecular descriptors (σ and ρ). It can be used to predict the pKa of benzoic acid analogs. When a molecule has multiple substituents, the σ values are summed to yield the total value for the compound, as shown in the following example.

$\mathrm{p}K_a = \mathrm{p}K_{a,0} - \rho \sum \sigma = 4.2 - 1.00\,(0.71 - 0.13 + 0.71) = 2.91$

F1.2.12 Hansch Contribution
Corwin Hansch recognized the importance of lipophilicity for biological activity.
Lipophilicity can be defined as the tendency of a compound to prefer an organic phase over an aqueous phase. Molecular lipophilicity is correlated with the presence of hydrophobic groups (e.g. aromatic rings) and with the absence of polar and ionizable groups.

F1.2.13 The Importance of Lipophilicity
To exert its biological effect, a drug must reach its site of action. On its way, the compound encounters biological barriers. For example, the cell membrane is composed of a lipid bilayer, so lipophilic compounds are able to penetrate such barriers. Lipophilicity thus determines in part the amount of the compound that reaches the target site.

[Figure: a compound crossing the cell membrane]

F1.2.14 LogP is a Measure of Compound Lipophilicity
LogP describes the lipophilicity of a compound. It measures the partitioning of the compound between a lipidic phase (usually 1-octanol) and an aqueous phase, expressed as the partition coefficient P or as its logarithm. Compounds with P > 1 (logP > 0) are lipophilic, whereas compounds with P < 1 (logP < 0) are hydrophilic.

$P = \frac{[\mathrm{drug}]_{org}}{[\mathrm{drug}]_{aq}}$

[Figure: a drug distributed between an organic phase and an aqueous phase (buffer); P > 1 (logP > 0) indicates a more lipophilic drug, P < 1 (logP < 0) a more hydrophilic drug]

F1.2.15 Correlation of LogP with Biological Activities
Hansch observed that in several compound series the biological activities (for example, inhibition of frog heart) correlated well with logP. The resulting QSAR expression turned out to be quite simple: a linear equation.

Hansch 1970, inhibition of frog heart by miscellaneous chemicals:

Biological Activity = f(logP)
$\text{Biological Activity} = 0.91 \log P + 0.14 \qquad (n = 34,\ r^2 = 0.951)$

F1.2.16 Example of Correlation with LogP
A correlation with logP was found for the isonarcotic activities of diverse esters, alcohols, ketones, and ethers. The training set covers a large range of logP values and the model generated is quite simple.
In the equation, log(1/C) represents the biological activity response.

$\log(1/C) = 0.731 \log P + 1.22$

[Figure: log(1/C) plotted against log P for the training set]

F1.2.17 Improvements of the Initial Model
To improve on his initial idea (a single descriptor), Hansch further refined the QSAR equation and introduced other descriptors that represent properties likely to influence drug action, such as σ (electronic), Es (Taft), π (lipophilicity) and MR (molar refractivity). He also considered other mathematical equations likely to represent the behavior of the biological system. The analytical forms of some of the typical QSAR models he developed are indicated below, and the most common descriptors used in Hansch's analyses are presented on the following pages.

Typical Hansch QSAR models:

$\log 1/C = a (\log P)^2 + b \log P + c$
$\log 1/C = a \pi^2 + b \pi + c$
$\log 1/C = a (\log P)^2 + b \log P + c \sigma + d$
$\log 1/C = a \pi^2 + b \pi + c \sigma + d E_s + e$

F1.2.18 The π Descriptor
The π descriptor characterizes the lipophilicity of a substituent. It is defined as the difference between the logP of the substituted and unsubstituted compounds. Substituents that are more hydrophobic than H have positive π values, whereas substituents less hydrophobic than H have negative values. The π values for several common substituents are listed below, and an example is provided.

$\pi_X = \log P_X - \log P_H$

Example: log P (substituted) = 2.27 and log P (unsubstituted) = 2.13, so π = 2.27 - 2.13 = 0.14.

Substituent    π
H              0.00
Cl             0.70
Br             1.02
OH             -0.67
Me             0.52

F1.2.19 The MR Descriptor
Molar refractivity (MR) is a molecular descriptor that contains information on the compound's volume corrected by the refractive index (the ratio of the velocity of light in a vacuum to the velocity of light in the substance of interest). In the definition below, d is the density and n is the refractive index. Several MR values are listed below.
$MR = \frac{n^2 - 1}{n^2 + 2} \cdot \frac{MW}{d}$

[Figure: example compounds with MR values 34.37, 29.84, 33.89, 33.18 and 57.92]

F1.2.20 The Taft Descriptor (Es)
The Taft steric constant Es is an experimental value based on rate constants for a given model reaction. It is a measure of the steric effect exerted by a substituent on the reaction. In the definition below, $K_X$ and $K_H$ are the rate constants for the substituted and unsubstituted compounds. The bulkier the substituent, the more negative the Es. The table lists Es values for several common substituents.

$E_s = \log K_X - \log K_H$

[Table: Es values for several common substituents, including Me: 1.24, 0.78, 0.27, 0.08, -0.16, 0.00]

F1.2.21 Meaning of Parabolic Dependence on LogP
Hansch's introduction of a parabolic term for logP was a breakthrough that helped make QSAR a more powerful tool. This additional term denotes the existence of an optimum in logP: molecules with higher values start to be trapped in hydrophobic membranes and cannot reach their site of action.

F1.2.22 The Free-Wilson Analysis
In 1964, Free and Wilson derived a mathematical model based on the hypothesis that the overall biological activity is the sum of the elementary contributions of the substituents. This approach assumes that substituent effects are additive.

F1.2.23 Indicator Variables and Substituent Weights
In the Free-Wilson model, biological activity is represented by the following equation. The indicator variables $x_i$ (with values of 1 or 0) describe the presence or absence of certain substituents at each scaffold position. The weight of each substituent is represented by the coefficient $a_i$, and $\mu$ is the overall average activity.

$\text{Biological activity} = \sum_i a_i x_i + \mu$

Example for position R1, which can carry F, Cl or Me:

R1 contribution = a1 x1 + a2 x2 + a3 x3
R1 (F)  = -0.301 + 0 + 0
R1 (Cl) = 0 + 0.207 + 0
R1 (Me) = 0 + 0 + 0.454

F1.2.24 Free-Wilson Structural Matrix
The construction of the Free-Wilson QSAR model is based on a structural matrix composed of the $x_i$ indicator variables.

[Figure: experimental data table listing the meta position (X), the para position (Y) and the observed log 1/C]
Structural matrix:

       meta position (X)     para position (Y)
 #     F     Cl    Me        F     Cl    Me       log 1/C obs.
 1     1     0     0         0     0     0
 2     0     0     0         1     0     0
 3     0     1     0         1     0     0
 4     0     0     1         0     0     1

F1.2.25 Example of Structural Matrix

F1.2.26 Example of Free-Wilson Equation
A mathematical procedure known as multiple linear regression is used to derive the QSAR equation from the structural matrix, ultimately leading to the values of $a_i$ and $\mu$ for the 22 antiadrenergic agents.

[Figure: structural matrix -> multiple linear regression -> log 1/C equation]

F1.5.1 Descriptors Selection
As mentioned earlier in this chapter, the number of descriptors available for QSAR analyses is very large. A good model is based on a small number of well-chosen descriptors. When many descriptors are screened, a fortuitous correlation may occur. The following pages present important rules for the selection of relevant descriptors.

[Figure: workflow: compounds selection -> descriptors selection -> building the QSAR model -> validating the model]

F1.5.2 Methods for Selecting Relevant Descriptors
Relevant descriptors can be selected either manually or by automated approaches. For each method, computer programs are available that help in the selection of relevant descriptors.

F1.5.3 Manual Selection of Descriptors
The manual method is based on a thorough understanding of the SAR and on exploiting the intuitions generated by the analyses. For example, if preliminary analyses indicate that steric or hydrophobic substituents may increase activity, descriptors such as the molar refractivity (MR) and the hydrophobic substituent constant π should be selected first.

F1.5.4 Automated Selection of Descriptors
The second method selects descriptors in an automated manner, using programs that score and rank them. Automated and manual methods can also be combined, so that the descriptors retained are both relevant and easy to interpret.
Modern methods use genetic algorithms based on the principles of natural evolution (Darwin).

F1.5.5 Systematic Combination of Descriptors
In principle, the best descriptors can be identified by systematically evaluating all their combinations. For each combination, a QSAR equation can be derived and then ranked. The highest-ranked equation reveals the best subset of descriptors. However, this systematic approach is not always feasible: for n descriptors (current software can process 2000), there are $2^n - 1$ different combinations (subsets). The following pages present automated methods that circumvent this difficulty.

F1.5.6 Methods for Selecting a Subset of Descriptors
"Forward selection", "backward elimination" and "stepwise regression" are methods for selecting a subset of descriptors from a large descriptor pool. The process starts with an initial subset of descriptors; successive small alterations of this subset are then made and assessed. If an alteration improves the model, the change is accepted; otherwise it is rejected. The procedure terminates when the model cannot be improved further.

F1.5.7 Forward Selection
The "forward selection" method starts with the single descriptor that best correlates with the dependent parameter. At each subsequent step the method adds the descriptor that contributes most among those remaining. The process stops when the addition of a descriptor no longer improves the model's performance, as assessed by appropriate statistical indices.

F1.5.8 Backward Elimination
The "backward elimination" method starts with a model that includes all the descriptors. At each step the method removes descriptors whose removal does not degrade the model's performance. The process stops when performance starts to decline, as assessed by relevant statistical indices.
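The forward selection procedure can be sketched in a few lines of Python with NumPy. This is an illustrative sketch under stated assumptions, not the algorithm of any particular QSAR package: the adjusted R² plays the role of the "statistical index", a minimum gain of 0.01 is an arbitrary stopping rule, and the study table is synthetic (the activity depends only on descriptors 1 and 4). The helper names `adjusted_r2` and `forward_selection` are hypothetical.

```python
import numpy as np

def adjusted_r2(X, y):
    """Adjusted R^2 of an ordinary least-squares fit with intercept."""
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])          # add intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)  # least-squares solution
    ss_res = np.sum((y - A @ coef) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

def forward_selection(X, y, min_gain=0.01):
    """Greedily add the descriptor that most improves adjusted R^2."""
    selected, best = [], -np.inf
    remaining = list(range(X.shape[1]))
    while remaining:
        # Score every candidate added to the current subset; keep the best.
        score, j = max((adjusted_r2(X[:, selected + [c]], y), c) for c in remaining)
        if score - best < min_gain:   # assumed stopping rule: gain too small
            break
        selected.append(j)
        remaining.remove(j)
        best = score
    return selected, best

# Synthetic study table (assumption): 60 molecules, 6 descriptors, 2 relevant.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 6))
y = 2.0 * X[:, 1] - 1.5 * X[:, 4] + 0.1 * rng.normal(size=60)

selected, score = forward_selection(X, y)
print("selected descriptors:", sorted(selected))
print("exhaustive search would need", 2 ** X.shape[1] - 1, "subsets")
```

The greedy search fits at most n + (n - 1) + ... models instead of the 2^n - 1 required by exhaustive enumeration, at the price of possibly missing the globally best subset. Backward elimination follows the same pattern with the loop reversed: start from all descriptors and drop the one whose removal costs the least.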
F1.5.9 Stepwise Regression
The "stepwise regression" method starts, as in forward selection, with the single descriptor that best correlates with the dependent parameter. At each subsequent step the method adds the descriptor that contributes most among those remaining and may also remove non-contributing descriptors. The process stops when additional descriptors do not improve the model or when removing descriptors causes the model's performance to decline, as assessed by appropriate statistical indices.

F1.5.10 Scaling Descriptors
Descriptors represent a broad range of physico-chemical properties. They need to be scaled so that their respective influences are well balanced when they are combined. Scaling consists of a mathematical operation called "normalization", which sets boundaries for the variation of each descriptor.

F1.5.11 Correlation Between Descriptors
When two descriptors convey essentially the same information about a series of molecules, they are said to be correlated. The use of correlated descriptors in the same equation must be avoided, because the information they carry is over-represented when both are present. A "correlation matrix" provides useful information on the degree of correlation between pairs of descriptors.

[Figure: correlation matrix of MV, MR, σ and MW, with pairs marked as highly correlated, partially correlated or not correlated]

F1.5.12 Example of Correlated Descriptors
Consider, for example, the molecular weight and the number of carbon atoms as two descriptors characterizing a series of alkanes. These two descriptors are highly correlated, which can be shown graphically.

[Figure: molecular weight plotted against the number of carbon atoms]

F1.5.13 Solution to the Problem of Correlated Descriptors
When two descriptors are highly correlated, the solution is to remove one of them. The descriptor that carries strong structural information is preferred and the less intuitive one is removed.
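To make the alkane example concrete, the sketch below computes the Pearson correlation between molecular weight and carbon count for propane through hexane (molecular weights 44.1, 58.1, 72.2 and 86.2 g/mol). The 0.95 cutoff used to call a pair "highly correlated" is an assumed threshold chosen for illustration only.

```python
import numpy as np

# Propane, butane, pentane, hexane
n_carbons = np.array([3, 4, 5, 6])
mol_weight = np.array([44.1, 58.1, 72.2, 86.2])  # g/mol

# 2 x 2 correlation matrix of the two descriptors (each row is one variable).
corr = np.corrcoef(np.vstack([mol_weight, n_carbons]))
r = corr[0, 1]
print(f"r(MW, # carbons) = {r:.4f}")

THRESHOLD = 0.95  # assumed cutoff, not taken from the course
if abs(r) > THRESHOLD:
    print("highly correlated: keep only one of the two descriptors")
```

With more descriptors, the same `np.corrcoef` call returns the full correlation matrix, and each off-diagonal entry can be screened against the chosen cutoff.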
An alternative solution consists of removing the descriptor that has the highest correlation with the other descriptors.

Before removal:

MW       # Carbons
44.1     3
58.1     4
72.2     5
86.2     6

After removal:

MW
44.1
58.1
72.2
86.2

F1.5.14 The Holy Grail in QSAR
There is a general consensus that in a meaningful QSAR equation the number of molecules in the training set should exceed the number of descriptors by a factor of 3 to 5.

F1.6 Deriving the Equation: Step 3
The topic Deriving the Equation: Step 3 contains 24 pages.

F1.6.1 Deriving The QSAR Equation
Step 3 consists of deriving the QSAR equation corresponding to the set of descriptors selected in the previous step.

[Figure: workflow: compounds selection -> descriptors selection -> building the QSAR model -> validating the model]

F1.6.2 The Starting Point: The Study Table
The starting point for deriving a QSAR equation is the study table. It consists of a spreadsheet with one molecule per row and one molecular characteristic (biological activity, descriptors) per column. Typically, the first column identifies the molecule (e.g. compound number or name, 2D structure), the second column gives its activity, and subsequent columns give the values of the corresponding descriptors.
            Property of interest    Descriptors
Compound    Activity                LogP     MR       MW       HOMO     Density
1           (illegible)             -4.03    87.10    332.2    -12.0    1.47
            24                      -3.68    76.53    324.4    -11.5    1.43
            28                      -4.34    91.23    290.3    -11.2    1.37
            64                      -5.19    100.2    310.1    -9.2     1.36
            18                      -5.59    91.32    291.5    -10.2    1.41
n           52                      -4.83    72.12    340.3    -11.3    1.36

F1.6.3 Graphical Analysis of the Data
The study table should lead to graphical analyses. This step is of paramount importance and leaves room for "hunches" and preliminary interpretations. This is where the key questions are asked: Is there an order? Are the points distributed according to known patterns? Can the recognized trends be translated into physico-chemical expressions?

[Figure: data plotted against descriptor z and descriptor k]

F1.6.4 Choice of the Mathematical Equation
Once trends have been identified in the system, the correlation process can begin. The initial analyses help guide the choice of the right mathematical equation. This equation should not be treated as a black box; rather, it should contain information that reflects the behavior of the system and allows it to be interpreted in structural terms. Sound structural information content in a QSAR equation is of utmost importance in step 3.

F1.6.5 Complexity Levels and Data Overfitting
The next hurdle is the mathematical equation. At this stage the complexity of the model depends on both the form of the mathematical equation and the number of descriptors considered.

single linear regression: Activity = a (descriptor) + b
parabolic model: Activity = a (descriptor)^2 + b
multiple linear regression: Activity = a (descriptor1) + b (descriptor2) + c (descriptor3) + d
other models: parabolic, bilinear, probability, equilibrium, etc.

F1.6.6 Mathematics are Very (too) Powerful
QSAR models can be skewed unintentionally by overly powerful mathematical choices. A fit that reproduces the training-set data exactly can yield an equation that is mathematically perfect but meaningless for molecules other than those in the training set.
For example, if the training set consists of 20 molecules, it is always possible to select a set of 20 randomly chosen descriptors and solve the resulting system of 20 equations in 20 unknowns. This error is known as data overfitting.

[Figure: a system of 20 equations and 20 unknowns relating biological activities to descriptors]

F1.6.7 Illustration with an Example
To illustrate the data-overfitting problem, consider a series of compounds for which the permeability through the blood brain barrier (BBB) has been found to be correlated with logP and polar surface area. In the following graph a hypothetical series of compounds is plotted in this space and color-coded according to BBB permeability. Compounds colored green are permeable, whereas compounds colored red are not.

[Figure: compounds plotted in logP / polar surface area space, colored by BBB permeability]

F1.6.8 A Simple Model
A linear model for differentiating between BBB permeable and BBB impermeable compounds can be formulated by drawing a straight line through the logP / polar surface area space. Most of the compounds on the left side of the line are BBB permeable, whereas most of the compounds on its right are BBB impermeable. As the model correctly classifies 45 out of the 50 compounds, it has a success rate of 90%.

[Figure: straight-line model in logP / polar surface area space]

F1.6.9 A Complex Model
A model with an improved success rate can be generated by drawing a curved line across the logP / polar surface area space. This model completely separates the BBB permeable compounds from the BBB impermeable compounds and thus has a success rate of 100%.

[Figure: curved-line model in logP / polar surface area space]

F1.6.10 Comparing the Two Models
Which of the two models better distinguishes BBB permeable from BBB impermeable compounds? Clearly the complex model has the higher success rate. However, to achieve this it bends its boundary to classify the outliers correctly and thereby reproduces the scatter of the training data; it is therefore an overfitted model.
On the other hand, the simple model mislabels the outliers, on the assumption that they are indeed outliers.

[Figure: the simple and complex models side by side in logP / polar surface area space, with the outliers marked]

F1.6.11 Predictive Power of the Simple Model

The simple model predicts that all test compounds lying to the left of the line are BBB permeable and all those lying to the right of the line are BBB impermeable. Assuming that the test compounds are similar to the training compounds, the predictive power of this model is expected to be high.

[Figure: simple model in logP / polar surface area space; one test compound is predicted to be permeable and probably is, another is predicted to be impermeable and probably is]

F1.6.12 Predictive Power of the Complex Model

The complex model also predicts that all test compounds lying to the left of its curved line are BBB permeable and all those lying to the right are BBB impermeable. However, under the same assumption of similarity between test compounds and training compounds, many of its predictions are expected to be erroneous.

[Figure: complex model in logP / polar surface area space; one test compound is predicted to be permeable but is probably impermeable, another is predicted to be impermeable but is probably permeable]

F1.6.13 Complexity Dictated by Predictability of the Model

In the QSAR approach, tailoring an equation to the peculiarities of a training set is easily done. However, forcing the mathematics to fit the data too closely may lead to models that are meaningless in terms of predictability (tools for assessing the predictability of a QSAR model will be presented in Step 4). The real issue is to stop the refinements early enough that the predictive capabilities of the model are not lost.

[Figure: model quality versus complexity level, with curves for reproducibility and predictability]

F1.6.14 Single Linear Equation: Mathematical Outline

The simplest form of a QSAR equation is a linear model with one descriptor. This yields the equation of a straight line of the form

$$y = b_0 + b_1 x$$

where b0 is the intercept of the line with the y axis and b1 is the slope of the line. b0 and b1 are calculated as described on the next page.
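As a minimal sketch of the one-descriptor model y = b0 + b1·x, the following Python computes the least-squares slope and intercept from the standard closed-form expressions (covariance of x and y divided by the variance of x). The function name `fit_line` and the descriptor and activity values are hypothetical, chosen only so the result is easy to check by hand.

```python
def fit_line(x, y):
    """Least-squares fit of y = b0 + b1 * x for one descriptor."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Sum of cross-deviations and sum of squared x-deviations.
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = sxy / sxx                # slope
    b0 = y_mean - b1 * x_mean     # intercept
    return b0, b1


# Hypothetical data lying exactly on the line y = 1 + 2x.
descriptor = [0.0, 1.0, 2.0, 3.0]
activity = [1.0, 3.0, 5.0, 7.0]

b0, b1 = fit_line(descriptor, activity)
print(f"intercept b0 = {b0:.2f}, slope b1 = {b1:.2f}")
```

Because the illustrative points lie exactly on a line, the fit recovers the intercept and slope used to generate them; with real data the residuals would be non-zero.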
F1.6.15 Calculating b0 and b1

b0 and b1 are calculated using the two equations below. The details of such calculations are presented for the Capsaicin example under the heading "Example of Simple Linear Regression".

$$b_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$$

$$b_0 = \bar{y} - b_1 \bar{x}$$

F1.6.16 Multiple Linear Regression: Mathematical Outline

It is not always possible to correlate biological activities with a single descriptor (linear model with one descriptor). Given that biological action results from the combined influence of many factors, the QSAR model can be extended to multiple descriptors. Indeed, the observation that several parameters used simultaneously can lead to good models prompted the development of a method referred to as "multiple linear regression" (MLR). In this model linearity is maintained for each of the individual descriptors.

F1.6.17 Example: MLR vs. Single Linear Models

The example of anticonvulsant compounds shown below demonstrates that none of the descriptors Es, σ and logP alone gave a good correlation with the biological activities (r less than 0.40). However, using logP and σ simultaneously gave a significant improvement (r = 0.80). Adding Es improves the model even more (r = 0.95). This indicates that the biological properties result from the combined action of lipophilic, steric and electronic effects.

Model                                               r
log 1/C = 0.009 Es + 3.411                          0.03
log 1/C = -0.626 σ + 3.314                          0.27
log 1/C = -0.078 logP + 3.432                       0.38
log 1/C = -0.210 logP - 2.214 σ + 3.154             0.80
log 1/C = 0.21 Es - 0.238 logP - 3.81 σ + 3.046     0.95

F1.6.18 The Mathematics of MLR: a Single Sample

In MLR we try to express activity as a linear combination of descriptors. We recognize that in most cases the fit to the experimental data will not be perfect and that some error is usually unavoidable.
In the equations listed below, y (the activity) is a scalar; x_j is the value of descriptor j and b_j its associated coefficient; e is the error. In matrix notation, x^T is a row vector of the descriptors and b a column vector of their associated coefficients.

$$y = b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots + b_m x_m + e$$

Matrix notation:

$$y = \mathbf{x}^T \mathbf{b} + e$$

F1.6.19 The Mathematics of MLR: Many Molecules

For the case of multiple compounds, the activity values are assembled into a vector y of length n, where n is the number of compounds. The descriptors are collected into an n by m matrix X, where n again is the number of compounds and m is the number of descriptors. The coefficients are collected into a vector b of length m and the errors into another vector e of length n.

$$\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{e}$$

F1.6.20 The Solution of MLR

In the MLR formalism we search for the (unknown) set of coefficients b which, when multiplied by the (known) descriptors, best approximates the (known) activity data (equation 1). A solution to this problem can be obtained through a matrix inversion procedure (equation 2).

$$\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{e} \qquad (1)$$

$$\mathbf{b} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \qquad (2)$$

In equation 2, b is the unknown vector of coefficients, X is the original descriptors matrix and y is the known vector of activities. X^T is the transpose of the original descriptors matrix; a transposed matrix replaces columns with rows and vice versa. The superscript "-1" indicates matrix inversion.

F1.6.21 Analysis of the MLR Equation

One of the purposes of QSAR analyses is to understand the forces governing the activity of a particular class of compounds and to assist drug design. In the example below, QSAR analyses reveal that the relative importance of the descriptors varies in the following order: hydrophobicity descriptor > Es > MR. The biological activities are therefore governed in the first place by hydrophobicity and to a lesser extent by steric effects (Es and MR).

[Figure: relative importance of the descriptors, including Es and MR]

F1.6.22 Non-Linear Equations

A non-linear equation is an extension of a multiple linear regression.
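Since non-linear equations build on MLR, it may help to see the MLR solution b = (XᵀX)⁻¹Xᵀy in executable form first. The sketch below is illustrative only: it solves the normal equations (XᵀX)b = Xᵀy by Gaussian elimination instead of forming the inverse explicitly (algebraically the same result), it assumes a leading column of ones in X so that the first coefficient acts as the constant term, and the function name `fit_mlr` and all data values are hypothetical.

```python
def fit_mlr(X, y):
    """Solve the normal equations (X^T X) b = X^T y for the coefficients b."""
    m = len(X[0])
    # A = X^T X (m x m) and c = X^T y (length m).
    A = [[sum(row[i] * row[j] for row in X) for j in range(m)] for i in range(m)]
    c = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(m)]
    # Forward elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        c[col], c[pivot] = c[pivot], c[col]
        for r in range(col + 1, m):
            factor = A[r][col] / A[col][col]
            for k in range(col, m):
                A[r][k] -= factor * A[col][k]
            c[r] -= factor * c[col]
    # Back substitution.
    b = [0.0] * m
    for i in reversed(range(m)):
        known = sum(A[i][k] * b[k] for k in range(i + 1, m))
        b[i] = (c[i] - known) / A[i][i]
    return b


# Hypothetical data: five compounds, two descriptors, generated from
# activity = 3 + 2 * d1 - 1 * d2. The leading 1.0 is the assumed constant column.
X = [
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
    [1.0, 2.0, 1.0],
]
y = [3.0, 5.0, 2.0, 4.0, 6.0]

b = fit_mlr(X, y)
print([round(bi, 6) for bi in b])
```

A non-linear term such as (logP)² could be handled by the same routine by adding the squared descriptor as an extra column of X, since the model stays linear in its coefficients.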
In some systems a linear relationship may not be sufficient to achieve a good correlation. Hansch was the first to introduce a parabolic term, and complex biological processes can often be modeled satisfactorily by non-linear equations.

[Figure: planar, cubic, spherical and ellipsoidal surfaces]

F1.6.23 Example of Non-Linear Model

In the example below, the anticonvulsant activities of a set of molecules were initially found to be linearly correlated with logP. However, it is implausible to assume that the biological activity can increase indefinitely with increasing lipophilicity of the molecules. It is known that highly lipophilic compounds cannot reach their site of action, because they are trapped in lipophilic environments. It is therefore more realistic to improve the initial model using a non-linear equation. The modified equation proved to be correct and revealed the existence of an optimum logP value, information that could not be derived from molecules with a small range of logP values.

Linear model: log (1/C) = 0.73 logP + 2.5

[Figure: activity versus logP, comparing the linear model with the non-linear model]

F1.6.24 Typical Non-Linear Equations

There are many reasons why the use of non-linear models is justified, including the kinetics of drug transport, the equilibrium control of drug distribution, allosteric effects, different pharmacokinetics, metabolism, solubility, etc. The following are examples of non-linear models that have proved to be valid at least for special and complex biological systems.
Parabolic Model (Hansch): log 1/C = a (logP)^2 + b logP + c
Probability Model (McFarland): log 1/C = a logP - 2a log (P+1) + c
Equilibrium Model (Hyde): log 1/C = a logP - log (aP+1) + c
Bilinear Model (Kubinyi): log 1/C = a logP - b log (βP+1) + c

F1.7 Validating the Model: Step 4

The topic Validating the Model: Step 4 contains 19 pages, including:
■ Tools for Assessing the Quality of a Model
■ Predictive and non-Predictive Models
■ The Standard Deviation
■ Correlation Index r2
■ The Mathematics of r2
■ t-test for Single Descriptors and Significance of r2
■ Shape of t-distribution and Number of Molecules
■ Student's t-test Procedure
■ F-test for Assessing the Significance of r2
■ Performing the F-test

F1.7.1 Tools for Assessing the Quality of a Model

Efficient tools are necessary for assessing the validity of a QSAR model. Numerical analyses and statistical methods provide a variety of indices that serve to evaluate the quality of the model and its limitations. In the following pages we present some of these tools and explain how to use them.

F1.7.2 Predictive and non-Predictive Models

Broadly speaking, there are two groups of indices: (1) those that indicate how well the QSAR equation can "reproduce" the experimental data and (2) those that indicate how far the model can be extrapolated to new molecules.

F1.7.3 The Standard Deviation

The easiest way to "validate" a QSAR model is to calculate the standard error or standard deviation (SD or s). It is obtained from the residuals, i.e. the deviations of the measured activities from the values calculated by the model: the squared residuals are summed, divided by the number of degrees of freedom (N - k - 1 for N molecules and k descriptors), and the square root is taken.

$$s = \sqrt{\frac{\sum_i (y_i - y_{calc,i})^2}{N - k - 1}}$$

This index reflects how far the data deviate from the model. The smaller the SD, the better the quality of the model.
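The standard deviation of a fitted model can be checked with a few lines of Python. This is a sketch under one stated assumption: the sum of squared residuals is divided by N - k - 1 degrees of freedom (N molecules, k descriptors), which reproduces the value s = 0.28 reported for the capsaicin example later in this document. The observed and calculated log EC50 values are taken from that example (page F1.8.6); the function name `standard_error` is hypothetical.

```python
import math


def standard_error(observed, calculated, k):
    """Standard deviation s of the residuals for a model with k descriptors."""
    n = len(observed)
    # Residual sum of squares: measured minus model-calculated activity.
    rss = sum((o - c) ** 2 for o, c in zip(observed, calculated))
    # Assumed degrees of freedom: N - k - 1.
    return math.sqrt(rss / (n - k - 1))


# log EC50 values of the seven capsaicin analogs (observed vs. calculated).
observed = [1.07, 0.09, 0.66, 1.42, -0.62, 0.64, -0.46]
calculated = [0.79, 0.21, 1.02, 1.26, -0.81, 0.65, -0.12]

s = standard_error(observed, calculated, k=1)
print(round(s, 2))  # 0.28
```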
F1.7.4 Correlation Index r2

The most frequently used index for evaluating the performance of a QSAR model is r2 (the squared correlation coefficient). r2 measures the degree of correlation between the activity values calculated by the model and those measured experimentally. The value of r2 can range from 0 (no correlation) to 1 (perfect correlation).

[Figure: measured activity versus calculated activity for r2 = 1 (perfect correlation), r2 = 0.5 and r2 = 0]

F1.7.5 The Mathematics of r2

Mathematically, r2 is calculated by dividing the variance explained by the model (the "explained sum of squares", ESS) by the original variance (the "total sum of squares", TSS). ESS is equal to the total variance (TSS) minus that portion of the variance which was not explained by the model (the "residual sum of squares", RSS).

$$r^2 = \frac{ESS}{TSS} = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS}$$

F1.7.6 TSS, the Total Variance

In order to obtain ESS, the variance explained by the QSAR model, we start from the fact that the total variance is the sum of the explained and unexplained variances. Thus, the explained variance is the difference between the total variance and the unexplained variance. The total variance is obtained from the differences between each measured activity and the mean measured activity. The portion of the variance left unexplained by the QSAR model is obtained from the differences between the measured activity and the predicted activity (as given by the regression line).

$$TSS = \sum_i (y_i - \bar{y})^2 \qquad RSS = \sum_i (y_i - y_{calc,i})^2$$

F1.7.8 t-test for Single Descriptors and Significance of r2

r2 alone is not sufficient to determine whether the relationship has occurred by chance; its significance can be calculated using the t-statistic for single descriptors as follows. We repeat the process of deriving a QSAR equation and calculating the resulting r2 value many times, each time using a different descriptor. If the number of molecules is large (> 30), the sampling distribution of the resulting r2 values will have a normal (i.e., Gaussian) shape.
If the number of molecules is small, it will have a shape known as a t-distribution.

[Figure: normal (Gaussian) distribution and t-distribution. Values on the x-axis represent standard deviations from the mean located at x = 0.]

F1.7.9 Shape of t-distribution and Number of Molecules

A value of r2 = 1 will always be obtained for a set of two molecules, irrespective of the descriptor used for the QSAR analysis; however, as the number of molecules increases, the probability of obtaining large r2 values with irrelevant descriptors decreases. This probability corresponds to the area under the t-distribution curve (see below), away from the center (where r2 = 0). The shape of the t-distribution therefore depends on the number of molecules used in the analysis.

[Figure: t-distribution for 3 molecules and t-distribution for 30 molecules]

F1.7.10 Student's t-test Procedure

The Student t-test employs the t-distribution to test whether the correlation coefficient obtained from the QSAR analysis is significantly different from 0. The larger the t-value, the larger the probability that r2 significantly differs from 0; that is, the larger the probability that the descriptor used for the analysis is relevant to the activity. Technically, the steps involved in the Student t-test are as follows.

1. Calculate t according to the equation

$$t = \frac{r\sqrt{N-2}}{\sqrt{1-r^2}}$$

2. Select a significance level (e.g., 0.05).
3. Look up the t value from a t-distribution derived for the correct number of data points (N) at the selected significance level.
4. If the calculated t-value is larger than the listed t-value, then the regression equation is significant at this significance level.

F1.7.11 F-test for Assessing the Significance of r2

The F-test is an extension of the t-test for the case of many descriptors. Like the t-test, it tests (and hopefully rejects) the assumption that the model did not explain any of the original variance in the data set (i.e., ESS = 0).
Just as the t-test uses a t-distribution, the F-test uses an F-distribution which, like the t-distribution, depends on the number of compounds and also on the number of descriptors.

[Figure: F-distributions for 10 molecules and 4 descriptors, and for 100 molecules and 10 descriptors]

F1.7.12 Performing the F-test

The F-test employs the F-distribution to test whether the correlation coefficient obtained from the MLR analysis significantly differs from 0. The larger the F-value, the larger the probability that r2 significantly differs from 0; i.e., the greater the probability that the descriptors used for the analysis are relevant to the activity. Technically, the steps involved in the F-test are as follows.

Starting from

$$r^2 = \frac{ESS}{TSS} \quad \text{and} \quad r^2 = 1 - \frac{RSS}{TSS}$$

F is defined as

$$F = \frac{ESS}{k} \cdot \frac{N-k-1}{RSS}$$

Since

$$1 - r^2 = \frac{TSS - ESS}{TSS} = \frac{RSS}{TSS} \quad \text{and therefore} \quad \frac{ESS}{RSS} = \frac{r^2}{1-r^2}$$

F is calculated according to this equation, where N is the number of molecules and k the number of descriptors:

$$F = \frac{r^2 (N-k-1)}{k (1-r^2)}$$

F1.7.13 F-test Procedure

The application of the steps involved in evaluating the significance of r2 for the Capsaicin analogs using the F-test proceeds as follows:

Calculate F with r2 = 0.92, N = 8 and k = 3:

$$F = \frac{r^2 (N-k-1)}{k (1-r^2)} = \frac{0.92\,(8-3-1)}{3\,(1-0.92)} = 15.33$$

Select a significance level (p): p = 0.01.

Look up the F value from an F-distribution with N = 8, k = 3, p = 0.01: F_tab = 7.59.

The calculated F value (15.33) is larger than the tabulated F value (7.59). Thus, the correlation is significant at this level. The probability that the correlation is fortuitous is less than 1%.

F1.7.14 Assessing the Predictive Power of a Model

r2, t and F are indices that can be generated to evaluate QSAR results. However, these parameters essentially tell us only about the ability of the QSAR model to reproduce the data from which it was derived, not about its aptitude to predict the activities of new compounds. Two methods are presented in the following pages to estimate the predictive power of a QSAR model.
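Before turning to predictive power, the F statistic from the F-test pages can be checked numerically. The sketch below implements both forms given there, F = (ESS/k)·(N-k-1)/RSS and F = r2(N-k-1)/(k(1-r2)), and applies the second to the worked example (r2 = 0.92, N = 8, k = 3). The function names are hypothetical, and the sums of squares in the equivalence check are illustrative values scaled so that TSS = 1. The critical value is still read from an F-table, as in the procedure.

```python
def f_from_sums(ess, rss, n, k):
    """F statistic from the explained and residual sums of squares."""
    return (ess / k) * (n - k - 1) / rss


def f_from_r2(r2, n, k):
    """F statistic from r2, the number of molecules n and descriptors k."""
    return r2 * (n - k - 1) / (k * (1 - r2))


# Worked example from the text: r2 = 0.92, N = 8, k = 3.
f_value = f_from_r2(0.92, n=8, k=3)
print(round(f_value, 2))  # 15.33

# Both forms agree when ESS / TSS = r2 (illustrative sums with TSS = 1).
f_check = f_from_sums(ess=0.92, rss=0.08, n=8, k=3)
print(abs(f_value - f_check) < 1e-9)  # True
```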
F1.7.15 The Test Set Method

The first method is known as the "test set method" and consists of partitioning the initial data into two sets, a preferred strategy when a large set of compounds is available. The initial data set is randomly divided into two parts; the first one (the training set) is used to build a QSAR model and the second one (the test set) to validate this model.

F1.7.16 The Cross Validation Method

The second method is known as the "cross validation method"; it is preferred when the data set is too small to be split into a training set and a test set. In this method the data are randomly divided into N equal parts; N-1 parts are used to build the model, which is then used to predict the activities of the molecules in the remaining Nth part. The procedure is repeated until the activities of all compounds have been predicted independently.

[Figure: cross validation in five stages, Stage 1 to Stage 5]

F1.7.17 Limits of the Cross Validation Method

With the cross validation method, the QSAR model that is ultimately used to predict the activities of new compounds is derived from all N data points and is therefore different from the N partial QSAR models (i.e., those derived from N-1 data points). Cross validation therefore does not give the predictive power of a specific QSAR equation, but rather an estimate of our ability to make predictions for compounds similar to those used in the QSAR analysis.

[Figure: descriptor space (Descriptor 1 on the x-axis) showing the compounds used in the original QSAR analysis, the region where prediction estimates made by cross validation apply, and the region where they do not apply]

F1.7.18 The Predictive Index Q2

The predictive power of the model, termed Q2, is computed by analogy with r2, the difference being the use of PRESS (predicted sum of squares) rather than RSS (residual sum of squares) in the numerator. PRESS is calculated as the sum of the squared differences between the measured activity and the predicted activity for the test set compounds.
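A minimal sketch of how Q2 = 1 - PRESS/TSS can be obtained by cross validation is given below. It makes several assumptions not found in the text: each part contains a single compound (leave-one-out), the model is a one-descriptor straight line, and the descriptor and activity values are hypothetical. The helper names `fit_line` and `leave_one_out` are likewise illustrative.

```python
def fit_line(x, y):
    """Least-squares intercept b0 and slope b1 for y = b0 + b1 * x."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = sxy / sxx
    return y_mean - b1 * x_mean, b1


def leave_one_out(x, y):
    """Return (Q2, PRESS) from leave-one-out cross validation."""
    n = len(x)
    y_mean = sum(y) / n
    tss = sum((yi - y_mean) ** 2 for yi in y)
    press = 0.0
    for i in range(n):
        # Build the model without compound i, then predict compound i.
        b0, b1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (b0 + b1 * x[i] - y[i]) ** 2
    return 1 - press / tss, press


# Hypothetical descriptor values and activities.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [0.1, 0.9, 2.1, 2.9, 4.2]

# r2 of the model built on all compounds, for comparison with Q2.
b0, b1 = fit_line(x, y)
y_mean = sum(y) / len(y)
tss = sum((yi - y_mean) ** 2 for yi in y)
rss = sum((b0 + b1 * xi - yi) ** 2 for xi, yi in zip(x, y))
r2 = 1 - rss / tss

q2, press = leave_one_out(x, y)
print(f"r2 = {r2:.3f}, Q2 = {q2:.3f}")
```

For a least-squares model, each left-out compound is predicted no better than it is fitted, so PRESS is at least as large as RSS and Q2 does not exceed r2.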
$$r^2 = 1 - \frac{RSS}{TSS} \qquad Q^2 = 1 - \frac{PRESS}{TSS}$$

$$RSS = \sum_{i=1}^{N} (y_{calc,i} - y_i)^2 \qquad PRESS = \sum_{i=1}^{N} (y_{pred,i} - y_i)^2$$

F1.7.19 Summary

When discussing the mathematical tools available for assessing the quality of a QSAR model we saw that (1) the standard deviation is an isolated "absolute" index of local meaning; (2) with r2 it is possible to compare different models, but this index is only mathematical, not statistical; (3) t and F have a statistical content that can be used for single and multiple linear regression respectively; however, they only measure the ability of the QSAR model to reproduce the data from which it was constructed.

Example of a reported QSAR equation and its indices:

log C = 1.14 log P + 0.16
n = 25; r2 = 0.91; s = 0.155; F = 66.4; Q2 = 0.875

n: number of molecules
r2: correlation coefficient for assessing the quality of the model
s: standard deviation
F: F-value for assessing the statistical significance
Q2: regression coefficient for measuring the predictability

F1.8 Example of Simple Linear Regression

The topic Example of Simple Linear Regression contains the following 11 pages:
■ Example of Capsaicin Analogs
■ Relevant Descriptors of Capsaicin Analogs
■ The Capsaicin Study Table
■ Graphical Analysis of Capsaicin Analogs
■ Deriving a QSAR Linear Equation
■ Experimental vs. Calculated Values
■ Calculating r2 for the Capsaicin analogs
■ t-test for the Capsaicin Analogs
■ F-test for a Series of the Capsaicin Analogs
■ The QSAR Equation for the Capsaicin Analogs
■ Predicting the Activities of Unknown Compounds

F1.8.1 Example of Capsaicin Analogs

Capsaicin analogs were studied for their analgesic properties, and we will use this study to illustrate the derivation of a simple QSAR model. The biological activities (EC50) were measured for some analogs as indicated below. The question is whether, on the basis of these data, it is possible to develop a QSAR model and predict the biological activities of new compounds.
F1.8.2 Relevant Descriptors of Capsaicin Analogs

Selecting descriptors that correlate with the target biological activity is essential for deriving a meaningful QSAR model. For Capsaicin analogs, biological activity appears to be influenced by the lipophilicity of the substituent R. Following this assumption, the descriptors deemed most suitable are the molar refractivity (MR) and the hydrophobic substituent constant π.

F1.8.3 The Capsaicin Study Table

The following table summarizes the MR and π values which were calculated for the seven Capsaicin analogs. As discussed above, activities (EC50) are expressed as their log values.

F1.8.4 Graphical Analysis of Capsaicin Analogs

For Capsaicin analogs, if we plot the activity values from the study table against MR and π, respectively, there seems to be only a weak correlation between the biological activity and the molar refractivity (MR). However, the hydrophobic substituent constant π shows a possible linear correlation.

F1.8.5 Deriving a QSAR Linear Equation

F1.8.6 Experimental vs. Calculated Values

There is a difference between the experimental and the calculated values, as shown below.

#    log EC50 obs.    log EC50 calc.
1    1.07             0.79
2    0.09             0.21
3    0.66             1.02
4    1.42             1.26
5    -0.62            -0.81
6    0.64             0.65
7    -0.46            -0.12

F1.8.7 Calculating r2 for the Capsaicin analogs

For Capsaicin analogs, r2 is calculated as follows.

$$\bar{y} = \frac{1.07 + 0.09 + 0.66 + 1.42 + (-0.62) + 0.64 + (-0.46)}{7} = 0.4$$

$$TSS = (1.07-0.4)^2 + (0.09-0.4)^2 + (0.66-0.4)^2 + (1.42-0.4)^2 + (-0.62-0.4)^2 + (0.64-0.4)^2 + (-0.46-0.4)^2 = 3.49$$

$$RSS = (0.28)^2 + (-0.12)^2 + (-0.36)^2 + (0.16)^2 + (0.19)^2 + (-0.01)^2 + (-0.34)^2 = 0.40$$

$$r^2 = \frac{3.49 - 0.40}{3.49} = \frac{3.09}{3.49} = 0.89$$

F1.8.8 t-test for the Capsaicin Analogs

The steps involved in evaluating the significance of r2 are as follows:

Calculate t with r2 = 0.89 and N = 7:

$$t = \frac{r\sqrt{N-2}}{\sqrt{1-r^2}} = 6.3604$$

Select a significance level (p).
Look up the t value from a t-distribution with N = 7, p = 0.01: the calculated t value (6.3604) is larger than the tabulated t value (2.998). Thus, the correlation is significant at this level. The probability that the correlation is fortuitous is less than 1%.

F1.8.9 F-test for a Series of the Capsaicin Analogs

The steps involved in evaluating the significance of r2 using the F-test proceed as indicated below. The F-test analysis indicates that a significant correlation is obtained and that the probability of a chance correlation is less than 1%.

Calculate F with r2 = 0.89, N = 7 and k = 1:

$$F = \frac{r^2 (N-k-1)}{k (1-r^2)} = \frac{0.89\,(7-1-1)}{1\,(1-0.89)} = 40.45$$

Select a significance level (p): p = 0.01.

Look up the F value from an F-distribution with N = 7, k = 1, p = 0.01: F = 12.25.

The calculated F value (40.45) is larger than the tabulated F value (12.25). Thus, the correlation is significant at this level. The probability that the correlation is fortuitous is less than 1%.

F1.8.10 The QSAR Equation for the Capsaicin Analogs

QSAR studies reveal the importance of lipophilicity in the analgesic properties of a series of Capsaicin analogs, as indicated by the good correlation found with the π descriptor. The correlation coefficient r2 is 0.89, and analyses of the significance of the equation (t-test and F-test) show that there is less than a 1% probability that the relationship is due to chance. This validates the use of π as a descriptor for the structure-activity relationships.

r2 = 0.89; s = 0.28; t = 6.36; F = 40.45

F1.8.11 Predicting the Activities of Unknown Compounds

The derived QSAR model can be used to predict the biological activities of novel capsaicin analogs by introducing their corresponding π values into the QSAR equation. For example, the biological activity of the amide analog indicated below is predicted with an EC50 of 0.98 µM.
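The statistics quoted for the capsaicin example can be re-derived from the observed and calculated log EC50 values listed under "Experimental vs. Calculated Values". The following Python is a sketch for checking the arithmetic, not part of the original course material; the variable names are illustrative. It assumes, as the worked example appears to do, that r2 is rounded to two decimals before t and F are computed.

```python
import math

# log EC50 values of the seven capsaicin analogs (observed vs. calculated).
observed = [1.07, 0.09, 0.66, 1.42, -0.62, 0.64, -0.46]
calculated = [0.79, 0.21, 1.02, 1.26, -0.81, 0.65, -0.12]

n = len(observed)   # number of molecules, N = 7
k = 1               # one descriptor

y_mean = sum(observed) / n
tss = sum((y - y_mean) ** 2 for y in observed)                   # total sum of squares
rss = sum((o - c) ** 2 for o, c in zip(observed, calculated))    # residual sum of squares

# Assumption: r2 is rounded to two decimals before it is used further.
r2 = round(1 - rss / tss, 2)

t_value = math.sqrt(r2) * math.sqrt(n - 2) / math.sqrt(1 - r2)
f_value = r2 * (n - k - 1) / (k * (1 - r2))

print(f"TSS = {tss:.2f}, RSS = {rss:.2f}, r2 = {r2:.2f}")
print(f"t = {t_value:.4f}, F = {f_value:.2f}")
```

With a single descriptor, F equals the square of t, which is why both tests lead to the same conclusion in this example.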