F1 QSAR Principles and Methods
Quantitative Structure Activity Relationship (QSAR) is involved in building mathematical models for correlating molecular structures with molecular properties. In this section we introduce the notion of molecular descriptors and present the QSAR model and its validation.
Author(s): Hanoch Senderowitz (Predix Pharmaceutical), Claude Cohen (Synergix)
Prerequisites: None
Number of Pages:    191 (194 Screens)
Last updated: April 2004
Voice: available
F1.1 Introduction to QSAR
The topic Introduction to QSAR contains the following 20 pages:
■ Molecular Structure and Molecular Properties
■ Structure-Property Relationships: Example 1
■ Structure-Property Relationships: Example 2
■ Structure-Property Relationships: Example 3
■ What is QSAR?
■ What is QSPR?
■ Focus on a Single Property at a Time
■ Molecular Descriptors
■ Examples of Molecular Descriptors
■ The QSAR Equations
■ Types of Molecular Descriptors
■ Molecular Descriptors: 1D
■ Molecular Descriptors: 2D
For the entire list, see the navigation panel.
F1.1.1 Molecular Structure and Molecular Properties
One of the most pervasive postulates in the life sciences is that all molecular properties are coded by and consequently result from molecular structure. Some examples of structure-property relationships are illustrated on the following pages.
Molecular Structure
• biological properties
• chemical properties
• physical properties
• electronic properties
• etc.
F1.1.2 Structure-Property Relationships: Example 1
Paracetamol selectively inhibits the cyclooxygenase enzyme COX-3 found in the brain and spinal cord and consequently relieves pain and reduces fever.
[Figure: Structure: paracetamol. Property: relieves pain.]
F1.1.3 Structure-Property Relationships: Example 2
Cyanide exerts its toxicity by inhibiting cytochrome-c oxidase, the terminal enzyme of the respiratory chain, leading to insufficient utilization of oxygen and suffocation. Inhibition occurs through binding to the ferric ion of the cytochrome.
[Figure: Structure: cyanide. Property: toxicity.]
F1.1.4 Structure-Property Relationships: Example 3
Saccharin (usually sold as sodium saccharin) binds to the sweet taste T1R3 receptor located in the plasma membrane of the sweet-taste sensory cells located in the taste buds. Binding of saccharin to T1R3 initiates a cascade of events in the taste-sensory cell that eventually releases a signaling molecule to an adjoining sensory neuron, causing the neuron to send impulses to the brain. In the brain, these signals cause the actual sensation of sweetness.
[Figure: Structure: saccharin. Property: sweet taste.]
F1.1.5 What is QSAR?
Molecules exert their biological effect by binding to their respective receptors, a phenomenon that in turn is governed by their molecular structures (and the molecular structure of the receptor). QSAR (Quantitative Structure Activity Relationship) attempts to formulate the relationship between structure and activity as a mathematical model.
Biological effect = f (Molecular Structure)
Quantitative Structure Activity Relationships
F1.1.6 What is QSPR?
The biological effect is just one example of molecular properties. QSPR (Quantitative Structure Property Relationship) is an extension of QSAR and is designed to formulate the relationship between structure and any molecular property as a mathematical model. Other properties include for example: solubility, oral bioavailability, metabolic stability and cell permeability.
Property X = f (Molecular Structure)
F1.1.7 Focus on a Single Property at a Time
No single QSPR model can capture the direct connection between all the properties of a compound and its molecular structure; only a single property is handled at a time.
[Figure: one molecular structure linked separately to Property 1, Property 2, Property 3 and Property 4]
F1.1.8 Molecular Descriptors
Deriving a direct relationship between the molecular structure and even a single property is extremely challenging. However, structural factors that influence the molecular property, known as molecular descriptors, can be identified. For this reason, the QSAR model correlates the property with molecular descriptors rather than with the structure itself.
F1.1.9 Examples of Molecular Descriptors
Examples of molecular properties with their associated descriptors are listed in the following table. Later on in this chapter the nature and the meaning of some QSAR descriptors are presented.
Molecular Property	Descriptors
Lipophilicity	π, log P, Rm, f
Steric Properties	Es, MR, MV, parachor
Electronic Properties	
F1.1.10 The QSAR Equations
All QSAR equations have a molecular property expressed as a function of specific descriptors. They differ in terms of the property they are attempting to correlate, the descriptors they use and the mathematical expression of the model.
Oral bioavailability = f1 (descriptors set 1)
Cell permeability = f2 (descriptors set 2)
Toxicity = f3 (descriptors set 3)
Metabolic stability = f4 (descriptors set 4)
Receptor binding = f5 (descriptors set 5)
F1.1.11 Types of Molecular Descriptors
Molecular descriptors can be classified according to the dimensionality of the molecular structure from which they are derived. 1D descriptors are derived from the chemical formula, 2D descriptors are derived from a 2D (ChemDraw-like) structure and 3D descriptors are derived from the 3-dimensional structure.
F1.1.12 Molecular Descriptors: 1D
The chemical formula constitutes a 1-Dimensional representation of the molecular structure from which 1D descriptors can be derived. Such descriptors are based exclusively on the type of atoms which make up the molecule.
C7H5NO3S
Molecular Weight (g/mol): 183.2
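As an illustration of a 1D descriptor, the molecular weight can be computed from the chemical formula alone. The sketch below is a minimal example, not part of the original course: the function name `molecular_weight` and the small atomic-weight table are assumptions made for illustration, and the parser only handles simple formulas without brackets.

```python
import re

# Standard atomic weights in g/mol (rounded); only the elements needed here.
ATOMIC_WEIGHTS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999, "S": 32.06}


def molecular_weight(formula: str) -> float:
    """Sum atomic weights for a simple formula such as 'C7H5NO3S' (no brackets)."""
    total = 0.0
    # Each match is an element symbol followed by an optional count.
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_WEIGHTS[symbol] * (int(count) if count else 1)
    return total


mw = molecular_weight("C7H5NO3S")
print(round(mw, 1))  # 183.2
```

Only the atom types and their counts enter the calculation, which is what makes this a 1D descriptor.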
F1.1.13 Molecular Descriptors: 2D
A ChemDraw-like structure constitutes a 2-Dimensional representation of the molecular structure from which 2D descriptors can be calculated. In addition to types of atoms, 2D descriptors also incorporate the bonding pattern of the molecule.
Rotatable bonds: 0
H-bond donors: 1
F1.1.14 Molecular Descriptors: 3D
3D descriptors derived from a 3D molecular structure take the spatial arrangement of the atoms in the molecule into account.
Dipole Moment: 2.2 Debye
F1.1.15 A Multitude of Molecular Descriptors
The number of descriptors that can be derived from a molecular structure is virtually unlimited. Currently available software packages can calculate thousands of descriptors. For example the DRAGON program calculates 1612 descriptors distributed into 20 categories.
constitutional descriptors
WHIM descriptors
topological charge indices
topological descriptors
molecular properties
Randic molecular profiles
RDF descriptors
functional group counts
information indices
BCUT descriptors
eigenvalue-based indices
atom-centred fragments
walk and path counts
geometrical descriptors
edge adjacency indices
charge descriptors
3D-MoRSE descriptors
connectivity indices
2D autocorrelations
GETAWAY descriptors
F1.1.16 Biologically Relevant Descriptors
When constructing a QSAR model, the key is to use descriptors that are relevant to the specific property of interest. These "biologically relevant descriptors" help generate a model that differentiates between molecules that possess the property of interest and those that do not.
[Figure legend: Non-Relevant Descriptor, Relevant Descriptor]
F1.1.18 Understanding Structure-Activity Relationships
A good model can reveal information about the receptor's binding site. For example a correlation with electronic descriptors may indicate that the biological activities could be due to the chemical reactivity of the compounds, or alternatively, a correlation with hydrophobic descriptors may reveal the existence of a hydrophobic pocket in the receptor.
F1.1.19 Designing Compounds with Improved Activities
Once a QSAR model is obtained and reproduces the known data satisfactorily, it can be exploited to predict the biological activity of not yet synthesized analogs. This is of paramount importance in lead optimization and represents one of the most popular uses of the QSAR approach.
[Figure: Compounds not yet synthesized → QSAR model → Prediction of the biological activities]
F1.1.20 Reducing a Virtual Library to a Practical Size
The recent explosion in combinatorial chemistry has added a new dimension to the QSAR approach: a QSAR model can be used to reduce a huge virtual library to a manageable size for combinatorial synthesis and high-throughput screening.
[Figure: Virtual Library Generator → Biological Activity Prediction → To Chemical Synthesis]
F1.2 The Foundations of QSAR
The topic The Foundations of QSAR contains the following 28 pages:
■ Birth of QSAR
■ The Foundations of QSAR
■ The Hammett Contribution
■ Dissociation Constants of Substituted Benzoic Acids
■ Dissociation of Substituted Phenylacetic Acids
■ Linear Free Energy Relationship
■ The Hammett Equation
■ The Meaning of ρ
■ The Meaning of σ
■ Examples of σ Constants
■ Predicting the pKa of Benzoic Acid Compounds
■ Hansch Contribution
■ The Importance of Lipophilicity
For the entire list, see the navigation panel.
F1.2.1 Birth of QSAR
QSAR dates back to the 19th century with the work of Cros (1863), who first observed an inverse correlation between the toxicity of alcohols and their water solubility. Other important milestones include work by Crum-Brown and Fraser, who related physiological action to chemical constitution (1868). A few years later Horst, Overton and Richet independently observed that the toxicity of organic compounds depended on their lipophilicity/solubility. This discovery was followed by research by Meyer and Overton, who showed that anesthetic potency correlated well with partition coefficients (1899).
• 1863 Cros: inverse correlation between toxicity and water solubility of alcohols
• 1868 Crum-Brown & Fraser: "physiological action" is a function of "chemical constitution"
• 1890's Horst & Overton: toxicity of organic compounds depends on their lipophilicity
• 1893 Richet: "the more they are soluble, the less they are toxic"
• 1899 Meyer-Overton: partition coefficients correlate with anesthetic potency
F1.2.2 The Foundations of QSAR
During the first half of the 20th century, Louis Hammett laid the foundation for modern QSAR by correlating electronic properties of organic acids and bases with their equilibrium constants and reactivity. An important landmark in the development of QSAR took place in 1964 with the introduction of the Free-Wilson method and Hansch analysis. This section covers these three seminal contributions to QSAR in some detail.
• Louis Hammett
• Free-Wilson
• Corwin Hansch
F1.2.3 The Hammett Contribution
The dissociation of organic acids (HA) is a process by which a proton (H+) is removed from the neutral compound, leaving behind a negatively charged species (A−). The extent of the reaction is measured by the dissociation constant K. Louis Hammett observed that the dissociation constants of aromatic acids are influenced by the electronic properties of the substituents on the phenyl ring.
$$\mathrm{HA} \rightleftharpoons \mathrm{H^+} + \mathrm{A^-}$$
$$K = \frac{[\mathrm{H^+}]\,[\mathrm{A^-}]}{[\mathrm{HA}]}$$
F1.2.4 Dissociation Constants of Substituted Benzoic Acids
The dissociation constants of substituted benzoic acids indicate that electron withdrawing groups increase dissociation while electron donating groups decrease it.
[Figure legend: p-Et, Benzoic Acid, p-NO2]
F1.2.5 Dissociation of Substituted Phenylacetic Acids
A similar effect exists for other equilibria, such as the dissociation of substituted phenylacetic acids.
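[Figure: dissociation (COOH to COO−) of phenylacetic acid and its p-Et and p-NO2 derivatives, with dissociation constants in units of 10^-5 (5.2 shown); legend: p-Et, Phenylacetic Acid, p-NO2; label: electron withdrawing]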
F1.2.6 Linear Free Energy Relationship
When plotting the quantity log(K/Ko) for benzoic acids on the X axis, where K and Ko refer to the substituted and unsubstituted compounds, respectively, and the corresponding values measured for the same set of substituents in phenylacetic acids on the Y axis, Hammett obtained a straight line. Because of the association between dissociation constants and free energies [ΔG = −RT ln(K)], this phenomenon is known as the linear free energy relationship.
Benzoic Acid
R	K	log(K/Ko)
NO2	(illegible)	
Et	4.4 x 10^-5	-0.15
H	6.2 x 10^-5	0
Phenylacetic Acid
R	K	log(K/Ko)
NO2	~14 x 10^-5	0.43
Et	4.2 x 10^-5	
H		0
[Plot: log(K/Ko) of phenylacetic acids (Y axis) versus log(K/Ko) of benzoic acids (X axis); the points, including NO2, fall on a straight line]
F1.2.7 The Hammett Equation
The straight line described on the previous page can be written as a linear equation, the Hammett equation:
$$\log\left(\frac{K}{K_0}\right) = \rho\,\sigma$$
Note that ρ is related to a given scaffold (e.g. phenylacetic acids), whereas σ is a descriptor of a substituent and describes its influence on the dissociation constant. σ is positive for electron withdrawing substituents and negative for electron donating substituents.
ρ pertains to a given equilibrium as compared to the benzoic acid equilibrium.
σ is a descriptor of a substituent.
F1.2.8 The Meaning of ρ
ρ describes the magnitude of the effect a substituent can exert on the dissociation reaction of a given scaffold. As the distance between the substituent and the dissociated proton increases, its influence on the dissociation reaction decreases and so does the value of ρ.
[Figure legend: Benzoic Acid, Phenylacetic Acid, Phenylpropionic Acid]
F1.2.9 The Meaning of σ
σ describes the effect of substituents on the dissociation reaction. Substituents on the phenyl ring can increase or decrease the equilibrium constant by stabilizing or destabilizing the anionic form via the formation of a positive or negative partial charge at C1.
[Figure: benzoic acid bearing a CH3 substituent, with a partial negative charge at C1. Caption: destabilizes anionic form, decreases dissociation]
F1.2.10 Examples of σ Constants
Electron donating substituents have negative σ values, whereas positive σ values correspond to electron withdrawing groups. Note that σ values differ depending on whether the substituent is meta or para.
F1.2.11 Predicting the pKa of Benzoic Acid Compounds
The Hammett equation is an example of a QSPR equation. It correlates a molecular property, the dissociation constant, with a set of molecular descriptors (σ and ρ). It can be used to predict the pKa of benzoic acid analogs. When a molecule has multiple substituents, the σ values are summed to yield the total value for the compound, as shown in the following example.
pKa (calcd) = 4.2 - 1.00 x (0.71 - 0.13 + 0.71) = 2.91
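The calculation above can be written as a small function. This is a minimal sketch: the function name `predict_pka` and its parameter names are chosen here for illustration, and the numbers are simply those of the worked example (parent pKa 4.2, ρ = 1.00, σ values 0.71, −0.13 and 0.71).

```python
from typing import Sequence


def predict_pka(pka_parent: float, rho: float, sigmas: Sequence[float]) -> float:
    """Hammett-type estimate: pKa = pKa(parent) - rho * (sum of sigma values)."""
    return pka_parent - rho * sum(sigmas)


# Values from the worked example: three substituents on a benzoic acid scaffold.
pka = predict_pka(pka_parent=4.2, rho=1.00, sigmas=[0.71, -0.13, 0.71])
print(round(pka, 2))  # 2.91
```

Electron withdrawing substituents (positive σ) lower the predicted pKa, in line with the observation that they increase dissociation.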
F1.2.27 Predictability of the Model
The experimental and calculated values for the antiadrenergic molecules of the training set are indicated below and show that the Free-Wilson model reproduces the biological activities well. Moreover, the equation can be used to predict the biological activities of new, not yet synthesized analogs.
F1.3 Design of a QSAR Model
The topic Design of a QSAR Model contains the following 3 pages:
■ Embarking on the Design of a QSAR Model
■ The Four Steps
■ An Iterative Process
F1.3.1 Embarking on the Design of a QSAR Model
The planning of a QSAR model must be carefully managed. In this section we will explore the methodology for designing a QSAR model in some detail, present the ideas and statistical concepts behind the QSAR model, the rules that need to be followed and the errors that should be avoided.
[Figure: questions raised when planning a QSAR model: descriptors? parabolic? equations? linear? how many molecules? training set? correlation coefficient? what are the requirements? predictive? trend? normalize descriptors? molecule to synthesize?]
F1.3.2 The Four Steps
To construct a QSAR model the following steps should be followed: (1) assemble a sufficiently large and diverse set of compounds along with their biological activities; (2) select a set of descriptors which is likely to be related to the biological activity of interest; (3) formulate a mathematical equation that reflects the relationship between the biological activity and the chosen descriptors, and finally (4) validate the QSAR model.
• 1. Compounds Selection
• 2. Descriptors Selection
• 3. Building the QSAR model
• 4. Methods for Validating the model.
F1.3.3 An Iterative Process
Constructing a QSAR model is an iterative process. First, the QSAR equation is derived from an initial set of descriptors. Attempts are then made to improve this model by adding or removing descriptors and refining the mathematical equation, in an iterative fashion.
Compounds selection
Descriptors selection
F1.4 Compounds Selection: Step 1
The topic Compounds Selection: Step 1 contains the following 5 pages:
■ Compounds Selection
■ Predictions by Interpolation
■ Example of Extrapolative Model
■ Identification of Outliers
■ Biological Activities in Terms of Log 1/C
F1.4.1 Compounds Selection
The selection of the compounds is the first step in building a QSAR model and consists of assembling a sufficiently large and diverse set of compounds with known biological activities. The molecules should be selected with great care in order to define a set of compounds that is homogeneous and represents the system well.
Compounds selection
Descriptors selection
Building the QSAR model
Validating the model
F1.4.2 Predictions by Interpolation
The compounds selected for a QSAR analysis should cover a large range of values for those descriptors believed to be relevant to biological activity. This increases the probability that future compounds will have descriptors within this range and allows predictions to be interpolative rather than extrapolative. As a rule, interpolative predictions are more accurate than extrapolative predictions.
[Figure: two panels, "Poor compound selection" and "Better compound selection", showing the selected compounds in the space of two descriptors together with the interpolative zone and the extrapolative zone]
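A simple way to check whether a prediction would be interpolative is to test whether each descriptor of the new compound lies inside the range spanned by the training set. The sketch below assumes this per-descriptor range test, which is a simplification of the zones drawn in the figure; the function name `is_interpolative` and the descriptor values are hypothetical.

```python
def is_interpolative(training, candidate):
    """Return True if every descriptor of the candidate lies within the
    [min, max] range observed for that descriptor in the training set."""
    columns = list(zip(*training))  # one tuple of values per descriptor
    return all(min(col) <= value <= max(col)
               for col, value in zip(columns, candidate))


# Hypothetical training set: each tuple holds two descriptor values per compound.
training_set = [(0.5, -0.2), (1.2, 0.1), (2.0, 0.6), (3.1, 0.3)]

print(is_interpolative(training_set, (1.5, 0.0)))  # True: inside both ranges
print(is_interpolative(training_set, (4.0, 0.0)))  # False: first descriptor out of range
```

A training set that spans a wide range for each descriptor makes it more likely that new compounds pass this test.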
F1.4.3 Example of Extrapolative Model
Extrapolating a model to values that are outside the range of the training set may lead to incorrect predictions. In the following example the experimental points lie on a straight line; however, at higher values the relationship is more complex and no longer linear.
F1.4.4 Identification of Outliers
QSAR modeling is based on the assumption of homogeneity and an absence of influential outliers in the training set. An outlier can be a molecule acting according to a different mechanism of action, an improper biological activity as reported by another laboratory, or simply an incorrect value (experimental or typographic error). Repeating measurements of biological activities and using as many molecules as possible help reduce the distortions introduced by outliers.
[Figure legend: with outlier, without outlier]
F1.5 Descriptors Selection: Step 2
The topic Descriptors Selection: Step 2 contains the following 14 pages:
■ Descriptors Selection
■ Methods for Selecting Relevant Descriptors
■ Manual Selection of Descriptors
■ Automated Selection of Descriptors
■ Systematic Combination of Descriptors
■ Methods for Selecting a Subset of Descriptor
■ Forward Selection
■ Backward Elimination
■ Stepwise Regression
■ Scaling Descriptors
■ Correlation Between Descriptors
■ Example of Correlated Descriptors
■ Solution to the Problem of Correlated Descriptors
For the entire list, see the navigation panel.
F1.5.1 Descriptors Selection
As mentioned earlier in this chapter, the number of available descriptors for QSAR analyses is very large. A good model is based on a small number of well-chosen descriptors. When many descriptors are screened, a fortuitous correlation may occur. In the following pages important rules for the selection of relevant descriptors are presented.
F1.5.2 Methods for Selecting Relevant Descriptors
Relevant descriptors can be selected either manually or by using automated approaches. For each method, computer programs are available that help in the selection of relevant descriptors.
F1.5.3 Manual Selection of Descriptors
The manual method is based on a thorough understanding of the SAR and on exploiting intuitions generated by the analyses. For example, if preliminary analyses indicate that steric or hydrophobic substituents may increase activity, descriptors such as the molar refractivity (MR) and the hydrophobic substituent constant, π, should be selected first.
F1.5.4 Automated Selection of Descriptors
The second method looks at the selection of descriptors in an automated manner, using programs that score and rank them. Automated and manual methods can also be combined to select relevant descriptors and select those that are easy to interpret. Modern methods use genetic algorithms based on natural evolution principles (Darwin).
F1.5.5 Systematic Combination of Descriptors
In principle the identification of the best descriptors can be accomplished by a systematic evaluation of all their combinations. For each combination, a QSAR equation can be derived and then ranked. The highest-ranked equation will reveal the best subset of descriptors. However, this systematic approach is not always feasible: for n descriptors (current software can process 2000), there are 2^n − 1 different combinations (subsets). In the following pages we present automated methods that circumvent this difficulty.
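To get a feel for how quickly the number of non-empty subsets of n descriptors grows, the count can be computed directly. This is a small illustrative sketch; the function name `count_subsets` is chosen here and is not part of the original course.

```python
from itertools import combinations


def count_subsets(n_descriptors: int) -> int:
    """Number of non-empty subsets of n descriptors: 2^n - 1."""
    return 2 ** n_descriptors - 1


# Cross-check the formula by explicit enumeration for a small case.
names = ["logP", "MR", "MW"]
enumerated = [c for k in range(1, len(names) + 1) for c in combinations(names, k)]
print(len(enumerated), count_subsets(3))  # 7 7

for n in (10, 20):
    print(n, count_subsets(n))  # 1023 and 1048575

# For 2000 descriptors the count has 603 decimal digits.
print(len(str(count_subsets(2000))))  # 603
```

Even at 20 descriptors, more than a million equations would have to be derived and ranked, which is why exhaustive evaluation is rarely practical.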
F1.5.6 Methods for Selecting a Subset of Descriptor
"Forward selection", "backward elimination" and "stepwise regression" are methods for selecting a subset of descriptors from a large descriptor pool. The process starts with an initial subset of descriptors, then successive small alterations of this subset are made and assessed. If an alteration improves the model, the change is accepted; otherwise it is rejected. The procedure terminates when it is not possible to improve the model further.
F1.5.7 Forward Selection
The "forward selection" method starts with the single descriptor which best correlates with the dependent parameter. At each subsequent step the method adds the next most contributing descriptor. The process stops when the addition of a descriptor does not improve the model's performance as assessed by appropriate statistical indices.
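The forward selection procedure can be sketched as a greedy loop. The example below is a minimal illustration under stated assumptions: the model is an ordinary least-squares fit, the statistical index is R², and the stopping rule is a hypothetical minimum gain in R² (`min_gain`), since R² alone never decreases when a descriptor is added. The data are synthetic, with the activity depending only on descriptors 1 and 4.

```python
import numpy as np


def r_squared(X, y):
    """R^2 of an ordinary least-squares fit of y on the columns of X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    return 1.0 - (residuals @ residuals) / np.sum((y - y.mean()) ** 2)


def forward_selection(X, y, min_gain=0.01):
    """Greedily add the descriptor (column index) that most improves R^2."""
    selected, best = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining:
        scores = {j: r_squared(X[:, selected + [j]], y) for j in remaining}
        j_best = max(scores, key=scores.get)
        if scores[j_best] - best < min_gain:
            break  # no remaining descriptor improves the model enough
        selected.append(j_best)
        remaining.remove(j_best)
        best = scores[j_best]
    return selected, best


# Synthetic data: 40 "molecules", 6 descriptors, activity driven by columns 1 and 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 6))
y = 2.0 * X[:, 1] - 1.5 * X[:, 4] + rng.normal(scale=0.1, size=40)

selected, r2 = forward_selection(X, y)
print(sorted(selected), round(r2, 3))
```

In practice other indices (for example a cross-validated statistic) may be preferred as the stopping criterion; the threshold used here is only one possible choice.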
F1.5.8 Backward Elimination
The "backward elimination" method starts with a model that includes all the descriptors. At each step the method removes a descriptor whose removal does not degrade the model's performance. The process is stopped when performance starts to decline as assessed by relevant statistical indices.
F1.5.9 Stepwise Regression
The "stepwise regression" method starts (as in forward selection) with the single descriptor that best correlates with the dependent parameter. At each subsequent step the method adds the next most contributing descriptor and can potentially remove non-contributing descriptors. The process is stopped when additional descriptors do not improve the model or when removing descriptors causes the model's performance to decline, as assessed by appropriate statistical indices.
F1.5.10 Scaling Descriptors
Descriptors represent a broad range of physico-chemical properties. They need to be calibrated in order to provide a good balance of their respective influence when they are combined. Scaling treatment consists of a mathematical operation called "normalization" which sets boundaries for the variation of each descriptor.
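One common way to set boundaries on the variation of each descriptor is min-max scaling, which maps every descriptor column onto the interval [0, 1]. The sketch below assumes this particular form of normalization; the text does not specify which one is intended, and the descriptor values are hypothetical.

```python
import numpy as np


def min_max_scale(X):
    """Rescale each column (descriptor) of X to the range [0, 1]."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero for constant columns
    return (X - lo) / span


# Hypothetical descriptors on very different scales: logP and molecular weight.
descriptors = np.array([
    [-4.0, 332.0],
    [-3.7, 324.0],
    [-5.6, 291.0],
    [-4.8, 340.0],
])
scaled = min_max_scale(descriptors)
print(scaled.round(2))
```

After scaling, a one-unit change means the same thing for every descriptor (the full observed range), so no descriptor dominates the model merely because of its units.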
F1.5.11 Correlation Between Descriptors
When two descriptors essentially convey the same information about a series of molecules they are said to be correlated. The use of correlated descriptors in the same equation must be avoided, because the information they characterize is over-represented when both are present. A "correlation matrix" provides useful information on the degree of correlation of different pairs of descriptors.
F1.5.12 Example of Correlated Descriptors
Consider for example the molecular weight and the number of carbon atoms as two descriptors characterizing a series of alkanes. These two descriptors are highly correlated, which can be shown graphically.
[Plot: molecular weight versus number of carbons (# Carbons) for a series of alkanes]
F1.5.13 Solution to the Problem of Correlated Descriptors
When two descriptors are highly correlated, the solution is to remove one of them. The descriptor that carries strong structural information is preferred and the less intuitive one is removed. An alternative solution consists of removing the descriptor that has the highest correlation with the other descriptors.
Structure	MW	# Carbons
C3H8	44.1	3
C4H10	58.1	4
C5H12	72.2	5
C6H14	86.2	6

Structure	MW
C3H8	44.1
C4H10	58.1
C5H12	72.2
C6H14	86.2
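The degree of correlation between the two descriptors in the alkane table can be quantified with a correlation matrix. The sketch below uses NumPy's `corrcoef` on the four MW and carbon-count values from the table; the variable names are chosen here for illustration.

```python
import numpy as np

# Values from the alkane table: molecular weight and number of carbon atoms.
mw = np.array([44.1, 58.1, 72.2, 86.2])
n_carbons = np.array([3, 4, 5, 6])

# 2 x 2 correlation matrix; the off-diagonal entry is the correlation
# coefficient between the two descriptors.
corr = np.corrcoef(mw, n_carbons)
print(corr.round(4))
```

The off-diagonal entry is essentially 1, which confirms that the two descriptors carry the same information for this series and that one of them can be dropped.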
F1.5.14 The Holy Grail in QSAR
There is a general consensus that in a meaningful QSAR equation, the number of molecules in the training set should exceed the number of descriptors by a factor of 3 to 5.
n descriptors
molecules	activity	d1	d2	d3	dn
1			0.2		
2	0.50	3.5	0.5		54
3	2.10	5.1		21	53
4	-0.70			-7	
Number of molecules > 3n
F1.6 Deriving the Equation: Step 3
The topic Deriving the Equation: Step 3 contains the following 24 pages:
■ Deriving The QSAR Equation
■ The Starting Point: The Study Table
■ Graphical Analysis of the Data
■ Choice of the Mathematical Equation
■ Complexity Levels and Data Overfitting
■ Mathematics are Very (too) Powerful
■ Illustration with an Example
■ A Simple Model
■ A Complex Model
■ Comparing the Two Models
■ Predictive Power of the Simple Model
■ Predictive Power of the Complex Model
■ Complexity Dictated by Predictability of the Model
For the entire list, see the navigation panel.
F1.6.1 Deriving The QSAR Equation
Step 3 consists of deriving the QSAR equation corresponding to the set of descriptors that were selected in the previous step.
F1.6.2 The Starting Point: The Study Table
The starting point for deriving a QSAR equation is the study table. It consists of a spreadsheet with molecules across the rows and molecular characteristics (biological activity, descriptors) down the columns. Typically, the first column indicates the molecular identification (e.g. compound number or name, 2D structure), the second column its activity, and subsequent columns the values of the corresponding descriptors.
Property of interest
Descriptors
Compound	Activity	LogP	MR	MW	HOMO	Density
1	98	-4.03	87.10	332.2	-12.0	1.47
2	24	-3.68	76.53	324.4	-11.5	1.43
3	28	-4.34	91.23	290.3	-11.2	1.37
4	64	-5.19	100.2	310.1	-9.2	1.36
5	18	-5.59	91.32	291.5	-10.2	1.41
n	52	-4.83	72.12	340.3	-11.3	1.36
F1.6.3 Graphical Analysis of the Data
The study table should lead to graphical analyses. This step is of paramount importance and leaves room for "hunches" and preliminary interpretations. This is where the key questions are asked: is there an order? Are the points distributed according to known patterns? Can the recognized trends be translated into physico-chemical expressions? etc...
[Figure: four plots of the data, one each for descriptor x, descriptor y, descriptor z and descriptor k]
F1.6.4 Choice of the Mathematical Equation
After having identified trends in the system, the correlation process can begin. The initial analyses help guide the choice of the right mathematical equation. This equation should not be treated as a black-box; rather it should contain information that reflects the behavior and allows for interpretation of the system in a structural manner. Sound structural informational content in a QSAR equation is of utmost importance for formulating step 3.
F1.6.5 Complexity Levels and Data Overfitting
The next hurdle is the mathematical equation. At this stage the complexity of the model depends on both the form of the mathematical equation and the number of descriptors considered.
single linear regression:
Activity = a (descriptor) + b
parabolic model:
Activity = a (descriptor)^2 + b
multiple linear regression:
Activity = a (descriptor1) + b (descriptor2) + c (descriptor3) + d ...
other models: parabolic, bilinear, probability, equilibrium, etc.
F1.6.6 Mathematics are Very (too) Powerful
QSAR models can be skewed unintentionally by overly powerful mathematical choices. An equation that fits the data of a training set precisely can yield an equation that is perfect mathematically but meaningless for molecules other than those in the training set. For example if the training set consists of 20 molecules, it is always possible to select a set of 20 randomly chosen descriptors and solve the mathematical system for 20 equations and 20 unknowns. This error is known as data-overfitting.
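20 equations and 20 unknowns
$$\begin{aligned}
\text{activity of 1} &= a_{1,1}\, d_1 + a_{1,2}\, d_2 + a_{1,3}\, d_3 + \dots + a_{1,20}\, d_{20} \\
\text{activity of 2} &= a_{2,1}\, d_1 + a_{2,2}\, d_2 + a_{2,3}\, d_3 + \dots + a_{2,20}\, d_{20} \\
\text{activity of 3} &= a_{3,1}\, d_1 + a_{3,2}\, d_2 + a_{3,3}\, d_3 + \dots + a_{3,20}\, d_{20} \\
\text{activity of 4} &= a_{4,1}\, d_1 + a_{4,2}\, d_2 + a_{4,3}\, d_3 + \dots + a_{4,20}\, d_{20} \\
&\;\;\vdots
\end{aligned}$$
[Figure labels: biological activities, unknowns, descriptors]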
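The overfitting argument can be demonstrated numerically. In the sketch below, 20 arbitrary "activities" are fitted exactly with 20 random, meaningless "descriptors" by solving the 20 x 20 linear system; the same coefficients are then applied to a second random set. All values are synthetic and the variable names are chosen here for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 20

# Training set: arbitrary activities and 20 randomly generated descriptors.
activities = rng.normal(size=n)
descriptors = rng.normal(size=(n, n))

# 20 equations, 20 unknown coefficients: an exact solution exists
# (a random square matrix is invertible with probability 1).
coefficients = np.linalg.solve(descriptors, activities)
fit_error = np.max(np.abs(descriptors @ coefficients - activities))

# The same coefficients applied to new, unrelated "molecules".
new_descriptors = rng.normal(size=(n, n))
new_activities = rng.normal(size=n)
test_error = np.max(np.abs(new_descriptors @ coefficients - new_activities))

print(f"largest error on training set: {fit_error:.2e}")
print(f"largest error on new set:      {test_error:.2e}")
```

The fit to the training set is perfect up to rounding error even though the descriptors carry no information, while the error on new data is large: a mathematically exact but meaningless model.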
F1.6.7 Illustration with an Example
To illustrate the data-overfitting problem, let's take a series of compounds for which the permeability through the blood brain barrier (BBB) has been found to be correlated with their logP and polar surface area. In the following graph we have plotted a hypothetical series of compounds in this space and color-coded them according to their BBB permeability. Compounds colored green are permeable whereas compounds colored red are not.
[Plot: hypothetical compounds in the logP / polar surface area space, color-coded by BBB permeability; legend: permeable]
F1.6.8 A Simple Model
A linear model for differentiating between BBB permeable and BBB impermeable compounds can be formulated by drawing a straight line through the logP / Polar surface area space. Most of the compounds on the left side of the line are BBB permeable whereas most of the compounds on its right are BBB impermeable. As the model correctly classifies 45 out of the 50 compounds it has a success rate of 90%.
[Plot: a straight line dividing the logP / polar surface area space; X axis: polar surface area]
F1.6.9 A Complex Model
A model with an improved success rate can be generated by drawing a curved line across the logP / Polar surface area space. This model completely separates the BBB permeable compounds from the BBB impermeable compounds and thus has a success rate of 100%.
[Plot: a curved line dividing the logP / polar surface area space; X axis: polar surface area]
F1.6.10 Comparing the Two Models
Which of the two models better distinguishes BBB permeable from BBB impermeable compounds? Clearly the complex model has a higher success rate. However, to achieve this it distorts its shape to correctly classify the outliers, thereby completely reflecting the scatter of the training data; it is therefore an overfitted model. On the other hand, the simple model mislabels the outliers on the assumption that they are indeed outliers.
[Figure: the simple and the complex model side by side, with the outliers marked; X axis of both plots: polar surface area]
F1.6.11 Predictive Power of the Simple Model
The simple model predicts that all test compounds lying to the left of the line are BBB permeable and all those lying to the right of the line are BBB impermeable. Assuming that the test compounds are similar to the training compounds, the predictive power of this model is expected to be high.
[Plot: test compounds with the simple model; annotations: "Predicted to be impermeable and is probably so", "Predicted to be permeable and is probably so"; X axis: polar surface area]
F1.6.12 Predictive Power of the Complex Model
The complex model also predicts that all test compounds lying to the left of the line are BBB permeable and all those lying to the right of the line are BBB impermeable. However, under the same assumption of similarity between test compounds and training compounds, many of its predictions are expected to be erroneous.
[Plot: test compounds with the complex model; annotations: "Predicted to be permeable but is probably impermeable", "Predicted to be impermeable but is probably permeable"; X axis: polar surface area]
F1.6.13 Complexity Dictated by Predictability of the Model
In the QSAR approach, tailoring an equation to the peculiarities of a training set is not difficult. However, forcing the mathematics to fit too closely to the data may lead to meaningless models in terms of predictability (tools for assessing the predictability of a QSAR model will be presented in Step 4). The real issue is to stop the refinements early enough so that the predictive capabilities of the model are not lost.
[Plot: X axis: complexity level]
F1.6.14 Single Linear Equation: Mathematical Outline
The simplest form of a QSAR equation is a linear model with one descriptor. This simply yields the equation of a straight line of the form y = b0 + b1x, where b0 indicates the intercept of the line with the y axis and b1 the slope of the line. b0 and b1 are calculated as described on the next page.
F1.6.15 Calculating b0 and b1
\[ b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \]
\[ b_0 = \bar{y} - b_1 \bar{x} \]
:ulations are presented for
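As a minimal sketch of the slope and intercept formulas on this page, the following Python function computes b0 and b1 from paired descriptor and activity values. The function name `fit_line` and the toy data are illustrative assumptions and are not part of the course material.

```python
def fit_line(x, y):
    """Return (b0, b1) of the least-squares line y = b0 + b1*x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Numerator and denominator of the slope formula
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_mean - b1 * x_mean
    return b0, b1


# Hypothetical toy data lying exactly on the line y = 1 + 2x
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
b0, b1 = fit_line(x, y)
print(b0, b1)
```

Because the toy points lie exactly on a straight line, the function recovers an intercept of 1 and a slope of 2; with real data the line only approximates the points.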
F1.6.16 Multiple Linear Regression: Mathematical Outline
It is not always possible to correlate biological activities with a single descriptor (linear model with one descriptor). Given that biological action results from the combined influence of many factors, one can extend the QSAR model to multiple descriptors. Indeed, the observation that several parameters used simultaneously can lead to good models prompted the development of a method referred to as "multiple linear regression" (MLR). In this model linearity is maintained for each of the individual descriptors.
[Figure: the MLR equation, with its coefficients and descriptors labeled]
F1.6.17 Example: MLR vs. Single Linear Models
The example of anticonvulsant compounds shown below demonstrates that none of the descriptors Es, σ and logP alone was able to give a good correlation with the biological activities (r less than 0.40). However, by using logP and σ simultaneously, a significant improvement was made (r = 0.80). The addition of Es improves the model even more (r = 0.95). This indicates that the biological properties result from the combined action of lipophilicity, steric and electronic effects.
Models, from bad to good:
Equation	r
log 1/C = 0.009 Es + 3.411	0.03
log 1/C = -0.626 σ + 3.314	0.27
log 1/C = -0.078 logP + 3.432	0.38
log 1/C = -0.210 logP - 2.214 σ + 3.154	0.80
log 1/C = 0.21 Es - 0.238 logP - 3.81 σ + 3.046	0.95
F1.6.18 The Mathematics of MLR: a Single Sample
In MLR we try to express activity as a linear combination of descriptors. We recognize the fact that in most cases, our fit to the experimental data will not be perfect and error is usually unavoidable. In the equations listed below, y (the activity) is a scalar; x_j is the value of the descriptor j and b_j its associated coefficient; e is the error. In the matrix notation, x^T is a row vector of the descriptors and b a column vector of their associated coefficients.
\[ y = b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots + b_m x_m + e \]
\[ y = \sum_{j=1}^{m} b_j x_j + e \]
Matrix notation:
\[ y = \mathbf{x}^T \mathbf{b} + e \]
F1.6.19 The Mathematics of MLR: Many Molecules
For the case of multiple compounds, the activity values are assembled into a vector y of length n, where n is the number of compounds. The descriptors are collected into an n by m matrix X, where m is the number of descriptors. The coefficients are collected into a vector b of length m and the errors into another vector e of length n.
\[ \mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{e} \]
F1.6.20 The Solution of MLR
In the MLR formalism we search for the (unknown) set of coefficients b, which, when multiplied by the (known) descriptors, best approximates the (known) activity data (equation 1). A solution to this problem can be obtained through a matrix inversion procedure (equation 2).
\[ \mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{e} \quad (1) \]
\[ \mathbf{b} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y} \quad (2) \]
b: the unknown vector of coefficients
X: the original descriptors matrix
X^T: the transpose of the original descriptors matrix. A transposed matrix replaces columns with rows and vice versa.
The superscript "-1" indicates matrix inversion.
y: the known vector of activities
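The matrix solution of equation (2) can be sketched in plain Python as shown here. This is an illustration under stated assumptions: the helper names (`transpose`, `matmul`, `solve`, `mlr_coefficients`) and the toy data are hypothetical, and instead of forming the inverse explicitly the sketch solves the equivalent linear system (X^T X) b = X^T y by Gauss-Jordan elimination, which gives the same coefficients.

```python
def transpose(m):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    """Matrix product of two lists-of-rows matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def solve(a, rhs):
    """Solve a z = rhs for a square matrix a (Gauss-Jordan, partial pivoting)."""
    n = len(a)
    aug = [list(a[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        # Bring the row with the largest entry in this column to the pivot position
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        # Eliminate this column from every other row
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [aug[i][n] for i in range(n)]


def mlr_coefficients(X, y):
    """Coefficients b satisfying (X^T X) b = X^T y, i.e. equation (2)."""
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    Xty = [sum(v * yi for v, yi in zip(row, y)) for row in Xt]
    return solve(XtX, Xty)


# Hypothetical data: 4 compounds, 2 descriptors, activities y = 2*x1 - 1*x2
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [2.0, -1.0, 1.0, 3.0]
b = mlr_coefficients(X, y)
print(b)
```

In practice a numerical linear algebra library would normally be used for this step; the hand-written solver only serves to keep the example self-contained.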
F1.6.21 Analysis of the MLR Equation
One of the purposes of QSAR analyses is to understand the forces governing the activity of a particular class of compounds and to assist drug design. In the example shown below, QSAR analyses reveal that the relative importance of the descriptors varies in the following order: logP > σ > Es > MR; therefore the biological activities are governed in the first place by hydrophobicity (logP) and polarity (σ) and to a lesser extent by steric effects (Es and MR).
[Figure: relative importance of the descriptors logP, σ, Es and MR]
F1.6.22 Non-Linear Equations
A non-linear equation is an extension of a multiple linear regression. In some systems a linear relationship is not sufficient to achieve a good correlation. Hansch was the first to introduce a parabolic term; with such terms, a complex biological process can be satisfactorily modeled by non-linear equations.
F1.6.23 Example of Non-Linear Model
In the example below, the anticonvulsant activities of a set of molecules were initially found to be linearly correlated with logP. However, it is implausible to assume that the biological activity can increase indefinitely by increasing the lipophilicity of the molecules. It is known that highly lipophilic compounds cannot reach their site of action, because they are trapped in lipophilic environments. It is therefore more realistic to improve the initial model using a non-linear equation. The modified equation proved to be correct and revealed the existence of an optimum logP value, information that could not be derived from molecules with a small range of logP values.
[Figure: log (1/C) plotted against logP for the linear model, log (1/C) = 0.73 logP + 2.5, and for the non-linear model]
F1.6.24 Typical Non-Linear Equations
There are many reasons why the use of non-linear models is justified, including the kinetics of the drug transport, the equilibrium control of its distribution, allosteric effects, different pharmacokinetics, metabolism, solubility, etc. The following are examples of non-linear models that have proved to be valid at least for special and complex biological systems.
Parabolic Model (Hansch)
log 1/C = a (logP)^2 + b logP + c
Probability Model (McFarland)
log 1/C = a logP - 2a log (P+1) + c
Equilibrium Model (Hyde)
log 1/C = a logP - log (aP+1) + c
Bilinear Model (Kubinyi)
log 1/C = a logP - b log (βP+1) + c
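To illustrate how a non-linear model yields an optimum logP, the sketch below evaluates the parabolic model and locates its maximum. Setting the derivative 2a logP + b to zero gives logP = -b / (2a), which is a maximum only when a is negative. The function names and the coefficient values are hypothetical and chosen purely for illustration.

```python
def parabolic_model(logp, a, b, c):
    """Parabolic model: log 1/C = a*(logP)^2 + b*logP + c."""
    return a * logp ** 2 + b * logp + c


def optimum_logp(a, b):
    """logP at which the parabolic model peaks (requires a < 0)."""
    if a >= 0:
        raise ValueError("a must be negative for the model to have a maximum")
    return -b / (2 * a)


# Hypothetical coefficients, not taken from any study
a, b, c = -0.25, 1.0, 2.0
logp_opt = optimum_logp(a, b)
best_activity = parabolic_model(logp_opt, a, b, c)
print(logp_opt, best_activity)
```

With these assumed coefficients the activity rises up to logP = 2 and falls beyond it, which is the behavior a purely linear model cannot express.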
F1.7 Validating the Model: Step 4
The topic Validating the Model: Step 4 contains the following 19 pages:
■ Tools for Assessing the Quality of a Model
■ Predictive and non-Predictive Models
■ The Standard Deviation
■ Correlation Index r2
■ The Mathematics of r2
■ TSS, the Total Variance
■ RSS, the Unexplained Variance
■ t-test for Single Descriptors and Significance of r2
■ Shape of t-distribution and Number of Molecules
■ Student's t-test Procedure
■ F-test for Assessing the Significance of r2
■ Performing the F-test
■ F-test Procedure
For the entire list, see the navigation panel.
F1.7.1 Tools for Assessing the Quality of a Model
Efficient tools are necessary for assessing the validity of a QSAR model. Numerical analyses or statistical methods provide a variety of indexes that serve to evaluate the quality of the model and its limitations. In the following pages we present some of these tools and explain how to use them.
F1.7.2 Predictive and non-Predictive Models
Broadly speaking there are two groups of indices: (1) those that indicate how well the QSAR equation can "reproduce" the experimental data and (2) those that can tell how far the model can be extrapolated to new molecules.
F1.7.3 The Standard Deviation
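The easiest way to "validate" a QSAR model is to calculate the standard error or standard deviation (SD or s). It is calculated from the deviations (the "residuals") between the measured activities and those calculated by the model: the squared residuals are summed (the residual sum of squares, RSS), divided by the number of degrees of freedom, and the square root is taken. This index reflects how large the deviation between the data and the model is. The smaller the SD, the better the quality of the model.
\[ s = \sqrt{\frac{RSS}{N - k - 1}} \]
N = number of molecules, k = number of descriptors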
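A short sketch of the standard deviation calculation follows, assuming the common definition s = sqrt(RSS / (N - k - 1)), where N is the number of molecules and k the number of descriptors. The function name is hypothetical; the residuals are the seven values quoted for the Capsaicin analogs in F1.8.7, and the result agrees with the value s = 0.28 reported in F1.8.10.

```python
import math


def standard_deviation(residuals, n_descriptors):
    """s = sqrt(RSS / (N - k - 1)) for N residuals and k descriptors."""
    n = len(residuals)
    rss = sum(r ** 2 for r in residuals)
    return math.sqrt(rss / (n - n_descriptors - 1))


# Residuals (measured minus calculated log EC50) quoted in F1.8.7
residuals = [0.28, -0.12, -0.36, 0.16, 0.19, -0.01, -0.34]
s = standard_deviation(residuals, n_descriptors=1)
print(round(s, 2))
```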
F1.7.4 Correlation Index r2
The most frequently used index for evaluating the performance of a QSAR model is r2 (squared correlation coefficient). r2 measures the degree of correlation between the activity values calculated by the model and those measured experimentally. The value of r2 can range from 0 (no correlation) to 1 (perfect correlation).
[Figure: measured activity plotted against calculated activity for r2 = 1 (perfect correlation), r2 = 0.5 and r2 = 0]
F1.7.5 The Mathematics of r2
Mathematically, r2 is calculated by dividing the variance explained by the model (the "explained sum of squares", ESS) by the original variance (the "total sum of squares", TSS). ESS is equal to the total variance (TSS) minus that portion of the variance which was not explained by the model (the residual sum of squares, RSS).
\[ r^2 = \frac{ESS}{TSS} = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS} \]
F1.7.6 TSS, the Total Variance
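In line with the worked calculation for the Capsaicin analogs in F1.8.7, TSS can be written as the sum of the squared deviations of the measured activities y_i from their mean over the N compounds. The notation below is an assumption chosen to match the other formulas in this topic.

```latex
\[ TSS = \sum_{i=1}^{N} (y_i - \bar{y})^2, \qquad \bar{y} = \frac{1}{N} \sum_{i=1}^{N} y_i \]
```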
F1.7.7 RSS, the Unexplained Variance
In order to obtain ESS, the variance explained by the QSAR model, we start from the fact that the total variance is the sum of the explained and unexplained variances. Thus, the explained variance is the difference between the total variance and the unexplained variance. That portion of the variance which is left unexplained by the QSAR model (the unexplained variance, RSS) is obtained by summing, over all compounds, the squared differences between the measured activity and the predicted activity (as given by the regression line).
\[ RSS = \sum_{i=1}^{N} (y_{calc,i} - y_i)^2 \]
F1.7.8 t-test for Single Descriptors and Significance of r2
r2 alone is not sufficient to determine whether the relationship has occurred by chance; for a single descriptor, its significance can be assessed using the t-statistic as follows. Suppose we repeat the process of deriving a QSAR equation many times, each time using a different descriptor, and calculate the resulting r2 values. If the number of molecules is large (> 30), the sampling distribution of the resulting r2 values will have a normal (i.e., Gaussian) shape. If the number of molecules is small, it will have a shape known as a t-distribution.
[Figure: normal (Gaussian) distribution and t-distribution. Values on the x-axis represent standard deviations from the mean located at x = 0.]
F1.7.9 Shape of t-distribution and Number of Molecules
A value of r2 = 1 will always be obtained for a set of two molecules, irrespective of the descriptor used for the QSAR analysis. However, as the number of molecules increases, the probability of obtaining large r2 values with irrelevant descriptors decreases. This probability corresponds to the area under the t-distribution curve (see below), away from the center (where r2 = 0). The shape of the t-distribution therefore depends on the number of molecules used in the analysis.
F1.7.10 Student's t-test Procedure
The Student t-test employs the t-distribution to test whether the correlation coefficient obtained from the QSAR analysis is significantly different from 0. The larger the t-value, the larger the probability that r2 significantly differs from 0; that is, the larger the probability that the descriptor used for the analysis is relevant to the activity. Technically, the steps involved in the Student t-test are as follows.
\[ t = r \sqrt{\frac{N - 2}{1 - r^2}} \]
1. Calculate t according to the above equation.
2. Select a significance level (e.g., 0.05).
3. Look up the t value from a t-distribution derived for the correct number of data points (N) at the selected significance level.
4. If the calculated t-value is larger than the listed t-value, then the regression equation is significant at this significance level.
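The first and last steps of the procedure can be sketched as follows, using the formula t = r sqrt((N - 2) / (1 - r2)) applied in F1.8.8. The function name is hypothetical; the inputs (r2 = 0.89, N = 7) and the tabulated value 2.998 are the ones quoted for the Capsaicin analogs in F1.8.8.

```python
import math


def t_statistic(r2, n):
    """t = r * sqrt((N - 2) / (1 - r^2)) for a single-descriptor model."""
    return math.sqrt(r2) * math.sqrt((n - 2) / (1 - r2))


t = t_statistic(r2=0.89, n=7)
t_tabulated = 2.998  # listed t value quoted in F1.8.8 for p = 0.01
is_significant = t > t_tabulated
print(round(t, 2), is_significant)
```

The listed t value still has to be taken from a t-table for the chosen significance level; the sketch only automates the arithmetic and the final comparison.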
F1.7.11 F-test for Assessing the Significance of r2
The F-test is an extension of the t-test for the case of many descriptors. Like the t-test, it tests (and hopefully rejects) the assumption that the model did not explain any of the original variance in the data set (i.e., ESS = 0). The F-test uses an F-distribution which, like the t-distribution, depends on the number of compounds, and in addition on the number of descriptors.
[Figure: F-distributions for 10 molecules with 4 descriptors and for 100 molecules with 10 descriptors]
F1.7.12 Performing the F-test
The F-test employs the F-distribution to test whether the correlation coefficient obtained from the MLR analysis significantly differs from 0. The larger the F-value, the larger the probability that r2 significantly differs from 0; i.e. the greater the probability that the descriptors used for the analysis are relevant to the activity. Technically, the steps involved in the F-test are as follows.
\[ r^2 = \frac{ESS}{TSS} \quad \text{and} \quad r^2 = 1 - \frac{RSS}{TSS} \]
\[ TSS = \frac{ESS}{r^2} = \frac{RSS}{1 - r^2} \quad \Rightarrow \quad \frac{ESS}{RSS} = \frac{r^2}{1 - r^2} \]
Calculate F according to this equation:
\[ F = \frac{ESS / k}{RSS / (N - k - 1)} = \frac{r^2 (N - k - 1)}{k (1 - r^2)} \]
N = number of molecules, k = number of descriptors
F1.7.13 F-test Procedure
The application of the steps involved in evaluating the significance of r2 for the Capsaicin analogs using the F-test proceeds as follows:
Calculate F:
\[ F = \frac{r^2 (N - k - 1)}{k (1 - r^2)}; \quad r^2 = 0.92; \; N = 8; \; k = 3 \]
\[ F = \frac{0.92\,(8 - 3 - 1)}{3\,(1 - 0.92)} = 15.33 \]
Select a significance level (p): p = 0.01
Look up the F value from an F-distribution with N = 8, k = 3, p = 0.01:
\[ F_{tab} = 7.59 \]
The calculated F value (15.33) is larger than the tabulated F value (7.59). Thus, the correlation is significant at this level. The probability that the correlation is fortuitous is less than 1%.
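The F calculation can be reproduced with a few lines of Python. The function name is a hypothetical choice; the formula F = r2 (N - k - 1) / (k (1 - r2)) and the input values are those quoted on this page and in F1.8.9.

```python
def f_statistic(r2, n, k):
    """F = r^2 (N - k - 1) / (k (1 - r^2)) for N molecules and k descriptors."""
    return r2 * (n - k - 1) / (k * (1 - r2))


# Values quoted on this page (three descriptors) and in F1.8.9 (one descriptor)
f_three = f_statistic(r2=0.92, n=8, k=3)
f_one = f_statistic(r2=0.89, n=7, k=1)
print(round(f_three, 2), round(f_one, 2))
```

Both results exceed the tabulated F values quoted in the text (7.59 and 12.25 respectively), which is the condition for significance at the chosen level.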
F1.7.14 Assessing the Predictive Power of a Model
r2, t and F are indices that can be generated to evaluate QSAR results. However, these parameters basically only tell us about the ability of the QSAR model to reproduce the data from which it was derived and not its aptitude to predict the activities of new compounds. Two methods are presented in the following pages to estimate the predictive power of a QSAR model.
F1.7.15 The Test Set Method
The first method is known as the "test set method" and consists of partitioning the initial data into two sets, a preferred strategy when a large set of compounds is available. The initial data set is randomly divided into two parts; the first one is used to build a QSAR model and the second one to validate this model.
[Figure: the initial data set divided into a training set and a test set]
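A random partition into a training set and a test set might be sketched as follows. The function name, the 30% test fraction, the fixed seed and the compound labels are all assumptions made for illustration; the course does not prescribe a particular split ratio.

```python
import random


def split_train_test(compounds, test_fraction=0.3, seed=0):
    """Randomly divide compounds into (training_set, test_set)."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(compounds)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical compound labels
compounds = [f"compound_{i}" for i in range(1, 11)]
training_set, test_set = split_train_test(compounds)
print(len(training_set), len(test_set))
```

The QSAR model would then be derived from `training_set` only, and its predictions compared with the measured activities of `test_set`.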
F1.7.16 The Cross Validation Method
The second method is known as "the cross validation method"; it is preferred when the data set is too small to set aside a separate test set. In this method the data are randomly divided into N equal parts; N-1 parts are used to build a model, which is then used to predict the activities of the molecules in the remaining Nth part. The procedure is repeated until the activities of all compounds have been predicted independently.
[Figure: cross validation in five stages (Stage 1 to Stage 5); in each stage a different part of the data serves as the test set and the remaining parts form the training set]
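The partitioning step of cross validation can be sketched as a generator that yields one training/test division per stage. The function name, the seed and the choice of ten compounds in five parts are illustrative assumptions.

```python
import random


def cross_validation_folds(n_compounds, n_parts, seed=0):
    """Yield (training_indices, test_indices), one pair per stage."""
    indices = list(range(n_compounds))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_parts parts of (nearly) equal size
    parts = [indices[p::n_parts] for p in range(n_parts)]
    for stage, test in enumerate(parts):
        training = [i for p, part in enumerate(parts) if p != stage for i in part]
        yield training, test


stages = list(cross_validation_folds(n_compounds=10, n_parts=5))
for number, (training, test) in enumerate(stages, start=1):
    print(f"Stage {number}: test set {sorted(test)}")
```

Each compound appears in exactly one test set, so after the last stage every activity has been predicted by a model that was built without that compound.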
F1.7.17 Limits of the Cross Validation Method
With the cross validation method, the QSAR model that is ultimately used to predict the activities of new compounds is derived from all N data points and is therefore different from the N partial QSAR models (i.e. those derived from the N-1 data points). Therefore cross validation does not provide us with the predictive power of a specific QSAR equation but rather with an estimate of our ability to make predictions for compounds similar to those used in our QSAR analysis.
[Figure: descriptor space (axis: Descriptor 1) showing the compounds used in the original QSAR analysis, the region where prediction estimates made by cross validation apply, and the region where they do not apply]
F1.7.18 The Predictive Index Q2
The predictive power of the model, termed Q2, is computed by analogy with r2, the difference being the use of the PRESS (predicted sum of squares) rather than the RSS (residual sum of squares) in the numerator. PRESS is calculated as the sum of the squared differences between the measured activity and the predicted activity for the test set compounds.
\[ r^2 = 1 - \frac{RSS}{TSS} \qquad Q^2 = 1 - \frac{PRESS}{TSS} \]
\[ RSS = \sum_{i=1}^{N} (y_{calc,i} - y_i)^2 \qquad PRESS = \sum_{i=1}^{N} (y_{pred,i} - y_i)^2 \]
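The difference between r2 and Q2 can be made concrete with a small sketch. It assumes leave-one-out cross validation (each part holds a single compound) and a single-descriptor linear model, and it uses the π and log EC50 values of the seven Capsaicin analogs tabulated in F1.8.3. The function name `fit_line` is hypothetical.

```python
def fit_line(x, y):
    """Return (b0, b1) of the least-squares line y = b0 + b1*x."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = sxy / sxx
    return y_mean - b1 * x_mean, b1


# pi and log EC50 of the seven Capsaicin analogs (table in F1.8.3)
pi = [0.0, 0.71, -0.28, -0.57, 1.96, 0.18, 1.12]
log_ec50 = [1.07, 0.09, 0.66, 1.42, -0.62, 0.64, -0.46]

y_mean = sum(log_ec50) / len(log_ec50)
tss = sum((yi - y_mean) ** 2 for yi in log_ec50)

# RSS: every compound is calculated by the model fitted to all compounds
b0, b1 = fit_line(pi, log_ec50)
rss = sum((b0 + b1 * xi - yi) ** 2 for xi, yi in zip(pi, log_ec50))

# PRESS: every compound is predicted by a model fitted without it
press = 0.0
for i in range(len(pi)):
    x_train = pi[:i] + pi[i + 1:]
    y_train = log_ec50[:i] + log_ec50[i + 1:]
    c0, c1 = fit_line(x_train, y_train)
    press += (c0 + c1 * pi[i] - log_ec50[i]) ** 2

r2 = 1 - rss / tss
q2 = 1 - press / tss
print(f"r2 = {r2:.2f}, Q2 = {q2:.2f}")
```

For a least-squares model PRESS cannot be smaller than RSS, so Q2 never exceeds r2; a Q2 far below r2 is a warning sign of overfitting.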
F1.7.19 Summary
When discussing mathematical tools available for assessing the quality of a QSAR model we saw that (1) the standard deviation is an isolated "absolute" index of local meaning; (2) with r2 it is possible to compare different models, but this index is only mathematical - not statistical; (3) t and F have a statistical content that can be used for single and multiple linear regression respectively; however they only measure the ability of the QSAR model to reproduce the data from which it was constructed.
log£= 1.14 log P + 0.16
n = 25; r2 = 0.91; s = 0.155; F = 66.4; Q2 = 0.875
n: number of molecules
r2: correlation coefficient for assessing the quality of the model
s: standard deviation
F: F-value for assessing the statistical significance
Q2: regression coefficient for measuring the predictability
F1.8 Example of Simple Linear Regression
The topic Example of Simple Linear Regression contains the following 11 pages:
■ Example of Capsaicin Analogs
■ Relevant Descriptors of Capsaicin Analogs
■ The Capsaicin Study Table
■ Graphical Analysis of Capsaicin Analogs
■ Deriving a QSAR Linear Equation
■ Experimental vs. Calculated Values
■ Calculating r2 for the Capsaicin analogs
■ t-test for the Capsaicin Analogs
■ F-test for a Series of the Capsaicin Analogs
■ The QSAR Equation for the Capsaicin Analogs
■ Predicting the Activities of Unknown Compounds
F1.8.1 Example of Capsaicin Analogs
Capsaicin analogs were studied for their analgesic properties and we will use this study to illustrate the derivation of a simple QSAR model. The biological activities (EC50) were measured for some analogs as indicated below. The question is whether, on the basis of these data, it is possible to develop a QSAR model and predict the biological activities of new compounds.
F1.8.2 Relevant Descriptors of Capsaicin Analogs
The selection of descriptors that correlate with the target biological activity is mandatory for the derivation of a meaningful QSAR model. For Capsaicin analogs, biological activity appears to be influenced by the lipophilicity of the substituent R. Following this assumption the descriptors deemed most suitable are the molar refractivity (MR) and the hydrophobic substituent constant π.
Lipophilicity Descriptors
π: encodes the lipophilic behavior
MR: contains information on the volume
F1.8.3 The Capsaicin Study Table
The following table summarizes the MR and π values which were calculated for the seven Capsaicin analogs. As discussed above, activities (EC50) are expressed as their log values.
Compound	log EC50	π	MR
1	1.07	0	1.03
2	0.09	0.71	6.03
3	0.66	-0.28	7.36
4	1.42	-0.57	6.33
5	-0.62	1.96	25.36
6	0.64	0.18	15.55
7	-0.46	1.12	13.94
F1.8.4 Graphical Analysis of Capsaicin Analogs
For Capsaicin analogs, if we plot the values from the study table for MR and π, respectively, there seems to be a weak correlation between the biological activity and the molar refractivity (MR). However, the hydrophobic substituent constant π shows a possible linear correlation.
F1.8.5 Deriving a QSAR Linear Equation
The correlation between π and the biological activities is represented by the equation y = b0 + b1x, where b0 is the intercept of the line with the y axis and b1 the slope of the line. We show below how to calculate their numerical values.
\[ b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \]
\[ b_0 = \bar{y} - b_1 \bar{x} \]
F1.8.6 Experimental vs. Calculated Values
There is a difference between the experimental and the calculated values, as shown below.
F1.8.7 Calculating r2 for the Capsaicin analogs
For Capsaicin analogs, r2 is calculated as follows.
\[ \bar{y} = \frac{1.07 + 0.09 + 0.66 + 1.42 + (-0.62) + 0.64 + (-0.46)}{7} = 0.4 \]
\[ TSS = (1.07-0.4)^2 + (0.09-0.4)^2 + (0.66-0.4)^2 + (1.42-0.4)^2 + (-0.62-0.4)^2 + (0.64-0.4)^2 + (-0.46-0.4)^2 = 3.49 \]
\[ RSS = (0.28)^2 + (-0.12)^2 + (-0.36)^2 + (0.16)^2 + (0.19)^2 + (-0.01)^2 + (-0.34)^2 = 0.40 \]
\[ r^2 = \frac{3.49 - 0.40}{3.49} = \frac{3.09}{3.49} = 0.89 \]
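The whole chain from the study table to r2 can be retraced with a short sketch: a least-squares fit of log EC50 against π, followed by TSS, RSS and r2. Variable names are illustrative assumptions. Because the sketch works with unrounded least-squares coefficients, individual residuals and the RSS may differ in the second decimal from the rounded figures quoted on this page, while r2 still rounds to 0.89.

```python
# pi and log EC50 of the seven Capsaicin analogs (table in F1.8.3)
pi = [0.0, 0.71, -0.28, -0.57, 1.96, 0.18, 1.12]
log_ec50 = [1.07, 0.09, 0.66, 1.42, -0.62, 0.64, -0.46]

n = len(pi)
x_mean = sum(pi) / n
y_mean = sum(log_ec50) / n

# Slope and intercept of the least-squares line y = b0 + b1*x
b1 = (sum((x - x_mean) * (y - y_mean) for x, y in zip(pi, log_ec50))
      / sum((x - x_mean) ** 2 for x in pi))
b0 = y_mean - b1 * x_mean

# Calculated activities and the sums of squares
calculated = [b0 + b1 * x for x in pi]
tss = sum((y - y_mean) ** 2 for y in log_ec50)
rss = sum((yc - y) ** 2 for yc, y in zip(calculated, log_ec50))
r2 = (tss - rss) / tss

print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")
print(f"TSS = {tss:.2f}, RSS = {rss:.2f}, r2 = {r2:.2f}")
```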
F1.8.8 t-test for the Capsaicin Analogs
The steps involved in evaluating the significance of r2 are as follows.
Calculate t:
\[ t = r \sqrt{\frac{N - 2}{1 - r^2}}; \quad r^2 = 0.89; \; N = 7 \]
\[ t = \sqrt{0.89} \, \sqrt{\frac{7 - 2}{1 - 0.89}} = 6.3604 \]
Select a significance level (p): p = 0.01
Look up the t value from a t-distribution with N = 7, p = 0.01: t = 2.998
The calculated t value (6.3604) is larger than the tabulated t value (2.998). Thus, the correlation is significant at this level. The probability that the correlation is fortuitous is less than 1%.
F1.8.9 F-test for a Series of the Capsaicin Analogs
The steps involved in evaluating the significance of r2 using the F-test proceed as indicated below. The F-test analyses finally indicate that a significant correlation is obtained and that the probability of a chance correlation is less than 1%.
Calculate F:
\[ F = \frac{r^2 (N - k - 1)}{k (1 - r^2)}; \quad r^2 = 0.89; \; N = 7; \; k = 1 \]
\[ F = \frac{0.89\,(7 - 1 - 1)}{1\,(1 - 0.89)} = 40.45 \]
Select a significance level (p): p = 0.01
Look up the F value from an F-distribution with N = 7, k = 1, p = 0.01:
\[ F = 12.25 \]
The calculated F value (40.45) is larger than the tabulated F value (12.25). Thus, the correlation is significant at this level. The probability that the correlation is fortuitous is less than 1%.
F1.8.10 The QSAR Equation for the Capsaicin Analogs
QSAR studies reveal the importance of lipophilicity in the analgesic properties of a series of Capsaicin analogs, as indicated by the good correlation found with the π descriptor. The correlation coefficient r2 is 0.89, and analyses of the significance of the equation (t-test and F-test) show that there is less than a 1% probability that the relationship is due to chance. This validates the use of π as a descriptor for the structure-activity relationships.
r2=0.89; s=0.28; t=6.36; F=40.45
F1.8.11 Predicting the Activities of Unknown Compounds
The derived QSAR model can be used to predict the biological activities of novel capsaicin analogs by inserting their corresponding π values into the QSAR equation. For example, the biological activity of the amide analog indicated below is predicted with an EC50 of 0.98 µM.
[Figure: the amide analog and its π value]