A basic introduction to QSAR

The slideshow below presents some fundamental information about QSAR. From this page you can learn what a QSAR model is and how model developers build one. You can view the presentation here or download it as a PPT file.


Quantitative structure-activity relationship (QSAR) methods are based on the assumption that the activity of a chemical compound is related to its structure.

More precisely, this approach says that the activity or property, for instance the toxic effect, is related to the chemical structure through a certain mathematical algorithm, or rule. QSAR models are also called in silico methods, although that term actually covers a somewhat broader set of methods.


These methods are real scientific methods and not science fiction.


There is nothing magic about these methods.


The basic assumption can be illustrated with this example. Let’s consider these pumpkins. We can distinguish them on the basis of their size, or shape, or weight. It means that we take into consideration some general characteristics of the pumpkins (photograph courtesy of G. Gini).


Let’s consider this particular pumpkin. In this case, what makes this pumpkin different from another one is some particular parts of the pumpkin, which make it a Halloween pumpkin (photograph courtesy of G. Gini).


In the case of chemical compounds, the ways of considering their structure in QSAR models are similar to those seen for the pumpkins:

we can consider a particular characteristic of the chemical compound, which is either present in the structure or not. For instance, it is well known that if the chemical compound contains certain groups, like an aromatic amine or an epoxide, there is a higher probability that the compound is genotoxic. In this figure, Ashby drew a hypothetical chemical compound containing a series of fragments associated with a higher genotoxic effect.
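
The presence-or-absence idea can be sketched in a few lines of Python. This is only an illustration under simple assumptions: each compound is described by a hand-written list of fragment names, the function name `alerts_found` is hypothetical, and the alert list contains only the two groups named above. A real tool would detect fragments directly from the chemical structure.

```python
# Illustrative alert list: only the two groups mentioned in the text.
GENOTOXICITY_ALERTS = {"aromatic amine", "epoxide"}


def alerts_found(fragments):
    """Return the sorted alert fragments that are present in a compound."""
    return sorted(GENOTOXICITY_ALERTS & set(fragments))


# Made-up compounds, each described by a hand-written list of fragments.
compounds = {
    "compound A": ["aromatic amine", "hydroxyl"],
    "compound B": ["hydroxyl", "methyl"],
}

for name, fragments in compounds.items():
    hits = alerts_found(fragments)
    if hits:
        print(name, "-> alert(s) present:", ", ".join(hits))
    else:
        print(name, "-> no alert found")
```

Note that a compound without any alert is not proven safe by this check; the alert only indicates a higher probability of the effect.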


In other cases, it is well known that the toxic effect of chemicals within a series of very similar compounds varies in a linear way. For instance, the toxic effect of the chemicals in the figure increases with the molecular weight (MW).


The typical way to derive a QSAR model is represented here. The basic assumption is that there is a mathematical function of the chemical properties which is related to the effect.

Thus the effect, called y, is a function, called f, of the chemical properties, called x. Mathematically, y = f(x). But how do we find this function f? Typically, we use a number of chemical compounds with known values of the toxic effect (y). For each chemical compound we calculate a series of parameters, called chemical descriptors. Then we look for an algorithm that gives, for each compound, a value close to the real experimental value. The final step is to check whether the algorithm obtained in this way can predict the property values of other chemicals that were not used to build the model. This last phase is called validation of the QSAR model.

This last phase is very important: the model must work not only for the chemical substances in the training set, but also for other chemicals. The challenge is to define the correct statistical properties of the model.
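
The steps described above (collect compounds with known y, compute descriptors x, find f, then validate on other chemicals) can be sketched as follows. This is a minimal illustration, assuming a single descriptor, a straight-line model fitted by least squares, and made-up numbers. The function names `fit_line` and `rmse` are hypothetical, and real QSAR models typically use many descriptors and more elaborate algorithms.

```python
import math


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def rmse(xs, ys, slope, intercept):
    """Root-mean-square error between predicted and experimental values."""
    squared = [(slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up (descriptor x, experimental effect y) pairs.
training = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
validation = [(1.5, 3.0), (3.5, 7.1)]  # compounds NOT used to build the model

train_x = [x for x, _ in training]
train_y = [y for _, y in training]
validation_x = [x for x, _ in validation]
validation_y = [y for _, y in validation]

# Build the model on the training set only.
slope, intercept = fit_line(train_x, train_y)

# Compare the error on the training set with the error on unseen compounds.
print("training error:  ", rmse(train_x, train_y, slope, intercept))
print("validation error:", rmse(validation_x, validation_y, slope, intercept))
```

A validation error much larger than the training error would suggest that the model describes the training compounds well but does not generalize to other chemicals.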


There are many chemical descriptors, and we can calculate thousands of them. Here we list some of the major families of the chemical descriptors.


For the algorithms, too, we can use many different kinds of mathematical tools. We only mention that some models serve to split the compounds into classes, for instance toxic or not toxic (classification methods), while other methods serve to calculate a continuous value, such as the toxic dose of a chemical (regression methods).
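
The difference between the two kinds of output can be illustrated with a small sketch. It assumes a simple nearest-neighbour approach on a single descriptor, chosen here only because it is short to write; it is not a method named in the presentation. All data and function names are made up.

```python
from collections import Counter


def nearest(train, x, k=3):
    """Return the k training rows whose descriptor value is closest to x."""
    return sorted(train, key=lambda row: abs(row[0] - x))[:k]


def classify(train, x, k=3):
    """Classification: majority label among the k nearest compounds."""
    labels = [label for _, label in nearest(train, x, k)]
    return Counter(labels).most_common(1)[0][0]


def regress(train, x, k=3):
    """Regression: mean value among the k nearest compounds."""
    values = [value for _, value in nearest(train, x, k)]
    return sum(values) / len(values)


# Made-up data: (descriptor value, class) and (descriptor value, toxic dose).
class_data = [(1.0, "not toxic"), (1.5, "not toxic"), (2.0, "not toxic"),
              (4.0, "toxic"), (4.5, "toxic"), (5.0, "toxic")]
dose_data = [(1.0, 50.0), (1.5, 40.0), (2.0, 30.0),
             (4.0, 10.0), (4.5, 8.0), (5.0, 6.0)]

print(classify(class_data, 4.2))  # a class label
print(regress(dose_data, 4.2))    # a continuous value
```

The classifier returns one of a fixed set of labels, while the regression returns a number on a continuous scale.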


In the USA there is a long tradition of using QSAR models, since the authorities have to evaluate chemical substances within a relatively short period of time.


In the USA the use of QSAR models to support the evaluation of chemical properties is quite common.


In Europe the recent REACH legislation put new emphasis on the possible use of QSAR models to characterize the properties of chemical substances. REACH is the legislation that regulates the market of chemical substances, both produced and imported.

All chemical substances require characterization of their environmental and toxicological properties before they can be used in Europe. This is a huge effort, which requires the chemical industry to report the properties of its chemicals. REACH, from its first article, promotes innovation and states that new, alternative methods should be sought to improve the evaluation of chemicals. QSAR methods are among these alternative methods.


REACH aims to protect human beings and the environment. The reason why innovation is mentioned is that regulators are aware that existing methods to evaluate the effects on health and the environment are mainly in vivo (animal) methods. However, the current methods are not sufficient to provide adequate knowledge of the effects on human beings and the environment. Alternative methods, including in silico models and QSAR, may aim to mimic the in vivo methods, but may also be used to cover the knowledge gap that exists today.


Sometimes it is said that QSAR models are merely a way for industry to spend less on toxicological research, or to reduce the number of animals used in experiments. It is also argued that the real target of REACH is the protection of health and the environment, and that other methods are more reliable for this purpose. In fact, this position disagrees with what REACH itself indicates. As we have seen, innovation is a cornerstone of REACH, and REACH calls for all pieces of information to be used in order to reach a more reliable evaluation. The real challenge is not to identify the single best method to protect human beings and the environment. The challenge is to take advantage of all the contributions that each approach, in vivo, in vitro, and in silico, offers. For this reason it is important to understand the multiple advantages offered by QSAR models. There are at least seven reasons to use QSAR.

1. Innovation. We have already discussed this. Very soon a huge amount of data, on 100,000 chemicals, will become available through USA initiatives such as Tox21 and ToxCast. Analysing such a huge amount of data manually is obviously impossible, so computer-based, in silico methods will be a privileged way to explore these new values.

2. The time necessary for the experiments, in particular for the chronic endpoints, would be very long. QSAR methods are very fast.

3. The number of laboratories in Europe able to carry out all the necessary studies may be insufficient. For instance, a recent evaluation done within ANTARES identified this lack of suitable laboratories in Italy.

4. The cost of performing all the in vivo experiments would be very high, in the billions of euros. QSAR methods are often free.

5. Millions of animals would have to be used to get the necessary data. This raises ethical issues.

6. Regulators have to identify the riskiest compounds and check the dossiers submitted by industry. Many tens of thousands of chemicals will be submitted to ECHA, the Agency responsible for the registration process. However, only a limited number of dossiers, about 5%, will be evaluated by regulators. This means that the toxicity properties submitted by industry will not be checked for 19 out of every 20 registered chemicals. QSAR methods may offer tools to screen all registered chemicals and so identify those predicted to be more toxic.

7. So far, the chemical industry has steered the development of new chemicals only on the basis of the target properties of the candidate products. Critical issues, such as toxicity, have been analysed at a second stage, to comply with regulations. QSAR methods offer tools to evaluate toxic properties from the very beginning of the planning of new compounds, as part of a pro-active strategy. This minimizes the impact of chemicals on the environment and human beings, and reduces the economic resources wasted on developing chemicals without knowledge of their toxicological and environmental properties.


QSAR can serve several purposes under REACH. Results of QSAR models can be used to obtain data for the registration dossier of a chemical substance, for classification and labeling of the substance, or to prioritize the submitted chemicals according to their risk, as we discussed above.

For these different purposes different QSAR methods can be used, and different requirements apply to the reliability needed. Indeed, to prioritize chemicals we can use models even if we are not 100% sure of their reliability, because the risky chemicals will be evaluated later on using other methods. Conversely, if the value obtained by the QSAR method is to be used to characterize the properties of the chemical substance in the dossier on that substance, much more reliable values are needed.


Thus, QSAR models can be quite easily used for prioritization.


More requirements apply when QSAR models are used for classification and labeling. However, we have to consider that industry may simply state that no data is available for a certain property. Thus the value obtained by a QSAR model, even with some uncertainty about its reliability, in most cases represents a better guess than no guess at all.


As we said, to prepare the dossier with the requested data, the quality of the values has to be good. In this case we can consider the QSAR value as the single source of data, or as an additional source of data. We agree that in many cases QSAR models are not reliable as the sole source of data. We believe that in some cases, under defined conditions, values from QSAR models can be used as the single source. In any case, we also think that values from QSAR models represent a valid source of additional data in many more cases.


The reliability of QSAR models strongly depends on the property. For physico-chemical properties, mainly where thousands of data points are available, good models exist. For this kind of endpoint, the uncertainty of the experimental data is limited. For more complex endpoints, and in particular for chronic endpoints for human toxicity, the amount of data is more limited, and the uncertainty of the experimental values is much higher.


Thus, the reliability of a QSAR model depends on the purpose for which we want to use it (dossier preparation, classification and labeling, or prioritization) and on the data we use for the model. The QSAR model is like a bridge: we start from one side, with some data, and we want to reach the other side, which is not unique but depends on the purpose, as we said.


Certain QSAR models are quite risky, and only experts may use them, like some risky bridges.


For some quite standardized, simple QSAR models, the reliability is quite high.


The aim of ANTARES is to evaluate the basis of QSAR models and to identify a number of reliable QSAR models.