UK-QSAR Spring 2024 meeting

UK-QSAR Spring 2024 meeting: “The Nuts and Bolts of AI: Model Derivation and Application”
Thursday April 11th 2024

The Petersfield Lecture Theatre,
The Cambridge Building,
Babraham Research Campus,
Babraham, Cambridge CB22 3AT

Charm Therapeutics and Chemical Computing Group are very happy to welcome you to the Spring 2024 UK-QSAR meeting, to be held on 11th April 2024 at the Babraham Research Campus near Cambridge.

The theme for the meeting is “The Nuts and Bolts of AI: Model Derivation and Application”. Artificial Intelligence (AI) and Machine Learning (ML) are phrases we hear a lot these days; they bring together techniques that our community has been using for decades. There is a danger that “AI” may be treated as a “black box”, with a great deal of trust being placed in the results of its application, whereas how models are derived and trained of course has a huge influence on their output. We must therefore consider how models are produced when defining their applications, and how much weight we give their conclusions. Our talks for today will focus on these issues.

As ever, the meeting is free of charge to attend, but registration in advance is essential – “walk-up” attendance is not possible.


Registration is now closed.


Poster abstract submission is now closed.

Please follow these links to download the poster list and poster abstracts.

Location and Transport

The Babraham Research Campus is located 9km south east of Cambridge city centre, next to the A1307 Cambridge to Haverhill road, and just off the A11.

For those arriving at Cambridge’s central railway station, we recommend the number 13 bus service, operated by Stagecoach, which runs half-hourly from Cambridge (Drummer Street bus station) to Haverhill. It stops close to where Station Road joins Hills Road, and close to the access road roundabout for the Babraham Research Campus (the stop name is “Cambridge lodge”). The fare is £2. For more information and a live timetable, see “13 Bus Route & Timetable: Cambridge – Haverhill” on the Stagecoach website.

Places on the shuttle bus organised from and to Cambridge’s central railway station are no longer available.

A taxi from / to Cambridge central station will cost around £20 each way.

There is ample car-parking on site – drivers should provide the registration number of their vehicle when registering for the event. There is also a dedicated cycleway from central Cambridge adjacent to the A1307, all the way out to Babraham.


10:00-10:15  Welcome Remarks
10:15-10:45  Lewis Mervin and Gian Marco Ghiandoni (AstraZeneca) – “QSARtuna: Enabling high-quality production predictive AI/ML at AstraZeneca”
10:45-11:15  Michael Parker (Optibrium) – “Rapid AI generation of optimised compound designs guided by user interactions”
11:15-11:45  Rachael Skyner (OMass Therapeutics) – “Forget about the model – what about the data?”
11:45-12:15  Damjan Krstajic (Serbian Research Centre for Cheminformatics) – “Can predictive models admit when they don’t know?”
13:45-14:15  Madeleine Taylor (Strathclyde University) – “Development of molecular descriptors for quantitative structure-retention relationships”
14:15-14:45  Markus Kossner (Chemical Computing Group) – “Reverse Fingerprinting: Application to Motif Detection and Pharmacophore Query Generation”
15:15-15:45  Finlay MacLean (Charm Therapeutics) – “Seedling: a scoring and generation framework for protein-ligand co-folding”
15:45-16:15  Fernanda Duarte (University of Oxford) – “Bridging the Gap: Domain Adaptation for Heterocycle Retrosynthesis Prediction”


QSARtuna: Enabling high-quality production predictive AI/ML at AstraZeneca

Lewis Mervin¹, Gianmarco Ghiandoni²

¹Molecular AI, Discovery Sciences, R&D, AstraZeneca, Cambridge, UK

²Augmented DMTA Engineering, R&D IT, R&D, AstraZeneca, Cambridge, UK

Model development and engineering are tightly coupled phases in the field of predictive AI/ML for drug discovery. They are in fact critical to its uptake in medicinal chemistry projects which require both accuracy and performance when dealing with real-world data. Here we present QSARtuna, an automated model building framework for molecular property prediction that we have built in-house at AstraZeneca, and describe the ways in which models are deployed, consumed, and combined with other software in practice (e.g., for de novo design using REINVENT). Impact examples are also discussed to show how our tools are used to progress compounds in projects. To conclude, we provide an outlook on the future of predictive AI/ML to enable actionable decision making in drug discovery.


Rapid AI generation of optimised compound designs guided by user interactions

Michael Parker, Optibrium, Cambridge, UK

We present a novel AI approach for generative chemistry, rapidly generating new compound designs with improved properties. We pair a generative transformer model with a Bayesian optimisation algorithm to identify desirable property changes from user interactions and generate new compound ideas meeting those criteria. We show that the model can identify user goals within a multi-dimensional parameter space within a few interactions, and successfully generate relevant, optimised compound designs meeting multi-parameter goals. This powerful combination allows chemists to obtain new AI generated compounds quickly and easily, tailored to their project goals, without having to spend time defining complex filters and multi-parameter property criteria.


Forget about the model – what about the data?

Rachael Skyner¹, Elliot Nelson¹ and Dominga Evangelista²

¹OMass Therapeutics, Building 4000, Chancellor Court, John Smith Drive, Oxford Business Park, ARC, Oxford, OX4 2GX

²Department of Pharmacy and Biotechnology, University of Bologna, Via Belmeloro 6, 40126, Bologna, Italy

In cheminformatics, the focus is often on refining predictive models and applying new predictive techniques to unsolved challenges. However, the quality and appropriateness of the underlying data are critical for robust and meaningful outcomes. We will delve into the intricacies of data utilisation, challenging the conventional use of benchmarking datasets straight out of the box.

First, we explore the pitfalls of adopting benchmarking datasets without scrutiny. We discuss effective techniques to identify and interrogate data, shedding light on known issues with widely used benchmarks. By advocating for a critical examination of datasets, we hope to help others to consider and identify problems in pre-prepared datasets and therefore enhance the reliability of their results.

Navigating novel or less-investigated prediction problems without a pre-curated dataset can be especially challenging. The second part of this talk presents case studies and general guidelines for building datasets, utilising examples such as dataset curation from ChEMBL. Through these insights, we hope to provide valuable solutions and starting points for cases where a more tailored dataset is required.

The discussion extends to diverse approaches for preparing data to ensure unbiased training and testing. By examining various methods, we aim to equip researchers with a toolkit to enhance the fairness and generalizability of their models.

In conclusion, this talk aims to emphasise the significance of thoughtful data curation, interrogation, and processing, providing practical guidelines and shifting focus from model-centric thinking to data-centric thinking as a starting point for model building in cheminformatics.


Can predictive models admit when they don’t know?

Damjan Krstajic, Director, Research Centre for Cheminformatics, Belgrade, Serbia

We are of the opinion that during the design of a binary classifier one ought to consider adding an “I don’t know” answer. We make the case for introducing this third category when a human needs to make a decision based on the answer from a binary classifier. We discuss several approaches that may be used in this scenario. A procedure to define “I don’t know” predictions in binary classifiers, called all leave-one-out models (ALOOM), is presented as a proof of concept.
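The abstract does not spell out the ALOOM procedure, so the following is only a minimal sketch of one plausible reading of the name: fit one model per left-out training sample and answer “I don’t know” whenever those models disagree about a new sample. The nearest-centroid classifier and the function names (`fit_centroids`, `predict_centroid`, `aloom_predict`) are illustrative assumptions, not the speaker's implementation.

```python
from statistics import mean


def fit_centroids(X, y):
    """Fit a nearest-centroid model: the per-class mean of each feature.

    The choice of base classifier is an assumption made for illustration.
    """
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids


def predict_centroid(centroids, x):
    """Return the label whose centroid is closest to x (squared Euclidean)."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(x, centroid))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def aloom_predict(X, y, x_new):
    """Predict with every leave-one-out model; abstain unless all agree."""
    votes = set()
    for i in range(len(X)):
        # Train on all samples except the i-th one.
        model = fit_centroids(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        votes.add(predict_centroid(model, x_new))
    return votes.pop() if len(votes) == 1 else "I don't know"


# Toy one-dimensional data: class 0 near 0, class 1 near 10.
X = [[0.0], [1.0], [9.0], [10.0]]
y = [0, 0, 1, 1]
print(aloom_predict(X, y, [0.5]))  # all leave-one-out models agree
print(aloom_predict(X, y, [5.0]))  # models disagree near the class boundary
```

In this sketch a sample close to the decision boundary flips its predicted class depending on which training point is left out, which is what triggers the abstention.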


Development of molecular descriptors for quantitative structure-retention relationships

Madeleine Taylorᵃ, Roman Szucsᵇ, Lucy Morganᶜ, Roland Brownᶜ, David Palmerᵃ

ᵃPure and Applied Chemistry, University of Strathclyde, G1 1XL, UK; ᵇDepartment of Analytical Chemistry, Faculty of Natural Sciences, Comenius University, Bratislava, Slovakia; ᶜPfizer Global R&D, Sandwich, UK.

QSRR models are widely used in the pharmaceutical industry to help identify unknown compounds in HPLC screening experiments [1]. These models rely on high quality, relevant descriptors. Traditional descriptors are focused on solute features, but chromatographic retention is a phenomenon defined by solvation and partition interactions. Therefore, new molecular descriptors are developed that describe solvation structure using the reference interaction site model (RISM). The usefulness of these descriptors has been proven previously for predictions of solvation free energy [2], entropy of solvation, and enthalpy of solvation [3]. They are adapted for chromatography by modelling the chromatographic conditions including the mobile phases and an analogue of the stationary phase. Together, these describe the dynamic equilibrium in the column.

Datasets provided by Pfizer have been used to validate these descriptors. 1D RISM equations were solved for the analyte molecules in various solvents with pyRISM solver software [4]. Compared to a PLS model using Mordred descriptors alone, the addition of RISM descriptors for solutes in methanol increased R² from 0.515 to 0.717, and decreased RMSD from 1.01 to 0.77 min. Additionally, an outlier with atypical chemical structure had its percentage error reduced from 97% to 34% with the addition of these physics-based descriptors. Together, the descriptors display synergy which suggests that the information provided by RISM descriptors is complementary to that provided by the standard 2D descriptors. The QSRR methods to utilise these descriptors are also being developed, including more advanced machine learning algorithms and a procedure to predict differences in retention times.
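For readers less familiar with the metrics quoted above, here is a minimal sketch of how R², RMSD and percentage error are commonly computed for retention-time predictions. The abstract does not state which definition of R² was used, so the coefficient-of-determination form below is an assumption, and the function names and retention times are purely illustrative.

```python
import math


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


def rmsd(observed, predicted):
    """Root-mean-square deviation, in the same units as the inputs (e.g. min)."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def percentage_error(observed, predicted):
    """Absolute error of a single prediction as a percentage of the observation."""
    return abs(predicted - observed) / abs(observed) * 100


# Made-up retention times in minutes, for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.2, 1.9, 3.3, 3.8]
print(r_squared(observed, predicted))
print(rmsd(observed, predicted))
print(percentage_error(observed[0], predicted[0]))
```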

[1] P. R. Haddad, M. Taraji and R. Szücs, Anal. Chem., 2021, 93, 228–256.

[2] D. S. Palmer, M. Mišin, M. V. Fedorov and A. Llinas, Mol. Pharm., 2015, 12, 3420–3432.

[3] D. J. Fowles and D. S. Palmer, Phys. Chem. Chem. Phys., 2023, 25, 6944–6954.

[4] A. Ahmad, 2AUK/pyRISM, DOI: 10.5281/zenodo.7783600, 2023.


Reverse Fingerprinting: Application to Motif Detection and Pharmacophore Query Generation

Markus Kossner, Principal Scientist & Scientific Services Manager, Chemical Computing Group, Köln, Germany

‘Reverse Fingerprinting’ is a method that uses feature list fingerprints to detect differentiating structural elements in small molecule and protein datasets. This talk covers the theory of reverse fingerprinting and presents examples of its application to detecting important structural motifs, coloring atoms by activity contribution, generating 3D pharmacophore queries and identifying liability regions in protein structures.


Seedling: a scoring and generation framework for protein-ligand co-folding

Finlay MacLean, Charm Therapeutics, Cambridge, UK

In place of manually designing molecules one-by-one, generative chemistry approaches such as REINVENT promise to aid medicinal chemists to probe the chemical space associated with their design hypotheses. While much work has been done on developing de novo molecular generation algorithms based on machine learning, current approaches rely on simple scoring functions such as QSAR models, docking, and simple properties such as QED and logP. This greatly limits the power of such methods.

At Charm we have built a state-of-the-art molecule optimisation platform, Seedling, that incorporates molecular dynamics as well as our proprietary protein-ligand co-folding algorithm, DragonFold, to perform structure-based molecular generation.

In Seedling, a suite of generators ‘grow’ chemically reasonable designs based on “seed” molecules. Using a distributed platform to perform thousands of computations in parallel, these designs are then scored by an arsenal of physics-based simulations and machine learning models. Molecules are selected for expensive simulations via an active learning strategy, allowing us to efficiently search chemical space for the most promising molecules. This platform enables our experts to formulate complex hypotheses and return the next day to evaluate the most promising ideas.


Bridging the Gap: Domain Adaptation for Heterocycle Retrosynthesis Prediction

Fernanda Duarte, University of Oxford

Heterocycles are important scaffolds in medicinal chemistry that can modulate drug binding and pharmacokinetic properties. Despite their importance, existing datasets on heterocyclic compounds often lack information on how to actually make them, making it challenging to access novel heterocycles. While retrosynthetic prediction models have emerged as promising approaches to assist synthetic chemists, their performance is poor for heterocycle formation reactions due to low data availability.

In this talk, I discuss our efforts to overcome the low data availability problem and improve the performance of retrosynthesis prediction models for ring-breaking disconnections. We explore four different methods to improve these models by leveraging transfer learning techniques, reaching >60% of predictions that are both chemically valid and involve breaking a ring. We illustrate the applicability of this model by successfully recreating the synthesis routes of recently published drug-like compounds.


Please note that paper copies of the agenda and abstracts will not be provided onsite, so we recommend that you bookmark this page on your device.

In case of any questions, please contact Steve Maginn at

We’re looking forward to welcoming you to the Babraham Research Campus on April 11th!