Abstracts & Pre-Reading Material
Some speakers have provided the following abstracts and references, which might be of interest ahead of their talks.
Qptuna: Enabling high-quality production predictive AI/ML at AstraZeneca
Lewis Mervin1, Gianmarco Ghiandoni2
1Molecular AI, Discovery Sciences, R&D, AstraZeneca, Cambridge, UK
2Augmented DMTA Engineering, R&D IT, R&D, AstraZeneca, Cambridge, UK
lewis.mervin1@astrazeneca.com
Model development and engineering are tightly coupled phases in the field of predictive AI/ML for drug discovery. They are in fact critical to its uptake in medicinal chemistry projects which require both accuracy and performance when dealing with real-world data. Here we present Qptuna, an automated model building framework for molecular property prediction that we have built in-house at AstraZeneca, and describe the ways in which models are deployed, consumed, and combined with other software in practice (e.g., for de novo design using REINVENT). Impact examples are also discussed to show how our tools are used to progress compounds in projects. To conclude, we provide an outlook on the future of predictive AI/ML to enable actionable decision making in drug discovery.
=====================
Rapid AI generation of optimised compound designs guided by user interactions
Michael Parker, Optibrium, Cambridge, UK
We present a novel AI approach for generative chemistry, rapidly generating new compound designs with improved properties. We pair a generative transformer model with a Bayesian optimisation algorithm to identify desirable property changes from user interactions and generate new compound ideas meeting those criteria. We show that the model can identify user goals within a multi-dimensional parameter space within a few interactions, and successfully generate relevant, optimised compound designs meeting multi-parameter goals. This powerful combination allows chemists to obtain new AI generated compounds quickly and easily, tailored to their project goals, without having to spend time defining complex filters and multi-parameter property criteria.
========================
Forget about the model – what about the data?
Rachael Skyner,1 Elliot Nelson1 and Dominga Evangelista2
1OMass Therapeutics, Building 4000, Chancellor Court, John Smith Drive, Oxford Business Park, ARC, Oxford, OX4 2GX; †rachael.skyner@omass.com
2 Department of Pharmacy and Biotechnology, University of Bologna, Via Belmeloro 6, 40126, Bologna, Italy
In cheminformatics, the focus is often on refining predictive models and applying new predictive techniques to unsolved challenges. However, the quality and appropriateness of the underlying data are critical for robust and meaningful outcomes. We will delve into the intricacies of data utilisation, challenging the conventional use of benchmarking datasets straight out of the box.
First, we explore the pitfalls of adopting benchmarking datasets without scrutiny. We discuss effective techniques to identify and interrogate data, shedding light on known issues with widely used benchmarks. By advocating for a critical examination of datasets, we hope to help others to consider and identify problems in pre-prepared datasets and therefore enhance the reliability of their results.
Navigating novel or less-investigated prediction problems without a pre-curated dataset can be especially challenging. The second part of this talk presents case studies and general guidelines for building datasets, utilising examples such as dataset curation from ChEMBL. Through these insights, we hope to provide valuable solutions and starting points for cases where a more tailored dataset is required.
The discussion extends to diverse approaches for preparing data to ensure unbiased training and testing. By examining various methods, we aim to equip researchers with a toolkit to enhance the fairness and generalizability of their models.
In conclusion, this talk aims to emphasise the significance of thoughtful data curation, interrogation, and processing, providing practical guidelines and shifting focus from model-centric thinking to data-centric thinking as a starting point for model building in cheminformatics.
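As one illustration of the "unbiased training and testing" theme above, the sketch below keeps every member of a group (for example a scaffold or chemical series) on the same side of a train/test split, so that near-duplicates cannot leak between the two sets. The abstract does not say which methods the talk covers; the function name `group_split`, the group labels, and the 20% test fraction are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def group_split(groups, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) such that no group spans both sets."""
    by_group = defaultdict(list)
    for idx, label in enumerate(groups):
        by_group[label].append(idx)

    # Shuffle whole groups, not individual samples.
    keys = sorted(by_group)
    random.Random(seed).shuffle(keys)

    target = test_fraction * len(groups)
    train, test = [], []
    for key in keys:
        # Fill the test set first, one whole group at a time.
        if len(test) < target:
            test.extend(by_group[key])
        else:
            train.extend(by_group[key])
    return sorted(train), sorted(test)


# Hypothetical group labels, one per compound (e.g. a scaffold identifier).
groups = ["A", "A", "B", "C", "C", "C", "D", "E", "E", "F"]
train_idx, test_idx = group_split(groups)
```

Because groups are assigned whole, the realised test fraction is only approximately the requested one; that is the price of avoiding leakage between related compounds.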
========================
Can predictive models admit when they don’t know?
Damjan Krstajic, Director, Research Centre for Cheminformatics, Belgrade, Serbia
We are of the opinion that during the design of a binary classifier one ought to consider adding an “I don’t know” answer. We provide the case for the introduction of this third category when a human needs to make a decision based on the answer from a binary classifier. We discuss several approaches that may be used in this scenario. A procedure to define “I don’t know” predictions in binary classifiers, called all leave-one-out models (ALOOM), is presented as a proof of concept.
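The idea of an "I don't know" answer built from all leave-one-out models can be sketched as follows. This is a minimal illustration, not the speaker's implementation: it assumes that a prediction is returned only when every leave-one-out model agrees, and it uses a toy nearest-centroid classifier as a stand-in for whatever base learner is actually used. The helper names and the data are hypothetical.

```python
def fit_centroids(X, y):
    """Fit a toy nearest-centroid classifier: one mean vector per class."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def predict_one(centroids, x):
    """Return the label of the closest centroid (squared Euclidean distance)."""
    def dist(label):
        return sum((a - b) ** 2 for a, b in zip(centroids[label], x))
    return min(sorted(centroids), key=dist)


def aloom_predict(X, y, x_new):
    """Train one model per left-out sample; answer only if all models agree."""
    votes = set()
    for i in range(len(X)):
        X_loo = X[:i] + X[i + 1:]
        y_loo = y[:i] + y[i + 1:]
        votes.add(predict_one(fit_centroids(X_loo, y_loo), x_new))
    # Assumed rule: unanimity gives a label, any disagreement gives "I don't know".
    return votes.pop() if len(votes) == 1 else "I don't know"


# Hypothetical one-dimensional training data with two well-separated classes.
X = [[0.0], [1.0], [2.0], [8.0], [9.0], [10.0]]
y = [0, 0, 0, 1, 1, 1]

confident_low = aloom_predict(X, y, [1.5])
confident_high = aloom_predict(X, y, [9.0])
borderline = aloom_predict(X, y, [5.1])
```

In this toy data the decision boundary moves between roughly 4.75 and 5.25 depending on which sample is left out, so a query at 5.1 receives conflicting votes and is reported as "I don't know", while queries far from the boundary receive a label.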
========================
Development of molecular descriptors for quantitative structure-retention relationships
Madeleine Taylora, Roman Szucsb, Lucy Morganc, Roland Brownc, David Palmera
aPure and Applied Chemistry, University of Strathclyde, G1 1XL, UK; bDepartment of Analytical Chemistry, Faculty of Natural Sciences, Comenius University, Bratislava, Slovakia; cPfizer Global R&D, Sandwich, UK.
QSRR models are widely used in the pharmaceutical industry to help identify unknown compounds in HPLC screening experiments [1]. These models rely on high quality, relevant descriptors. Traditional descriptors are focused on solute features, but chromatographic retention is a phenomenon defined by solvation and partition interactions. Therefore, new molecular descriptors are developed that describe solvation structure using the reference interaction site model (RISM). The usefulness of these descriptors has been proven previously for predictions of solvation free energy [2], entropy of solvation, and enthalpy of solvation [3]. They are adapted for chromatography by modelling the chromatographic conditions including the mobile phases and an analogue of the stationary phase. Together, these describe the dynamic equilibrium in the column.
Datasets provided by Pfizer have been used to validate these descriptors. 1D RISM equations were solved for the analyte molecules in various solvents with pyRISM solver software [4]. Compared to a PLS model using Mordred descriptors alone, the addition of RISM descriptors for solutes in methanol increased R² from 0.515 to 0.717, and decreased RMSD from 1.01 to 0.77 min. Additionally, an outlier with an atypical chemical structure had its percentage error reduced from 97% to 34% with the addition of these physics-based descriptors. Together, the descriptors display synergy, which suggests that the information provided by RISM descriptors is complementary to that provided by the standard 2D descriptors. The QSRR methods to utilise these descriptors are also being developed, including more advanced machine learning algorithms and a procedure to predict differences in retention times.
[1] P. R. Haddad, M. Taraji and R. Szücs, Anal. Chem., 2021, 93, 228–256.
[2] D. S. Palmer, M. Mišin, M. V. Fedorov and A. Llinas, Mol. Pharm., 2015, 12, 3420–3432.
[3] D. J. Fowles and D. S. Palmer, Phys. Chem. Chem. Phys., 2023, 25, 6944–6954.
[4] A. Ahmad, 2AUK/pyRISM, DOI: 10.5281/zenodo.7783600, 2023.
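For readers less familiar with the figures of merit quoted in this abstract, the sketch below shows how R², RMSD and percentage error are commonly computed for predicted retention times. The abstract does not state which definition of R² was used; the coefficient of determination is assumed here, and the retention times are made-up values for illustration only, not data from the study.

```python
import math


def rmsd(observed, predicted):
    """Root-mean-square deviation, in the same units as the inputs."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def percentage_error(observed_value, predicted_value):
    """Absolute error of one prediction as a percentage of the observed value."""
    return 100.0 * abs(predicted_value - observed_value) / abs(observed_value)


# Made-up retention times in minutes, for illustration only.
observed = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.5]

model_rmsd = rmsd(observed, predicted)
model_r2 = r_squared(observed, predicted)
first_error = percentage_error(observed[0], predicted[0])
```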
=====================
Reverse Fingerprinting: Application to Motif Detection and Pharmacophore Query Generation
Markus Kossner, Principal Scientist & Scientific Services Manager, Chemical Computing Group, Köln, Germany
‘Reverse Fingerprinting’ is a method that uses feature list fingerprints to detect differentiating structural elements in small molecule and protein datasets. This talk covers the theory of reverse fingerprinting and presents examples of its application to detecting important structural motifs, coloring atoms by activity contribution, generating 3D pharmacophore queries and identifying liability regions in protein structures.
=====================
Seedling: a scoring and generation framework for protein-ligand co-folding
Finlay MacLean, Charm Therapeutics, Cambridge, UK
In place of manually designing molecules one-by-one, generative chemistry approaches such as REINVENT promise to help medicinal chemists probe the chemical space associated with their design hypotheses. While much work has been done on developing de novo molecular generation algorithms based on machine learning, current approaches rely on simple scoring functions such as QSAR models, docking, and simple properties such as QED and logP. This greatly limits the power of such methods.
At Charm we have built a state-of-the-art molecule optimisation platform, Seedling, that incorporates molecular dynamics as well as our proprietary protein-ligand co-folding algorithm, DragonFold, to perform structure-based molecular generation.
In Seedling, a suite of generators “grows” chemically reasonable designs based on “seed” molecules. Using a distributed platform to perform thousands of computations in parallel, these designs are then scored by an arsenal of physics-based simulations and machine learning models. Molecules are selected for expensive simulations via an active learning strategy, allowing us to efficiently search chemical space for the most promising molecules. This platform enables our experts to formulate complex hypotheses and return the next day to evaluate the most promising ideas.