Chemistry Patent Data in the Public Domain
Anne Hersey, EMBL-EBI Hinxton
At the end of last year EMBL-EBI took over the running and maintenance of the chemistry patent database SureChem, previously developed by Digital Science. This means that this resource, which we have renamed SureChEMBL, will now be freely available to everyone via this link: http://www.ebi.ac.uk/surechembl. The ChEMBL Group is particularly excited by this acquisition and we hope you will be too. Ever since we started visiting people and talking about ChEMBL at conferences, the question we were constantly asked was whether we had plans to extract chemistry data from patents. So here it is.
How does it work? SureChEMBL takes continuous feeds of data from the main patent offices and uses “name to structure” and “image to structure” software to identify the chemical entities in the full patent text. These structures are then stored in a chemistry-aware database and are available for substructure and similarity searching, as well as text-based searches. For compounds found in a search, SureChEMBL provides links to the full-text patent PDFs.
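As a rough illustration of what similarity searching over stored structures involves, here is a minimal Python sketch. It is not SureChEMBL's implementation: it assumes each structure has already been reduced to a fingerprint (a set of integer feature identifiers) and uses the Tanimoto coefficient, a common choice for chemical similarity that this article does not confirm. The names `tanimoto`, `similarity_search`, the compound labels and the 0.7 threshold are all hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto coefficient: shared features divided by total distinct features."""
    union = len(a | b)
    if union == 0:
        # Two empty fingerprints: treat as no measurable similarity.
        return 0.0
    return len(a & b) / union


def similarity_search(
    query: FrozenSet[int],
    library: Dict[str, FrozenSet[int]],
    threshold: float = 0.7,  # hypothetical cut-off, chosen for illustration
) -> List[Tuple[str, float]]:
    """Return (name, score) pairs scoring at or above the threshold, best first."""
    scored = ((name, tanimoto(query, fp)) for name, fp in library.items())
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy fingerprints; real ones would come from a cheminformatics toolkit.
library = {
    "compound_a": frozenset({1, 2, 3, 4}),
    "compound_b": frozenset({1, 2, 3, 9}),
    "compound_c": frozenset({7, 8}),
}
hits = similarity_search(frozenset({1, 2, 3, 4}), library, threshold=0.5)
```

In this toy example the query matches `compound_a` exactly and shares three of five distinct features with `compound_b`, so both are returned, while `compound_c` shares nothing and is dropped.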
So how many compounds are in SureChEMBL? The full database contains about 15 million structures, of which about 5.3 million have the following “drug-like” properties: a molecular weight between 300 and 800, at least one ring, no more than 15 rotatable bonds, no bad valencies, no undesirable structural toxicophores, and fewer than 100,000 occurrences in the annotated patent corpus.
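The “drug-like” criteria listed above can be read as a simple filter. The Python sketch below shows one way to express them, assuming the descriptors have already been computed for each structure. The `CompoundRecord` class, its field names and `is_drug_like` are hypothetical, and treating the molecular weight bounds as inclusive is an assumption, since the text does not say.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompoundRecord:
    """Hypothetical precomputed descriptors for one extracted structure."""
    mol_weight: float
    ring_count: int
    rotatable_bonds: int
    has_bad_valence: bool
    has_toxicophore: bool
    corpus_occurrences: int


def is_drug_like(c: CompoundRecord) -> bool:
    """Apply the six criteria from the text; inclusive weight bounds are assumed."""
    return (
        300 <= c.mol_weight <= 800            # molecular weight between 300 and 800
        and c.ring_count >= 1                 # at least one ring
        and c.rotatable_bonds <= 15           # <=15 rotatable bonds
        and not c.has_bad_valence             # no bad valencies
        and not c.has_toxicophore             # no undesirable toxicophores
        and c.corpus_occurrences < 100_000    # occurs <100,000 times in the corpus
    )


# Two made-up records: one passes, one fails on molecular weight alone.
candidate = CompoundRecord(412.5, 3, 6, False, False, 42)
too_small = CompoundRecord(180.2, 1, 2, False, False, 42)
```

Here `is_drug_like(candidate)` is true, while `is_drug_like(too_small)` is false because the molecular weight falls below 300.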
Our first priority is to complete the migration of the various components of SureChEMBL and then to give users full access to the system. We envisage this will take a few more months, after which we will work on an EBI look and feel for the interface.
We have lots of ideas for future developments, including tagging the patents with information about diseases and targets, backfilling the database with pre-2006 chemistry data extracted from images, and enhancing access using workflow tools. We are still exploring the potential of the database ourselves and, to a certain extent, we will do what our users want (and our funding allows), so we would appreciate your feedback and suggestions for SureChEMBL’s future development. You can find out more about SureChEMBL here.