George Papadatos, ChEMBL group, EMBL-EBI, European Bioinformatics Institute, Hinxton, UK
Workflow systems have become an important everyday tool that allows both experimental and computational scientists to deal with the current data explosion. Such specialised workflow systems include commercial implementations, such as BioVia’s Pipeline Pilot, as well as freely available or Open Source ones, such as Taverna, Kepler, Galaxy, Orange and KNIME. The versatility of these tools makes them ideal for the inherently decentralised world of contemporary life science research, including drug discovery, systems biology and the various “-informatics” and “-omics” disciplines.
The main advantage of KNIME (KoNstanz Information MinEr) is that it provides a user-friendly and intuitive graphical user interface, which enables the user to generate and store complex workflows for data mining, analytics and decision making, with little or even no need for computer programming skills. Instead, the user can opt to select standardised nodes from a node repository and connect them together, in a process also known as “visual programming”. The application domains vary widely from standard data manipulation to sophisticated text, image and graph mining, along with machine learning, computational chemistry and chemoinformatics. In the context of industrial drug discovery, this makes KNIME an attractive platform for computational chemists, who use it to glue together diverse and disparate types of available resources, such as databases, flat files, web services, third party software applications and legacy command line tools, and deploy prototype workflows and tools to medicinal chemists. These end users can then easily modify, review and run these workflows on their own, thus making KNIME a “common ground” between the two disciplines; a currently well-established paradigm in several pharmaceutical companies.
Another use of KNIME is as a regulated framework in which academic, industrial and not-for-profit groups can develop their tools and algorithms and share them freely with the scientific community in the form of node collection contributions. Novartis (RDKit), Eli Lilly (Erl Wood Knime Posts), Vernalis (Vernalis Knime nodes), Max Planck Institute (MPI tools), OpenPHACTS (OpenPHACTS Knime Resource) and EMBL-EBI (EBI Knime Extensions) are examples of popular community contributions in the life sciences domain. At the same time, commercial software vendors often wrap their algorithms as KNIME nodes and license them to their customers. Having standardised nodes together with easily shared and transparent workflows on a freely available platform evidently boosts scientific collaboration and reproducibility (see here for an excellent example).
Are there any disadvantages? Of course! KNIME has a rather steep learning curve and beginners usually need to spend some time with it before they can appreciate how it all works. Furthermore, nodes are often treated as “black boxes”, so users can be disconnected from the underlying algorithms, which may lead to misuse of some methodology. Despite the very large number of nodes available, there are times when the functionality that you are looking for, e.g. manipulation and standardisation of chemical structures, is simply not there. In such cases, one may have to develop one's own custom nodes. However, as KNIME integrates nicely with popular scripting languages such as Perl, Python and R, developing new functionality is not that difficult.
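To give a flavour of the kind of logic one might place inside a scripting node, here is a minimal Python sketch of a very crude structure "standardisation" step: keeping the largest fragment of a dot-separated SMILES string, for instance to drop a counter-ion. The function names, the `smiles` and `parent_smiles` column names and the list-of-dictionaries table representation are all assumptions made for illustration. How a KNIME scripting node actually hands a table to a script depends on the KNIME version and is not shown here. Character count is only a rough proxy for fragment size, so a real workflow would more likely rely on a cheminformatics toolkit such as RDKit.

```python
from typing import Dict, List


def largest_fragment(smiles: str) -> str:
    """Return the longest dot-separated fragment of a SMILES string.

    Assumption: string length is used as a crude stand-in for fragment
    size. This is NOT chemically rigorous; it only illustrates the idea.
    """
    fragments = [frag for frag in smiles.split(".") if frag]
    if not fragments:
        return ""
    # max() returns the first fragment in case of a tie
    return max(fragments, key=len)


def add_parent_column(rows: List[Dict[str, str]],
                      smiles_column: str = "smiles") -> List[Dict[str, str]]:
    """Return copies of the rows with an extra 'parent_smiles' column.

    The rows stand in for a table; the column names are hypothetical.
    """
    result = []
    for row in rows:
        new_row = dict(row)  # copy, so the input rows are left untouched
        new_row["parent_smiles"] = largest_fragment(row.get(smiles_column) or "")
        result.append(new_row)
    return result


# Hypothetical two-row input table
table = [
    {"name": "sodium acetate", "smiles": "CC(=O)[O-].[Na+]"},
    {"name": "benzene", "smiles": "c1ccccc1"},
]
cleaned = add_parent_column(table)
for row in cleaned:
    print(row["name"], row["parent_smiles"])
```

Keeping the logic in small, plain functions like these makes it easy to test outside KNIME before pasting it into a node.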
At the ChEMBL group, we provide several examples of workflows that combine the open data analysis capabilities of KNIME with our open data, resources and tools, such as the ChEMBL web services, UniChem and myChEMBL. Please get in touch if you’d like more information on these. We also run a drug discovery course every year that includes a hands-on workshop on KNIME.