Modern high throughput biology produces more data than can be analyzed, and the central challenges of modern biology are the statistical interpretation and integration of these data. The researchers and engineers in the Computational Biology group devise novel computational techniques to comprehend high dimensional biology and enable high throughput biomedical research.
Over the last ten years, biology and the biomedical sciences have seen an impressive increase in the size of the data collected as part of routine research projects. The increase in the amount and complexity of such data has led some to call it a data deluge. Indeed, we have reached a situation where the sheer volume of data being produced overwhelms the capacity of individual researchers and research groups to manage, analyze, and extract meaningful information from them. This revolution is shifting biomedical research towards the quantitative side of science, and has been driven by the technological breakthroughs that today allow us to sequence whole genomes, quantify the near-complete set of transcripts or proteins, measure epigenetic modifications across whole genomes, and assay proteins for post-translational modifications, interactions, and localization. But the question remains: what to do with all these data?
To illustrate how large amounts of data can be difficult to deal with, let's use Lego bricks as an analogy. In a classic biological experimental setup, researchers would focus on a particular gene or a small set of genes of interest. They would design an experimental setup to address their specific question, run the experiments and, after collecting the data, (mostly) manually analyze them. They would draw conclusions that would either support or refute their working model and follow up by designing the next set of experiments accordingly. From a Lego point of view, this corresponds to acquiring a Lego box as we know it, i.e. one containing all the blocks needed to build the model (and only those blocks) together with a precise and accurate building plan. There is no need for any special tool to find the blocks and figure out how to assemble them; even for relatively large sets, given enough time, it's easy enough to follow the instructions and produce the final product.
Now imagine that either the blocks or the building plan are messed up.
Imagine that your box contains many more blocks than you actually need, that some of the blocks needed for your final product are missing (and that you don't know whether they are, or which ones are missing), and that the blocks aren't sorted into little sachets but provided in a single huge bag, with potentially several orders of magnitude more blocks than required. Now imagine that the instructions are missing steps or pages at random, or are missing completely, but that you have some idea of what needs to be built. In such situations, you'll need algorithms and tools that will automatically sort the blocks and arrange them by size, color, shape, and so on, and algorithms that will tell you which pieces are most likely relevant for the model you want to build.
This sounds like a hopeless situation, but it's not. There are countless opportunities in acquiring a lot of data, even when the actual building instructions are missing. Indeed, the extra blocks aren't random; they are part of something bigger. Imagine you originally wanted to build the Millennium Falcon from the Star Wars movies, and that the reason you want to build that ship is that you are interested in the technology of the Rebel Alliance, or even in the whole Star Wars universe. Even if the extra blocks aren't directly relevant to building the Millennium Falcon, they might provide precious information about the technology that was available when the ship was built. With the right algorithms, you might be able to build your ship and collect additional information about the Star Wars universe. Or, even if you don't manage to build the whole ship and achieve only a partial, incomplete product, the additional information might reveal much more about the Lego Star Wars universe than focusing solely on the one ship would.
Methods that consider the Lego blocks only, without any additional information (such as whether some blocks are used to build Rebel or Empire ships, or parts of the instruction manual), are termed “unsupervised”. Such methods could be used to group all the blocks and identify clusters of blocks with similar features. If additional information is at hand, such as which type of ship a block is used for, and we want to classify a new block according to the type of ship it belongs to, one refers to a “supervised” analysis. Given the sheer number of blocks at hand, we would also want to summarize our collection by counting how many blocks of each type we have and visually representing this diversity.
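To make the distinction concrete, here is a minimal sketch in Python, staying within the Lego analogy. All the specifics are invented for illustration: blocks are described by hypothetical (length, width) stud counts, and the fleet labels are made up. The unsupervised step groups and counts identical blocks from their features alone; the supervised step uses labeled examples to assign a new block to a fleet.

```python
from collections import Counter
import math

# Hypothetical blocks, each described by (length, width) in studs.
blocks = [(2, 4), (2, 4), (1, 2), (2, 2), (1, 2), (2, 4), (1, 1)]

# Unsupervised: group the blocks by their features alone and summarize.
inventory = Counter(blocks)
print(inventory.most_common(1))  # the most abundant block type

# Supervised: labeled examples record which fleet each block type
# was used for (these labels are invented for the sake of the example).
labeled = {(2, 4): "rebel", (2, 2): "rebel",
           (1, 1): "empire", (1, 2): "empire"}

def classify(block):
    """Assign a new block to the fleet of its nearest labeled block."""
    nearest = min(labeled, key=lambda known: math.dist(known, block))
    return labeled[nearest]

print(classify((1, 1)))  # -> empire
```

Real analyses replace the exact-match grouping with clustering and the nearest-neighbour rule with a trained classifier, but the division of labor between the two kinds of methods is the same.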
In bioinformatics, the blocks would typically be replaced by quantitative measurements of the abundance of biological entities, such as transcripts or proteins. The annotation for a supervised analysis would describe whether the samples are wild-type cell lines or come from healthy donors or, on the contrary, are cells exposed to a particular drug of interest, cells with a missing gene, or samples from patients suffering from a specific disease.
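A toy sketch of what such data look like in practice, with every sample name, gene name, and abundance value made up for illustration: a small matrix of transcript abundances plus a sample annotation, used to rank genes by how much their mean abundance changes between conditions.

```python
# Hypothetical transcript abundances; rows are samples, columns are
# the genes listed below. All numbers are invented for illustration.
genes = ["geneA", "geneB", "geneC"]
samples = {
    "wildtype_1": [10.1, 5.2, 0.9],
    "wildtype_2": [9.8, 5.0, 1.1],
    "treated_1": [3.2, 5.1, 8.7],
    "treated_2": [2.9, 4.9, 9.0],
}
# The annotation used in a supervised analysis: each sample's condition.
annotation = {"wildtype_1": "wildtype", "wildtype_2": "wildtype",
              "treated_1": "treated", "treated_2": "treated"}

def mean_profile(condition):
    """Per-gene mean abundance over all samples of one condition."""
    rows = [v for s, v in samples.items() if annotation[s] == condition]
    return [sum(col) / len(col) for col in zip(*rows)]

wt, tr = mean_profile("wildtype"), mean_profile("treated")
# Rank genes by the absolute change in mean abundance after treatment.
changes = sorted(zip(genes, (t - w for w, t in zip(wt, tr))),
                 key=lambda gene_diff: abs(gene_diff[1]), reverse=True)
print(changes[0])  # the gene responding most strongly to treatment
```

Real datasets have thousands of genes and many samples, and proper differential-abundance analyses model variability rather than compare raw means, but the pairing of a measurement matrix with a sample annotation is exactly this.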
Another important feature of the data, and of the nature of modern, high throughput biology, is that the questions now asked have shifted from unequivocal and universal to context-specific, probabilistic, and definition-dependent (see Quincey Justman for an insightful discussion of this). The complexity of what we measure and what we ask requires us to accept that certainties and determinism are replaced by probabilities and uncertainties, which need to be quantified to acquire confident knowledge.
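One common way to quantify such uncertainty is a permutation test. The sketch below, with made-up abundance values for a single transcript in two groups of samples, asks how often a random relabeling of the samples produces a group difference at least as large as the observed one; the resulting p-value expresses the finding as a probability rather than a certainty.

```python
import random

# Made-up abundances of one transcript in two groups of four samples.
control = [10.1, 9.8, 10.4, 9.9]
treated = [12.8, 13.1, 12.5, 13.0]
observed = sum(treated) / 4 - sum(control) / 4

# Permutation test: shuffle the group labels many times and record how
# often chance alone yields a difference as extreme as the observed one.
rng = random.Random(0)  # seeded so the sketch is reproducible
pooled = control + treated
n_extreme = 0
n_perm = 10_000
for _ in range(n_perm):
    rng.shuffle(pooled)
    diff = sum(pooled[4:]) / 4 - sum(pooled[:4]) / 4
    # small tolerance guards against floating point rounding noise
    if abs(diff) >= abs(observed) - 1e-9:
        n_extreme += 1
p_value = (n_extreme + 1) / (n_perm + 1)
print(f"observed difference: {observed:.2f}, p-value: {p_value:.4f}")
```

With these invented numbers the two groups barely overlap, so the estimated p-value is small (around 0.03, the fraction of relabelings that happen to reproduce the split): the difference is unlikely to be a fluke, but the conclusion is stated as a probability, not a certainty.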
This is the figurative situation modern biomedicine is in: tremendous potential to gain a much broader picture of the whole cell, organ, or body, but at the cost of complexity in what we measure, and the need for bespoke methods to sort and manage the data we acquire and to analyze and understand it. That is the role of bioinformatics and computational biology, i.e. to devise ways to understand complex biological data so as to comprehend complex biological processes.