Involved group members: Miikka Vikkula, Raphaël Helaers
We host the UCL microarray platform (Affymetrix), used by several groups in the de Duve Institute and UCL for expression profiling as well as genotyping. This platform is complemented by High Throughput Sequencing equipment. Funded by the Fondation Contre le Cancer, it consists of a SOLiD 5500XL sequencer (Life Technologies), a Personal Genome Machine (Ion Torrent, Life Technologies), a Proton (Ion Torrent, Life Technologies) and a computing cluster for bioinformatic processing. This equipment allows us to perform Exome-seq, Genome-seq, RNA-seq, Small RNA profiling, ChIP-seq and methylation studies. Data produced by Life Technologies equipment are analyzed using the manufacturer's software (LifeScope, Torrent Suite) combined with open source packages (BWA, GATK, SnpEff). Complete analysis pipelines are also available for other technologies (e.g. Illumina). Downstream evaluation and prioritization of variants is performed using “Highlander”, a package developed in-house that integrates several in-silico analysis programs and utilities with a user-friendly graphical interface. This enhances our ability to identify and explore the genetic and epigenetic bases of disease.
HIGHLANDER, SOFTWARE FOR EASY VARIANT FILTERING
Next generation sequencing produces massive amounts of data. Targeted exome sequencing can be completed in a few days using NGS, allowing for new variant discovery in a matter of weeks. The technology generates considerable numbers of false positives, and the differentiation of sequencing errors from true mutations is not a straightforward task. Moreover, the identification of changes-of-interest from amongst tens of thousands of variants requires annotation drawn from various sources, as well as advanced filtering capabilities.
We developed Highlander, a Java application coupled to a MySQL database, in order to centralize all variant data and annotations from the lab, and to provide powerful filtering tools that are easily accessible to the biologist. Data can be generated by any NGS machine (such as Illumina’s HiSeq, or Life Technologies’ SOLiD or Ion Torrent) and most variant callers (such as Broad Institute’s GATK or Life Technologies’ LifeScope). Variant calls are annotated using dbNSFP (providing predictions from six different programs, and MAF from 1000G and ESP), GoNL and SnpEff, and are subsequently imported into the database. The database is used to compute global statistics, so that variants can be discriminated based on how often they appear in the database. The Highlander GUI makes it easy to run complex queries on this database, using shortcuts for certain standard criteria, such as “sample-specific variants”, “variants common to specific samples” or “combined-heterozygous genes”. Users can browse through query results using sorting, masking and highlighting of information. Highlander also gives access to useful additional tools, including direct access to IGV, and an algorithm that checks all available alignments for allele-calls at specific positions.
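To make the standard criteria mentioned above more concrete, here is a minimal Python sketch of what “variants common to specific samples”, “sample-specific variants” and the representation of a variant in the database can mean in terms of set operations. It is an illustration only, not Highlander's implementation (which is a Java application querying MySQL); the sample names, variant tuples and function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical variant calls: (sample, chromosome, position, ref, alt).
calls = [
    ("patient_1", "1", 12345, "A", "G"),
    ("patient_1", "2", 500, "C", "T"),
    ("patient_2", "1", 12345, "A", "G"),
    ("patient_2", "7", 900, "G", "A"),
    ("control_1", "2", 500, "C", "T"),
]

def variants_by_sample(calls):
    """Group variant keys (chrom, pos, ref, alt) by the sample carrying them."""
    by_sample = defaultdict(set)
    for sample, chrom, pos, ref, alt in calls:
        by_sample[sample].add((chrom, pos, ref, alt))
    return by_sample

def common_variants(by_sample, samples):
    """Variants found in every one of the listed samples."""
    sets = [by_sample.get(s, set()) for s in samples]
    return set.intersection(*sets) if sets else set()

def sample_specific_variants(by_sample, sample):
    """Variants found in the given sample and in no other sample."""
    others = set()
    for name, variants in by_sample.items():
        if name != sample:
            others |= variants
    return by_sample.get(sample, set()) - others

def database_frequency(by_sample, variant):
    """Fraction of samples in the (toy) database that carry the variant."""
    carriers = sum(1 for variants in by_sample.values() if variant in variants)
    return carriers / len(by_sample)

by_sample = variants_by_sample(calls)
shared = common_variants(by_sample, ["patient_1", "patient_2"])
private = sample_specific_variants(by_sample, "patient_2")
freq = database_frequency(by_sample, ("1", 12345, "A", "G"))
```

In a real database these operations would be expressed as queries over indexed tables rather than in-memory sets; the sketch only shows the logic of the criteria.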
EXCALIBUR, A GENETIC REGION CLASSIFIER
(Raphaël Helaers, Miikka Vikkula)
The goal of this project is to expand Highlander by developing and integrating a new classifier. In order to maximize the efficiency of our analyses and make the best use of publicly available biological data when studying the genetic underpinnings of human disease, the classifier will use three main strategies: the identification of regions of interest, a new statistical framework and, in order to integrate these, a dedicated machine learning algorithm. This approach is critical in the analysis of genetically heterogeneous and complex, multigenic disorders such as breast cancer, primary lymphedema, and cleft lip and/or palate, which are the subjects of large-scale sequencing projects in the lab. We will develop bioinformatics tools to exploit public and in-house exome and whole genome data using association testing and classification. First, we will consider the entire gene to be the unit of analysis (rather than individual variants, as most classifiers do). In a second layer, we will group genes based on their involvement in the same molecular/biological pathways. Finally, for whole genome data, we will need to define completely new units of analysis. We will then develop a new statistical framework for testing the association of variants within the genetic regions of interest thus defined. This will be followed by a pipeline that automatically generates hypotheses, collects all relevant data, and runs them through the statistical framework. A machine learning algorithm will be developed to rank competing hypotheses using the results from the pipeline. We will make this available as a free, stand-alone classification tool, and benchmark it against published and available data. Last but not least, we will integrate it into Highlander and apply it to use cases to identify novel associations of variants with disease. This project is the main topic of the PhD thesis of Simon Boutry.
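As an illustration of what treating the entire gene as the unit of analysis can look like, the Python sketch below collapses all variants of a gene into a single "carrier" status per individual and compares cases with controls using a one-sided Fisher exact test. This is a classical collapsing burden test, shown only to clarify the idea; it is not the statistical framework that the project sets out to develop. The gene names, sample identifiers and function names are hypothetical.

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Returns the probability of observing at least `a` carriers among cases,
    given fixed row and column totals (hypergeometric upper tail).
    """
    cases_total = a + b
    carriers_total = a + c
    n = a + b + c + d
    upper = min(cases_total, carriers_total)
    favourable = sum(
        comb(cases_total, k) * comb(n - cases_total, carriers_total - k)
        for k in range(a, upper + 1)
    )
    return favourable / comb(n, carriers_total)

def gene_burden(variant_calls, cases, controls):
    """Collapse variants per gene and test carrier enrichment in cases.

    variant_calls: iterable of (sample, gene) pairs, one per qualifying variant.
    cases, controls: sets of sample identifiers.
    """
    carriers_by_gene = {}
    for sample, gene in variant_calls:
        carriers_by_gene.setdefault(gene, set()).add(sample)

    p_values = {}
    for gene, carriers in carriers_by_gene.items():
        a = len(carriers & cases)      # case carriers
        c = len(carriers & controls)   # control carriers
        p_values[gene] = fisher_exact_greater(
            a, len(cases) - a, c, len(controls) - c
        )
    return p_values

# Hypothetical toy data: 4 cases and 4 controls.
cases = {"c1", "c2", "c3", "c4"}
controls = {"k1", "k2", "k3", "k4"}
variant_calls = [
    ("c1", "GENE_A"), ("c2", "GENE_A"), ("c3", "GENE_A"), ("c4", "GENE_A"),
    ("c1", "GENE_B"), ("k1", "GENE_B"),
]
p_values = gene_burden(variant_calls, cases, controls)
```

The same collapsing step extends naturally to the second layer described above: replacing the gene label with a pathway label groups carriers by pathway instead of by gene.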
OTHER BIOINFORMATICS PROJECTS
We are currently developing a comprehensive LIMS (Laboratory Information Management System) that will allow us to efficiently manage data from our biobanks, linking samples, patients, clinical information, experiments, results and publications.
Some projects were also developed by master's and bachelor's students for their theses:
- Simon Boutry (Master of Civil Engineering, applied mathematics, UCL) explored different statistical frameworks for association testing on sets of genetic variants, and developed an R package to classify them.
- Gaël Leroy (Bachelor of Business Computing, Haute École Léonard de Vinci, Institut Paul Lambin) implemented a Java tool to annotate variants generated by Ion Torrent technology, allowing automated elimination of as many false positives and sequencing artefacts as possible.
- Hélène Libion (Master of Bioinformatics, ULB) developed a Java tool to explore lists of genetic variants aggregated by biological pathways.
We also participate in regional and national development of analysis pipelines. As bioinformatics becomes an essential component of genetic research, other in-silico projects will be implemented in the future.