21st century biology has developed large-scale methods for genomics, epigenomics and proteomics, and these methods generate vast amounts of data. This includes DNA sequences, gene mutations, epigenetic modifications, gene expression, post-transcriptional regulation, protein levels, drug-protein interactions, clinical parameters and much more. But while we generate lots of data, we lack methods to efficiently store, manage and analyze it.
We need better solutions to robustly link the knowledge from molecular omics data to clinically relevant covariates. In this challenge, we try to identify new drug targets by integrating public data sets from cancer-related omics experiments.
Research projects such as TCGA and ENCODE produce huge amounts of omics data for all kinds of biological samples and diseases. These data sets are very heterogeneous and span all the different levels of cellular activity. They are generated in different experiments and measure things on different scales. Currently, data integration is tedious and requires a lot of manual work and expert knowledge.
Before we can do anything with the data, we need to get an overview, see the connections and understand the relevant biological questions. For that, we have to clean, structure and integrate everything and enrich it with prior knowledge. This enables the first steps in data analysis, such as identifying relevant genes and disease-specific regulation. To facilitate this, we will develop new ways to store data and design NoSQL database models. Options include Elasticsearch, MongoDB, Cassandra and Neo4j.
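One way to picture such a model is as a property graph, where samples, genes and drugs are nodes and each measurement becomes a typed relationship carrying its own properties. The following is a minimal in-memory sketch in plain Python, not a schema for any of the databases named above. The node labels (Sample, Gene, Drug), the relationship types (HAS_MUTATION_IN, EXPRESSES, TARGETS) and all identifiers and values are assumptions made up for illustration; none are taken from TCGA or ENCODE.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    label: str  # hypothetical label, e.g. "Gene", "Sample" or "Drug"
    key: str    # identifier, unique within a label


@dataclass
class Graph:
    # Each edge is a tuple: (source, relationship type, target, properties).
    edges: list = field(default_factory=list)

    def relate(self, source, rel_type, target, **props):
        """Add a typed, directed edge with arbitrary properties."""
        self.edges.append((source, rel_type, target, props))

    def neighbours(self, node, rel_type):
        """Targets reachable from `node` via edges of type `rel_type`."""
        return [t for s, r, t, _ in self.edges if s == node and r == rel_type]


graph = Graph()
sample = Node("Sample", "sample-1")
gene = Node("Gene", "GENE_A")
drug = Node("Drug", "DRUG_X")

# Different data types attach to the same gene node, each with its own
# properties and scale (all values here are invented placeholders).
graph.relate(sample, "HAS_MUTATION_IN", gene, variant_type="missense")
graph.relate(sample, "EXPRESSES", gene, log2_fold_change=1.8)
graph.relate(drug, "TARGETS", gene)

mutated = graph.neighbours(sample, "HAS_MUTATION_IN")
```

Because every measurement hangs off a shared gene node, adding a further data type only means adding a new relationship type rather than changing an existing table layout.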
1. Develop a data model that can represent all the different data types
2. Get data sets for a couple of samples from TCGA/ENCODE and drug targeting data
3. Find genes that are relevant in specific cancer types
4. Identify molecules targeting these genes
5. Develop an API that allows flexible access to the data sets, e.g. for predictive analytics and machine learning
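Steps 3 to 5 can be sketched as a small query layer over already integrated records. The example below is a hedged illustration only: it assumes that "relevant" means "mutated in at least a given number of distinct samples of one cancer type", which is one possible criterion and not one prescribed by the challenge. The record layout, the function names (`relevant_genes`, `drugs_for`, `candidate_targets`) and all identifiers are hypothetical toy values.

```python
from collections import Counter

# Toy records standing in for integrated data; every identifier is made up.
MUTATIONS = [
    {"sample": "s1", "cancer_type": "TYPE_1", "gene": "GENE_A"},
    {"sample": "s2", "cancer_type": "TYPE_1", "gene": "GENE_A"},
    {"sample": "s2", "cancer_type": "TYPE_1", "gene": "GENE_B"},
    {"sample": "s3", "cancer_type": "TYPE_2", "gene": "GENE_C"},
]
DRUG_TARGETS = [
    {"drug": "DRUG_X", "gene": "GENE_A"},
    {"drug": "DRUG_Y", "gene": "GENE_C"},
]


def relevant_genes(cancer_type, min_samples=2):
    """Step 3: genes mutated in at least `min_samples` distinct samples."""
    # Deduplicate (sample, gene) pairs so a sample counts once per gene.
    pairs = {
        (m["sample"], m["gene"])
        for m in MUTATIONS
        if m["cancer_type"] == cancer_type
    }
    counts = Counter(gene for _, gene in pairs)
    return sorted(g for g, n in counts.items() if n >= min_samples)


def drugs_for(genes):
    """Step 4: map each gene to the drugs recorded as targeting it."""
    wanted = set(genes)
    result = {}
    for row in DRUG_TARGETS:
        if row["gene"] in wanted:
            result.setdefault(row["gene"], []).append(row["drug"])
    return result


def candidate_targets(cancer_type, min_samples=2):
    """Step 5: a single entry point that a larger API could expose."""
    return drugs_for(relevant_genes(cancer_type, min_samples))
```

Calling `candidate_targets("TYPE_1")` on the toy records returns the genes that pass the recurrence threshold together with their recorded drugs. In a real setup the two lists would be replaced by queries against the chosen database.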
Challenge owner: Martin Preusse (Neo4j)