Welcome to iModulonDB! This is a web-based tool for accessing a database of transcriptomic dataset decompositions. If you are a biologist interested in what machine learning can tell us about the regulation of bacterial gene expression, this site provides tools for exploring those results.
All living things need to adapt their gene expression to the situation they are in. The transcriptional regulatory network (TRN) is the system within cells that enables this. For instance, consider the gene glpT in E. coli, which is used to import a specific type of food, glycerol. When glycerol is present in the cell’s media, components of the TRN will sense it and respond by upregulating glpT so that the cell can import the glycerol and use it for growth. The TRN controls metabolism, stress responses, growth, and nearly every other biological process. Understanding how it works is therefore of particular scientific interest.
Traditional methods for studying the TRN have focused on individual genes and regulators, especially in model organisms. Wet lab microbiologists have spent decades deleting or overexpressing particular genes, making hypotheses about the effects, and testing them in drill-down studies. In this way, researchers have built a bottom-up understanding of how the TRN works. This approach is very important for making confident claims about individual systems, but it is time-consuming, costly, and often fails to support mathematical models of global gene expression.
Thus, we have developed a new approach which creates a top-down perspective on the TRN. We start with a transcriptomic dataset, which measures the expression of each gene under several conditions. There are thousands of genes in any given organism, and we hope to understand as much of the data as possible – therefore, we need to use machine learning to look for patterns in that dataset which give us valuable insight into the underlying regulation.
Why study bacterial transcriptional regulation?
This is a transcriptomic dataset. Each column represents an experimental condition that cells were subjected to, and each row represents a gene. Each element is an expression value indicating how active the gene is under the given condition (typically measured using RNA-sequencing or microarrays). This dataset has been normalized such that the entire left column, representing the baseline condition of simple growth on glucose, is zero (white), and positive and negative values in other elements indicate that the gene is more or less expressed than it is in the baseline. At iModulonDB, we generate some of the datasets we analyze in-house and download many of them from online sources such as the Sequence Read Archive.
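The baseline normalization described above can be sketched in a few lines. This is a minimal illustration, not the site's actual pipeline: the expression values are made up, the condition names are placeholders, and it assumes the baseline is a single column that is subtracted from every other column of log-scale expression values.

```python
import numpy as np

# Toy log-scale expression values (made up for illustration).
# Rows are genes, columns are conditions; the first column is the baseline.
genes = ["glpT", "glpA"]
conditions = ["glucose_baseline", "glycerol", "acid"]  # placeholder names
log_expr = np.array([
    [2.0, 7.5, 2.3],
    [1.5, 6.8, 1.4],
])

# Subtract the baseline column from every column (assumed normalization).
# Indexing with [0] inside a list keeps the result 2D so it broadcasts per gene.
baseline = log_expr[:, [0]]
X = log_expr - baseline

for gene, row in zip(genes, X):
    print(gene, dict(zip(conditions, row.round(2))))
```

After this step the baseline column is exactly zero, and every other entry reads as "more or less expressed than in the baseline" for that gene.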
Each of these plots represents a row from the dataset as a bar graph. We show two related genes: glpT, which imports glycerol, and glpA, which helps break down glycerol. The x axis labels are names given to the various projects in which the samples were collected (for example, "Acid" refers to a project done in acidic media with deletion of various acid response regulators). Hovering over the tallest bars reveals which conditions cause each of the genes to be upregulated. You will notice that in both graphs, activity is highest when glycerol is the carbon source. It would be convenient to treat both of these genes as part of a single glycerol-consumption unit in the transcriptome - the goal of our approach is to find all such units with unsupervised machine learning.
Data scientists have developed machine learning algorithms that can address the problems described above. Unsupervised machine learning algorithms can identify patterns and structures underlying big datasets by simply using the information in the dataset itself. Independent Component Analysis (ICA) is one such method, and it has outperformed many other algorithms in detecting co-regulated gene sets.
By running the ICA algorithm on a transcriptomic dataset (see our GitHub), we generate a set of iModulons. Each iModulon is a group of genes that represents an independently modulated signal, which the cell is probably controlling using the same or related regulators. Mathematically, an iModulon has a weight for each gene and an activity for each condition. The highly weighted genes are iModulon members, and the highly active conditions are those in which the iModulon is likely performing a function. We characterize an iModulon by interpreting its gene members and activity levels. For example, the GlpR iModulon contains all the genes that are associated with digesting glycerol, and it is active when glycerol is present in the media. We named it 'GlpR' because that is the name of the transcription factor that co-regulates all of its genes.
X: Original transcriptomic dataset. We make the assumption that the X matrix results from a mix of underlying signals (iModulons) controlled by regulators like transcription factors (TFs), and we use the ICA algorithm to identify those signals.
M: Links genes to iModulons. A gene that is highly weighted is said to be a “member” of the iModulon, and all iModulon members are expressed as a group. If the iModulon is a sports team, the M matrix defines who the players are.
A: Links iModulons to conditions. If an iModulon is highly active in a given condition, it is probably carrying out a function that is important in that condition. If the iModulon is a sports team, the A matrix describes its playbook.
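The relationship between the three matrices can be written as X ≈ M · A, with X of shape genes × conditions, M of shape genes × iModulons, and A of shape iModulons × conditions. The sketch below illustrates those shapes with made-up numbers. The genes "geneX" and "geneY", the second iModulon, and the membership cutoff of 0.5 are all hypothetical; the actual thresholding procedure is not described here.

```python
import numpy as np

genes = ["glpT", "glpA", "geneX", "geneY"]   # geneX/geneY are placeholders
conditions = ["glucose", "glycerol", "acid"]

# M: one column of gene weights per iModulon (made-up values).
M = np.array([
    [0.9, 0.0],
    [0.8, 0.1],
    [0.0, 0.7],
    [0.1, 0.0],
])

# A: one row of condition activities per iModulon (made-up values).
# Activities are zero in the baseline (glucose) column.
A = np.array([
    [0.0, 5.0, 0.2],
    [0.0, 0.1, 4.0],
])

# Expression is modeled as the sum of each iModulon's contribution.
X = M @ A  # genes x conditions

# Call a gene a "member" when its absolute weight exceeds a cutoff.
threshold = 0.5  # illustrative cutoff, an assumption
members = {
    k: [g for g, w in zip(genes, M[:, k]) if abs(w) > threshold]
    for k in range(M.shape[1])
}
print(members)
```

In this toy example the first iModulon plays the role of a glycerol unit: glpT and glpA are its members, and its activity is high only in the glycerol condition.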
ICA was originally developed in the 1980s to solve the blind source separation problem, also known as the cocktail party problem. Imagine you place microphones around a crowded, noisy room. Each microphone would pick up different combinations of each speaker. If we apply ICA to the resultant set of recordings, we can identify the original source signals (M) without any other information. In addition, ICA infers the volume of each source in each microphone-measured signal (A).
Similarly, a transcriptomic dataset acts like microphones into the cell, measuring the combined effects of different transcriptional regulators with various condition-specific activities. The regulators/iModulons are behaving independently in the cell, the same way that the people in the room behave independently.
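The cocktail party analogy can be tried on a toy problem. The sketch below mixes two independent, non-Gaussian sources into two "microphones" and separates them by whitening the recordings and then searching for the rotation that maximizes the total absolute excess kurtosis. This is a deliberately simple two-source ICA variant chosen for readability, not the algorithm or implementation used to compute the iModulons on this site; the sources, the mixing matrix, and all function names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 5000

# Two independent, non-Gaussian "speakers" (the columns of M).
M_true = np.column_stack([
    rng.uniform(-1, 1, n_samples),
    rng.laplace(0, 1, n_samples),
])

# Hypothetical volumes of each speaker (rows) at each microphone (columns).
A_true = np.array([[1.0, 0.6],
                   [0.4, 1.0]])
X = M_true @ A_true  # samples x microphones

def whiten(data):
    """Center the data and rescale it to identity covariance."""
    centered = data - data.mean(axis=0)
    cov = centered.T @ centered / len(centered)
    eigvals, eigvecs = np.linalg.eigh(cov)
    return centered @ eigvecs / np.sqrt(eigvals)

def excess_kurtosis(signals):
    """Column-wise excess kurtosis (zero for a Gaussian signal)."""
    z = (signals - signals.mean(axis=0)) / signals.std(axis=0)
    return (z ** 4).mean(axis=0) - 3.0

def rotation(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

# After whitening, the unmixing step is a rotation (up to order and sign),
# so a search over a quarter turn is enough for two sources.
Z = whiten(X)
angles = np.linspace(0, np.pi / 2, 181)
scores = [np.abs(excess_kurtosis(Z @ rotation(t))).sum() for t in angles]
M_est = Z @ rotation(angles[int(np.argmax(scores))])

# Recover the mixing weights (A) by least squares on the centered recordings.
X_centered = X - X.mean(axis=0)
A_est, *_ = np.linalg.lstsq(M_est, X_centered, rcond=None)

# Compare recovered and true sources (order and sign are arbitrary in ICA).
corr = np.abs(np.corrcoef(M_est.T, M_true.T)[:2, 2:])
print(corr.round(3))
```

Each true source should correlate strongly with exactly one recovered component. As with any ICA, the order, sign, and scale of the recovered components are arbitrary, which is why the comparison uses absolute correlations.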
iModulon structures are computed from datasets, so the quality and breadth of the data are extremely important. For our original analysis, we developed PRECISE-278 (formerly PRECISE 1), the Precision RNA-seq Expression Compendium for Independent Signal Extraction with 278 diverse samples from E. coli K-12. We have also developed PRECISE datasets in other organisms, such as S. aureus.
One of the strengths of ICA is that it requires only transcriptomic data, which means we can also re-use existing, publicly available data. We analyzed a single-lab microarray dataset on B. subtilis and showed that different datasets and transcriptomic technologies generally return similar iModulons.
Next, we sought to apply this approach across the evolutionary tree of prokaryotes. We scraped the Sequence Read Archive for datasets from its most popular prokaryotic strains, and combined all high-quality data for analysis. Since this works towards the goal of finding the set of all computable iModulons, we named these projects “Modulome”. iModulonDB rapidly expanded in its first two years thanks to this project. See our home page for the list of all organisms we have analyzed so far.
The purpose of this site is to share the powerful top-down approach of ICA with other systems biologists and microbiologists. We hope that searching for the genes and functions relevant to your research will point you toward iModulons that expand your understanding of which genes are most important to your application. We are currently working on decomposing additional datasets to compute transcriptional regulatory networks across many organisms, which will advance our knowledge significantly in the age of big data.
To use this site, select your favorite organism from the list on our home page. This will take you to a dataset page, which contains a list of the iModulons we have computed and characterized, as well as the publication in which we describe the set. Click on a row in the iModulon table to see its dashboard, where you can learn about its gene members, activity, and regulator enrichments.
Alternatively, select 'Gene Search' from the dataset page and type in your genes of interest. This will bring you to a similar dashboard, listing the most relevant iModulons for your gene. Note that some genes are removed prior to running ICA if they are never expressed or are shown to be extremely noisy within conditions; if that is the case, your gene will not show up in our search.
For a description of each of the figures shown on the iModulon dashboards, hover over the various components of the example “MalT” iModulon below:
For questions, comments, feedback, or to collaborate with us, please send an email to Kevin Rychel (imodulondb@ucsd.edu).
For more information on the Systems Biology Research Group (SBRG) at the University of California, San Diego, please see our website here.