Welcome to iModulonDB! This is a web-based tool for accessing a database of transcriptomic dataset decompositions. If you are a biologist interested in what machine learning can tell us about the regulation of bacterial gene expression, this site will provide very valuable tools for you.
All living things need to adapt their gene expression to the situation they are in. The transcriptional regulatory network (TRN) is the system within cells that enables this. For instance, consider the gene glpT in E. coli, which is used to import a specific type of food, glycerol. When glycerol is present in the cell’s growth medium, components of the TRN sense it and respond by upregulating glpT so that the cell can import the glycerol and use it for growth. The TRN controls metabolism, stress responses, growth, and nearly every biological process. It is therefore of particular interest in science to understand how it works.
Traditional methods for studying the TRN have focused on individual genes and regulators, especially in model organisms. Wet lab microbiologists have spent decades deleting or overexpressing particular genes, making hypotheses about the effects, and testing them in drill-down studies. In this way, researchers have built a bottom-up understanding of how the TRN works. This approach is very important for making confident claims about individual systems, but it is time-consuming, costly, and often fails to support mathematical models of global gene expression.
Thus, we have developed a new approach which creates a top-down perspective on the TRN. We start with a transcriptomic dataset, which measures the expression of each gene under several conditions. There are thousands of genes in any given organism, and we hope to understand as much of the data as possible – therefore, we need to use machine learning to look for patterns in that dataset which give us valuable insight into the underlying regulation.
Why study bacterial transcriptional regulation?
This is a transcriptomic dataset. Each column represents an experimental condition that cells were subjected to, and each row represents a gene. Each element is an expression value indicating how active the gene is under the given condition (typically measured using RNA-sequencing or microarrays). This dataset has been normalized such that the entire left column, representing the baseline condition of simple growth on glucose, is zero (white), and positive and negative values in other elements indicate that the gene is more or less expressed than it is in the baseline. At iModulonDB, we generate some of the datasets we analyze in-house and download many of them from online sources such as the Sequence Read Archive.
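The baseline normalization described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the expression values are invented, they are assumed to be on a log scale, and the helper name center_on_baseline is hypothetical rather than part of any iModulonDB tool.

```python
# Toy log-scale expression values: gene -> {condition: value}.
# The numbers are invented for illustration only.
log_expression = {
    "glpT": {"glucose": 5.0, "glycerol": 9.0, "acid": 4.5},
    "glpA": {"glucose": 3.0, "glycerol": 8.0, "acid": 3.5},
}


def center_on_baseline(expr, baseline):
    """Subtract each gene's baseline value from all of its conditions."""
    centered = {}
    for gene, by_condition in expr.items():
        reference = by_condition[baseline]
        centered[gene] = {
            condition: value - reference
            for condition, value in by_condition.items()
        }
    return centered


centered = center_on_baseline(log_expression, baseline="glucose")
print(centered["glpT"])  # the baseline column is now zero for every gene
```

After centering, a positive value means the gene is more expressed than in the baseline condition and a negative value means it is less expressed.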
Each of these plots represents a row from the dataset as a bar graph. We show two related genes: glpT, which imports glycerol, and glpA, which helps break down glycerol. The x axis labels are names given to the various projects in which the samples were collected (for example, "Acid" refers to a project done in acidic media with deletion of various acid response regulators). Hovering over the tallest bars reveals which conditions cause each of the genes to be upregulated. You will notice that in both graphs, activity is highest when glycerol is the carbon source. It would be convenient to treat both of these genes as part of a single glycerol-consumption unit in the transcriptome - the goal of our approach is to find all such units with unsupervised machine learning.
Data scientists have developed machine learning algorithms that can address the problems described above. Unsupervised machine learning algorithms can identify patterns and structures underlying big datasets by simply using the information in the dataset itself. Independent Component Analysis (ICA) is one such method, and it has outperformed many other algorithms in detecting co-regulated gene sets.
By running the ICA algorithm on a transcriptomic dataset (see our GitHub), we generate a set of iModulons. Each iModulon is a group of genes that represents an independently modulated signal, which the cell is probably controlling using the same or related regulators. Mathematically, an iModulon has a weight for each gene and an activity for each condition. The highly weighted genes are the iModulon's members, and the highly active conditions are those in which the iModulon is likely performing a function. We characterize an iModulon by interpreting its gene members and activity levels. For example, the GlpR iModulon contains all the genes that are associated with digesting glycerol, and it is active when glycerol is present in the media. We named it 'GlpR' because that is the name of the transcription factor that co-regulates all of its genes.
X: Original transcriptomic dataset. We make the assumption that the X matrix results from a mix of underlying signals (iModulons) controlled by regulators like transcription factors (TFs), and we use the ICA algorithm to identify those signals.
M: Links genes to iModulons. A gene that is highly weighted is said to be a “member” of the iModulon, and all iModulon members are expressed as a group. If the iModulon is a sports team, the M matrix defines who the players are.
A: Links iModulons to conditions. If an iModulon is highly active in a given condition, it is probably carrying out a function that is important in that condition. If the iModulon is a sports team, the A matrix describes its playbook.
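To make the relationship between the three matrices concrete, the sketch below rebuilds a tiny X from a hand-written M and A, assuming the decomposition takes the form X ≈ M · A. It does not run ICA itself. The second iModulon, the gene "acid_gene", and all numbers are hypothetical placeholders.

```python
genes = ["glpT", "glpA", "acid_gene"]        # "acid_gene" is a placeholder
conditions = ["glucose", "glycerol", "acid"]
imodulons = ["GlpR", "AcidResponse"]         # "AcidResponse" is hypothetical

# M: one row per gene, one column per iModulon (gene weights).
M = [
    [1.0, 0.0],
    [0.5, 0.0],
    [0.0, 1.5],
]

# A: one row per iModulon, one column per condition (activities).
A = [
    [0.0, 4.0, 0.0],
    [0.0, 0.0, 2.0],
]


def matmul(left, right):
    """Plain-Python matrix product of two lists of rows."""
    return [
        [sum(l * r for l, r in zip(row, column)) for column in zip(*right)]
        for row in left
    ]


# Each entry of X is the sum, over iModulons, of weight * activity.
X = matmul(M, A)
for gene, row in zip(genes, X):
    print(gene, dict(zip(conditions, row)))
```

In this toy case glpT and glpA rise together on glycerol because they share one iModulon, which is the kind of grouping ICA is used to recover from real data.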
ICA was originally developed in the 1980s to solve the blind source separation problem, also known as the cocktail party problem. Imagine you place microphones around a crowded, noisy room. Each microphone would pick up a different combination of the speakers' voices. If we apply ICA to the resultant set of recordings, we can identify the original source signals (M) without any other information. In addition, ICA infers the volume of each source in each microphone-measured signal (A).
Similarly, a transcriptomic dataset acts like microphones into the cell, measuring the combined effects of different transcriptional regulators with various condition-specific activities. The regulators/iModulons are behaving independently in the cell, the same way that the people in the room behave independently.
iModulon structures are computed from datasets, so the quality and breadth of the data are extremely important. For our original analysis, we developed PRECISE-278 (formerly PRECISE 1), the Precision RNA-seq Expression Compendium for Independent Signal Extraction, with 278 diverse samples from E. coli K-12. We have also developed PRECISE datasets in other organisms, such as S. aureus.
One of the strengths of ICA is that it requires only transcriptomic data, which means we can also re-use existing, publicly available data. We analyzed a single-lab microarray dataset on B. subtilis and showed that different datasets and transcriptomic technologies generally return similar iModulons.
Next, we sought to apply this approach across the evolutionary tree of prokaryotes. We scraped the Sequence Read Archive for datasets from its most popular prokaryotic strains, and combined all high quality data for analysis. Since this works towards the goal of finding the set of all computable iModulons, we named these projects “Modulome”. iModulonDB rapidly expanded in its first two years thanks to this project. See our home page for the list of all organisms we have analyzed so far.
The purpose of this site is to share the powerful top-down approach of ICA with other systems biologists and microbiologists. We hope that searching for the genes and functions relevant to your research will point you toward iModulons that expand your understanding of which genes are most important to your application. We are currently working on decomposing additional datasets to compute transcriptional regulatory networks across many organisms, which will advance our knowledge significantly in the age of big data.
To use this site, select your favorite organism from the list on our home page. This will take you to a dataset page, which contains a list of the iModulons we have computed and characterized, as well as the publication in which we describe the set. Click on a row in the iModulon table to see its dashboard, where you can learn about its gene members, activity, and regulator enrichments.
Alternatively, select 'Gene Search' from the dataset page and type in your genes of interest. This will bring you to a similar dashboard, listing the most relevant iModulons for your gene. Note that some genes are removed prior to running ICA if they are never expressed or shown to be extremely noisy within conditions; if that is the case, your gene will not show up in our search.
Lastly, select the 'Analysis Page' to conduct dataset-wide analyses on the curated datasets in iModulonDB.
To learn more about each page in iModulonDB, select from the buttons below. Hover over the figures for an additional description not provided on the corresponding dashboard page.
Here, the iModulon name is given in large font. Below, several descriptors are given, including the biological function and category of iModulon. If the iModulon is found to map to a known regulator, then the "Regulated by" descriptor will be followed by the regulator name. If available, the regulator name will appear as a link to the appropriate database (e.g. RegulonDB). Other statistics, such as the Precision, Recall, and Explained Variance are also shown.
The gene table lists the genes in an iModulon. Clicking a row will take you to the 'Gene Page' for that gene. This table is scrollable in both directions. Clicking the arrows in the header sorts the contents by the given feature, and right clicking on the header allows columns to be moved to the right end or hidden. This table is also downloadable from the "Download" dropdown menu in the site header.
For each statistically-determined signal, each gene has a "weight" that represents its importance towards the signal. Genes with weights that occur above a determined threshold (outside the vertical lines) are considered to be a part of the iModulon. Note that the y-axis has a logarithmic scale. Hover over the bars to see the associated genes. If the iModulon has a regulator enrichment, the genes in the regulon will be shown in color. Click the elements in the legend to hide or show the associated bars.
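The membership rule can be sketched as a simple filter on gene weights. This is a minimal illustration, not the site's implementation: the weights, the placeholder gene names, and the threshold value are invented, and the rule is assumed to be symmetric (large positive or large negative weights both count).

```python
# Invented gene weights for one iModulon; "geneX" and "geneY" are placeholders.
weights = {
    "glpT": 0.31,
    "glpA": 0.27,
    "geneX": 0.01,
    "geneY": -0.02,
}

# Hypothetical threshold; the real value is determined per iModulon.
threshold = 0.1


def imodulon_members(gene_weights, cutoff):
    """Return genes whose absolute weight lies outside the cutoff lines."""
    return sorted(
        gene for gene, weight in gene_weights.items() if abs(weight) > cutoff
    )


members = imodulon_members(weights, threshold)
print(members)
```

Genes with weights near zero stay between the threshold lines and are not counted as members.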
Similar to the gene histogram, the gene weight scatter plot shows the weight of each gene towards the signal. In this figure, the gene weights are plotted against the gene start site. They are colored by category. Again, genes outside the horizontal lines are in the iModulon – click on them to access their 'Gene Pages'. Gene names, values, and categories can be displayed by hovering over each point.
The activity bar graph shows the relative activity level of the iModulon across all the experimental samples included in the dataset. The plot is divided by the source study for the dataset, and clicking a bar will take you to the associated publication. To better understand each study, the experimental and control variables are provided alongside the publication's abstract in the "Projects" section of the "Dataset Page". The black points represent each biological replicate for the sample, and the blue bars represent the averaged activity from the biological replicates. Hovering over the data shows the sample name, the activity value, and some associated metadata. The menu in the top right of the figure includes additional options to download the figure or underlying data. The wrench icon allows you to see all the metadata we have for these samples; metadata with a checkmark will be displayed when you hover over a given bar. Click the ink button to color the samples in the bar by that metadata. For example, clicking the pH ink will color each sample by pH to easily reveal which samples are under acid or base stress.
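The relationship between the black replicate points and the blue bars can be sketched as a grouped average. The sample names and activity values here are invented, and the function name is hypothetical.

```python
from collections import defaultdict
from statistics import mean

# (sample name, iModulon activity) for each biological replicate; invented values.
replicate_activities = [
    ("glucose", 0.25),
    ("glucose", -0.25),
    ("glycerol", 6.0),
    ("glycerol", 7.0),
]


def mean_activity_by_sample(records):
    """Average the replicate activities that share a sample name."""
    grouped = defaultdict(list)
    for sample, activity in records:
        grouped[sample].append(activity)
    return {sample: mean(values) for sample, values in grouped.items()}


averaged = mean_activity_by_sample(replicate_activities)
print(averaged)
```

Each key corresponds to one bar, and the individual values in replicate_activities correspond to the points drawn over it.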
If the iModulon is mapped to a regulator, the Venn diagram shows the overlap between the genes in the iModulon and the literature-derived regulon. On the 'iModulon Page', you can hover over each part of the Venn diagram to see the number of genes and gene names associated with each group in the diagram.
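The overlap in the Venn diagram can be tied to the Precision and Recall statistics with a short set calculation. The sketch assumes the usual set-overlap definitions (precision is the fraction of iModulon genes that are in the regulon, and recall is the fraction of regulon genes that are in the iModulon); the gene sets are invented and "geneX", "geneY", and "geneZ" are placeholders.

```python
# Invented gene sets for illustration.
imodulon_genes = {"glpT", "glpA", "geneX"}
regulon_genes = {"glpT", "glpA", "geneY", "geneZ"}

# The three regions of the Venn diagram.
shared = imodulon_genes & regulon_genes
imodulon_only = imodulon_genes - regulon_genes
regulon_only = regulon_genes - imodulon_genes

# Assumed definitions: overlap relative to each set's size.
precision = len(shared) / len(imodulon_genes)
recall = len(shared) / len(regulon_genes)

print(sorted(shared), sorted(imodulon_only), sorted(regulon_only))
print(round(precision, 2), recall)
```

A high precision with a low recall would suggest that the iModulon captures only part of the known regulon.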
(Left) The correlation scatter plot appears if the iModulon can be mapped to a transcription factor or sigma factor. Multiple plots are generated for each known regulator. The plot shows the linear correlation between the iModulon activity level and the expression of its regulator across conditions. Samples from experiments where the iModulon or its regulator has been perturbed (marked by "X") are not used to determine the correlation. Perturbation of an iModulon includes the deletion, mutation, or overexpression of i) its regulator; ii) one or more of its non-regulatory genes; or iii) a regulator of one or more of its non-regulatory genes. An R2 value above 0.64 indicates a strong correlation, and an R2 value above 0.16 indicates a moderate correlation. Samples are color-coded by their associated publication. The legend shows R2 values calculated for non-perturbed samples grouped by their associated study.
(Right) Samples are grouped by study and the associated R2 value and number of samples are shown.
On the 'iModulon Page', hover over each sample to display the project and experimental condition or click each series in the legend to toggle its visibility on the graph.
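The correlation described for this plot can be sketched as follows: drop the perturbed samples, compute the squared Pearson correlation between regulator expression and iModulon activity, and compare it with the 0.64 and 0.16 cutoffs. The data are invented, the function names are hypothetical, and the label used for values below 0.16 is an assumption, since the text names only the strong and moderate ranges.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant series has no defined linear correlation
    return (sxy * sxy) / (sxx * syy)


def describe(r2):
    """Map an R2 value to the ranges given in the text."""
    if r2 > 0.64:
        return "strong"
    if r2 > 0.16:
        return "moderate"
    return "below moderate"  # label assumed; not named in the text


# (regulator expression, iModulon activity, perturbed?) with invented values.
samples = [
    (1.0, 2.0, False),
    (2.0, 4.0, False),
    (3.0, 6.0, False),
    (4.0, 8.0, False),
    (0.0, 9.0, True),   # e.g. a regulator knockout, excluded from the fit
]

kept = [(x, y) for x, y, perturbed in samples if not perturbed]
r2 = r_squared([x for x, _ in kept], [y for _, y in kept])
print(len(kept), describe(r2))
```

Leaving the perturbed sample in would lower the R2 considerably here, which is why such samples are marked and excluded.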
For each gene in the iModulon:
Here, the gene name is given in large font. Below, several descriptors are given, including the locus tag, protein product, operon, biological function, and any known regulators of the gene. Links are provided to the gene's transcription unit diagram (BioCyc, center) and its known protein-protein interactions (StringDB, right).
Examining the gene's operon structure can provide a good indication of whether the gene belongs in an iModulon. For this reason, we provide direct links to BioCyc's operon diagram for all genes across all organisms in iModulonDB.
Examining the protein-protein interaction (PPI) network of the gene product can also provide a basis for its belonging to its statistically derived iModulon. If the gene product interacts directly with the products of other genes in the iModulon, it is a good indication that the iModulon structure describes a physiological set of co-expressed genes. For this reason, iModulonDB provides links to the PPI network found on StringDB.
Independent component analysis (ICA), the algorithmic basis for defining co-expressed genes (iModulons) in RNAseq datasets, may map the same gene to multiple iModulons. For each gene, a list of all iModulons is provided. The table is rank-ordered by the gene's weight in each iModulon. The green checkmarks and red "X"s denote the gene's relationship (defined by ICA) to each iModulon. On the 'Gene Page', each row is clickable and directs to the corresponding 'iModulon Page'.
The bar graph shows the expression level of the gene across all the experimental samples included in the dataset. The plot is divided by the source study for the dataset.
(Left) The correlation of a gene's expression level to the activity of its assigned iModulon can provide a clear indication that the gene belongs in the iModulon. Samples are color-coded by their associated publication to facilitate understanding of condition-specific disruptions in iModulon activity or gene expression. Samples in which the gene or its iModulon is perturbed (marked by "X"s) are excluded from the correlation calculation. (Right) Non-perturbed samples are grouped by their associated study and the linear correlations are recalculated. An R2 greater than 0.64 indicates a strong correlation. An R2 greater than 0.16 indicates a moderate correlation.
iModulonDB comprises RNAseq data generated from over 200 publications. For each dataset, samples are separated by their associated publication and project name. The project names are displayed as clickable buttons. In this example, the "acid" project from the "precise-278" dataset has been selected to reveal the associated publication information and sample metadata below.
When available, the title, authors, and abstract of the publication associated with each project are displayed.
This section describes all variables that remain constant across all samples in this project. This information comes directly from the metadata of each sample. In this example, all samples are taken from unevolved E. coli MG1655 batch cultures in M9 Media at 37°C.
This section describes all variables that differ between samples in this project. This information comes directly from the metadata of each sample. In this example of the "acid" project, the first two samples are taken from wildtype strains while the next two samples are ΔgadX strains. This table is scrollable to reveal all samples in the selected project.
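The split between constant and varying variables can be sketched by comparing each metadata field across the samples of a project. The field names and the helper split_metadata are hypothetical; the values loosely mirror the "acid" example in the text.

```python
# Invented metadata records for four samples in one project.
samples = [
    {"strain": "MG1655", "media": "M9", "temperature_C": 37, "genotype": "wildtype"},
    {"strain": "MG1655", "media": "M9", "temperature_C": 37, "genotype": "wildtype"},
    {"strain": "MG1655", "media": "M9", "temperature_C": 37, "genotype": "gadX deletion"},
    {"strain": "MG1655", "media": "M9", "temperature_C": 37, "genotype": "gadX deletion"},
]


def split_metadata(records):
    """Separate fields shared by every sample from fields that differ."""
    first = records[0]
    constant = {
        field: value
        for field, value in first.items()
        if all(record[field] == value for record in records)
    }
    varying = [field for field in first if field not in constant]
    return constant, varying


constant, varying = split_metadata(samples)
print(constant)
print(varying)
```

The constant fields correspond to the section describing what is held fixed in the project, and the varying fields correspond to the per-sample table.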
A table of all iModulons across the dataset is displayed. For each iModulon, its name, known regulators, function, COG category, and number of genes (N) are provided. Explained variance, precision, and recall of the ICA method are also provided. The rows in this table are clickable and redirect to the corresponding 'iModulon Page'; they are also selectable, and the selection is used for the pairwise iModulon analysis below. In this example, the GadEWX and GadWX iModulons are selected for further analysis on the 'Dataset Page' below.
If the activities of two iModulons are correlated, their biological functions may be synergistic. After you select two iModulons from the table above, the Analyze iM Pairs button finds the linear correlation between their activities across all samples. Up to four iModulons can be selected from the table above.
The iModulon Phase-Plane refers to the pairwise scatter plot of the activity of two iModulons. The linear-fit and R2 value are provided. Samples are color-coded by project. On the 'Dataset Page', hover over each datapoint to reveal additional sample metadata.
Synergistic functions are often indicated by overlapping sets of genes between multiple iModulons. The set of genes belonging to one or both of the selected iModulons is provided in the Venn diagram. On the 'Dataset Page', hover over each section of the graph to reveal the names of the genes in that section. Note: correlated iModulons with little or no gene overlap provide opportunities for novel understanding and future investigation.
The synergy between two iModulons can be understood further by examining which genes are most highly weighted in each iModulon. On the 'Dataset Page', hover over each datapoint to reveal the gene.
Often, iModulon activity can be affected by the experimental conditions in a specific project. Within a project, experimental variables (e.g. samples from evolved vs non-evolved strains) can shift the activities of the selected iModulons (see Dalldorf et al. on the fear vs greed tradeoff described by the RpoS/Translation iModulon phase plane). On the 'Dataset Page', you can select projects to highlight on the phase-plane graph by selecting from the projects below and clicking the analyze button again. All conditions are selected by default.
iModulonDB allows for three dataset-wide analyses. Each analysis searches the entire dataset for correlations between i) the activities of iModulon pairs; ii) the activity of an iModulon and the expression of its regulator(s); or iii) the activity of an iModulon with unknown regulation and the expression of all known regulators.
On the 'Analysis Page', you are prompted to select an R2 threshold above which correlations are plotted. R2 values greater than 0.64 indicate a strong correlation. R2 values greater than 0.16 indicate a moderate correlation.
Using an R2 threshold of 0.64, we conduct a dataset-wide analysis on the precise1k dataset in E. coli and display an example result from each analysis below.
After you have selected a desired R2 threshold, click the analyze button to identify all iModulon pairs whose activities are correlated.
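A dataset-wide pair search of this kind can be sketched as a loop over all iModulon pairs that keeps those whose activities exceed the chosen R2 threshold. This is an illustration only: the iModulon names and activities are invented, and the function names are hypothetical rather than the site's own code.

```python
from itertools import combinations

# Invented activities across four samples for three placeholder iModulons.
activities = {
    "iM_A": [0.0, 1.0, 2.0, 3.0],
    "iM_B": [0.0, 2.0, 4.0, 6.0],
    "iM_C": [1.0, -1.0, 1.0, -1.0],
}


def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return (sxy * sxy) / (sxx * syy)


def correlated_pairs(activity_table, threshold):
    """Return (name_1, name_2, R2) for every pair above the threshold."""
    hits = []
    for (name_1, act_1), (name_2, act_2) in combinations(
        sorted(activity_table.items()), 2
    ):
        r2 = r_squared(act_1, act_2)
        if r2 > threshold:
            hits.append((name_1, name_2, r2))
    return hits


pairs = correlated_pairs(activities, threshold=0.64)
print([(a, b) for a, b, _ in pairs])
```

Lowering the threshold to 0.16 would also report the two weaker pairs in this toy table, which mirrors how a looser threshold on the site returns more plots.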
On the 'Analysis Page', all iModulon pairs whose activities are correlated above your desired threshold will be plotted. See the 'Dataset Page' tab for descriptions of each graph. The Quorum Sensing / Propionate Phase-Plane is one of the correlated iModulon pairs identified in the precise1k dataset of E. coli.
During biofilm formation, bacterial cells use chemical signals to communicate in a process called quorum sensing. Propionate has been shown to inhibit biofilm formation in S. Typhimurium (Liu et al.).
In precise1k, the activities of the quorum-sensing and propionate catabolism iModulons are strongly correlated. Thus, the E. coli transcriptional regulatory network may encode the co-expression of these two distinct sets of genes (see Venn Diagram) to perform synergistic functions that promote biofilm formation. Perturbed samples (Minicoli)
On the 'Analysis Page', the regulators and functions are displayed below the graphs associated with each correlated iModulon pair. Quick access to this information facilitates knowledge-mining.
This analysis identifies all pairs of iModulons and their annotated regulators for which the activity of the iModulon and the expression of the regulator are correlated above a desired threshold.
In precise1k, the strongest iModulon-Regulator correlation is the Curli-1 iModulon / CsgD regulator pair. The CsgD regulon overlaps almost entirely with the Curli-1 iModulon, showing how this tool can be used to map iModulons to their regulators.
If there is no overlap between the genes of an iModulon and those of a known regulon, it is difficult to annotate a regulator to the iModulon. For iModulons whose regulator is unknown, this analysis identifies correlations between the iModulon's activity and the expression of all known regulators.
Cold shock induces the nsrR-rnr-rlmB-yjfI operon (Cairrao et al.). Although no samples in precise1k experience cold shock, both Cold Shock proteins and the NsrR regulator are expressed by E. coli under ROS and antibiotic stress (ROS TALE, AntibiotICA). Is it possible that E. coli activates off-target stress responses when attempting to alleviate acute stresses?
For questions, comments, feedback, or to collaborate with us, please send an email to Edward Catoiu (ecatoiu@ucsd.edu).
For more information on the Systems Biology Research Group (SBRG) at the University of California, San Diego, please see our website here.