Differential Expression and the Mouse Gut-Brain Axis
Introduction
This module provides data for an independent project exploring differential expression in RNA-seq data.
This activity will walk through differential expression analysis and use bioinformatics tools (R) to understand how gut bacteria (the gut microbiome) can influence the expression of genes in the brain (the gut-brain axis). You will work with real data from a mouse RNA-seq experiment.
The basic steps of your analysis are:
Identify mouse genes of interest for your project.
Characterize the gene expression of your genes of interest in the dataset.
Explore differential gene expression in the control vs experimental group.
The original study
The original study, Human Gut Microbiota from Autism Spectrum Disorder Induces Behavioral Deficits in Mice, was published in 2019 [1]. Gut microbiota are known to be different between individuals with ASD and individuals who are considered typically-developing. Additionally, some individuals with ASD also experience gastrointestinal symptoms, and their gut microbiota show the greatest difference when compared to the gut microbiota of typically-developing individuals. Some researchers have proposed that gut bacteria can influence some of the symptoms of ASD. The relationship between the intestinal microbiome and the development and function of the human brain is known as the gut-brain axis.
In this study, researchers explored whether they could induce ASD-like behaviors in mice by changing their gut microbiome. Mice in this experiment received fecal transplants. Some mice received transplants from humans who have been diagnosed with ASD, while other mice received transplants from humans who did not have any diagnosis (control). This allowed researchers to control the composition of the gut microbiome in each mouse. They discovered that colonization with gut microbiota was enough to induce ASD-like behaviors in the mice. They also let the mice breed and collected gene expression data from the brains of their offspring to explore whether changing the gut microbiota could result in changed gene expression. In particular, they discovered that the offspring of mice who received stool from ASD donors showed different gene splicing and expression profiles of certain ASD-relevant genes.
It is important to note that researchers are not suggesting that ASD is entirely induced by gut bacteria. There is a strong genetic component to ASD. Scientists have known for years that there are both genetic and environmental components to the development or severity of some ASD symptoms. This research explores one possible environmental component.
You can read the original research paper here.
The Gut-Brain Axis
The gut-brain axis is the term used for the proposed connection between the gut microbiome and the brain. In recent years, studies have identified links between neurological symptoms and the presence of intestinal inflammation. There are around 100 million neurons in the human gut, and around 70% of the serotonin (an important neurotransmitter) in the human body is made in the gut. Communication between the microbes in the gut and the brain happens through multiple routes, including immune signaling (signals sent to the brain by immune cells), stimulation of the vagus nerve, and activation of the enteric nervous system (a system of neurons in the intestinal that are often called the “second brain”) through formation of microbial-derived metabolites like tryptophan, short-chain fatty acids, or secondary bile acids.
In the figure above, the “up” arrow indicates an increase; this could be increased production of a protein, increased activity of a biological pathway, or a general increase in the presence of a structure. The “down” arrow indicates a decrease.
Researchers have identified multiple ways the intestinal microbiota in the gut might remodel the structure of the nervous system. This includes changes in myelination, neurogenesis, neurotransmitters and synaptic plasticity, growth of dendrites, activation of microglia, and changes in the permeability of the blood-brain-barrier. The exact mechanism for these changes is still unknown.
Why are the neural processes mentioned above important?
myelination: the process of forming protein sheaths (called myelin) around the long axons of nerves. This insulates the axon and improves the speed and efficiency of transmission of neural signals.
neurogenesis: the formation of new neurons, particularly in the adult central nervous system. This is important for memory and learning.`
synaptic plasticity: the process of strengthening or weakening connections at the synapse of nerves (where the signal is transferred from one nerve to another). This sort of adaptive change is fundamental for memory and learning, but also helps with recovery after injury.
dendritic growth: the process by which neurons form new branches (dendrites) and spines (small protrusions on the dendrites). This is a key aspect of synaptic plasticity and is essential for brain development, learning, and memory formation.
microglia: immune cells found in the central nervous system. They support central nervous system development and synaptogenesis (formation of new synapses), maintain homeostasis, contribute to an immune response against infectious agents, and are associated with adult neurogenesis, neuroinflammation, degenerative diseases, stroke, trauma and regeneration.
blood-brain-barrier: a highly selective semi-permeable barrier located in the central nervous system that separates the bloodstream from the fluid that surrounds the brain. It maintains a stable environment for neuronal function and regulates passage of molecules between the brain and the rest of the body.
Remodeling of the nervous system via the gut microbiome is suspected to be responsible for the severity of some symptoms associated with neurodevelopmental and neurodegenerative disorders. A possible link between the gut microbiome and increased symptom severity is actively being studied for Major Depressive Disorder, Autism Spectrum Disorder, fibromyalgia, migraines, Parkinson’s disease, Alzheimer’s disease, and more.
The samples in this dataset are from mice that do not carry genes associated with either neurodevelopmental or neurodegenerative diseases. Instead, any symptoms or changes in the brain they show are the result of microbes introduced to their gut via fecal transplant. This dataset gives you the opportunity to study possible effects of the gut-brain axis firsthand.
Deciding on your research question
The data for this activity includes gene expression data from two different brain regions (striatum and prefrontal cortex) in mice. Additionally, there are two different types of mice. Some of the mice are an Autism Spectrum Disorder-type (see below for details), while others are control-type.
Before you start working with the dataset, you will need to make some decisions about the questions you want to answer.
Are you interested in exploring gene expression differences between two regions of the mouse brain, without focusing on the gut-brain axis? Expand this section to read more.
This dataset includes gene expression data from two different regions in the brain: the striatum and the prefrontal cortex.
The striatum is part of the brain involved in motor control and cognitive tasks like reward processing, decision-making, and social interactions (often called executive functions). It lies deep within the center of the brain and is composed of both gray matter (which can be thought of as the “processing” part of brain tissue) and white matter (which is the brain structure involved in transporting messages); the combination of the gray and white matter give this region of the brain a striped appearance, resulting in the name “striatum”. The striatum is involved in both reflexive movement - that is, involuntary movement that happens as an immediate response to a stimulus - and slower, planned movement like walking. In Parkinson’s disease, some patients experience degeneration of parts of the striatum, resulting in spastic, uncontrollable movement.
The prefrontal cortex is the part of the brain that is primarily in charge of decision making, reasoning, personality, maintaining social appropriateness, and other complex behaviors that fall under the umbrella of executive functions. This can include planning, self-control, and working towards long-term goals. The prefrontal cortex is located in the very front of the brain, just behind your forehead. One of the most famous brain injury patients was Phineas Gage, a railroad worker who survived an iron rod through his forehead. His prefrontal cortex was destroyed in this accident, and doctors noted huge behavioral and personality changes. You can read more about his case here.
Are you interested in exploring neurodevelopmental disorders? These are disorders that are the result of altered brain development. Although this dataset was created using “ASD-type” mice, many neurological disorders have shared symptoms and probably share a biological basis. Expand this section to read more.
Although the following disorders can be diagnosed at any age, they are considered neurodevelopmental disorders because symptoms usually show up within the first two years of life. Many of these disorders have overlapping symptoms, including difficulty managing emotions, experiencing intense emotional outbursts, a dysregulated stress response system, difficulty with executive functions (like attention regulation, organization, and planning), and issues with social communication and unspoken social norms. Many of these disorders have similar biological roots, such as neuroinflammation, genetic variation, and disruptions in systems that rely on dopamine, norepinephrine, and serotonin (which can cause impaired connections between regions of the brain, called white matter connectivity). People with these disorders also show differences in brain regions like the prefrontal cortex and the amygdala compared to people without neurodevelopmental disorders. For example, malfunctioning of the hypothalamic-pituitary-adrenal (HPA) axis is common in neurodevelopmental disorders. The HPA axis regulates the brain and body’s response to stress. Many people with neurodevelopmental disorders show an unusually strong reaction to stress and can have difficulty recovering from it.
This is not a comprehensive list of all neurodevelopmental disorders - you may know of others that are not included! Any of these could be a good starting point for your research question.
Autism spectrum disorder (ASD) is a neurological disorder that affects behavioral and social interactions, among other things. Individuals diagnosed with ASD can experience a wide range of symptoms, including differences in social behaviors and communication styles, as well as intellectual disabilities and physical issues like sensory sensitivities or gastrointestinal problems. Some genes that have been suggested as ASD risk genes include SHANK3, NRXN1, CHD8, ARIH2, ZDHHC15, and TRPC6. Individuals with ASD also show changes in both white and gray matter volume, as well as an imbalance in connectivity of neural circuits (there are more connections and stronger connections than typical within local regions of the brain, and both fewer and weaker connections between distant regions of the brain). The neural circuits that are imbalanced connect brain regions like the prefrontal cortex and the amygdala.
Attention-Deficit/Hyperactivity Disorder (ADHD) is characterized by difficulties with inattention, hyperactivity, and impulsivity. Individuals may show predominantly inattention symptoms (predominantly inattentive ADHD), predominantly hyperactivity and impulsivity symptoms (predominantly hyperactive ADHD), or a combination of all (combined type ADHD). Some proposed ADHD risk genes are SKI, ZNF544, ST3GAL3, and PEX2. Additionally, researchers believe alterations to the prefrontal cortex and the dopaminergic system are major contributing factors for ADHD.
Learning Disorders (LDs) include disorders such as dyslexia, dyscalculia, and dysgraphia. Each of these is characterized by difficulty developing a fundamental academic skill, including reading (dyslexia), math and working with numbers (dyscalculia), writing (dysgraphia), listening (auditory processing disorder), speaking (language processing disorder), or non-verbal reasoning (nonverbal learning disorder). Individuals with LDs frequently show differences neural connectivity within brain regions like the prefrontal cortex, hippocampus, and parietal lobes, which may contribute to decreased brain processing speed and executive functioning. Research also suggests that altered tryptophan metabolism is associated with LDs.
Mood Disorders are a group of disorders in which affected individuals exhibit emotional state changes, which can include episodes of deep sadness (dysphoria), lack of feeling anything (anhedonia), and extreme happiness or euphoria (hypomania or mania). The most well-known of these disorders are Major Depressive Disorder (MDD) and Bipolar Disorder (BD). Some individuals with Schizoaffective Disorder or Schizophrenia (SCZ) also experience hallucinations and psychosis. Variation in serotonin transporter genes and dopamine receptor genes have been associated with Mood Disorders, as well as neurotransmitter alterations and brain circuit dysfunction, particularly within the prefrontal cortex and the limbic system (where the amygdala is).
Mood disorders like depression, bipolar, and schizophrenia are not currently listed as neurodevelopmental disorders in the American Psychiatric Association’s Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition (DSM-5). This is the official handbook of psychiatric disorders used for diagnosis in the field. However, researchers have suggested that mood disorders should be considered neurodevelopmental disorders due to growing evidence that neurodevelopmental pathways play a role in their development and manifestation. For the purposes of this module, we have included them in the “Neurodevelopmental Disorder” section.
If you’d like to read more about the biology of neurodevelopmental disorders, (Bertollo et al. 2025)[https://www.mdpi.com/2076-3425/15/3/307] is a great starting point!
Are you interested in exploring neurodegenerative disorders? These are disorders that that occur when brain structures and systems experience degradation and may be the result of similar biological processes that cause neurodevelopmental disorders. Expand this section to read more.
Neurodegenerative disorders are usually considered diseases of aging because a person’s risk of developing one these disorders increases as they age. These disorders are the result of the breakdown and eventual loss of neurons and synaptic connections. An individual’s specific symptoms and diagnosis are the result of where in the brain or central nervous system the breakdown of neurons and synapses occur. Although the symptoms and progression may differ between diseases, the biological basis is generally one of seven things: a genetic mutation, accumulation of misfolded proteins, excitotoxicity (overstimulation of glutamate receptors, ultimately causing cell death), loss of regulatory function of noncoding RNAs (causes biological processes to break down), oxidative stress, mitochondrial dysfunction, or neuroinflammation (when the immune system attacks and damages the neurons).
This is not a comprehensive list of all neurodegenerative disorders - you may know of others that are not included! Any of these could be a good starting point for your research question. Remember that this is a gut-brain axis dataset, so you should verify that disorders that are not on this list do have an association with changes in the gut microbiome first.
Alzheimer’s Disease (AD) is the leading cause of dementia in seniors and is marked by progressive decline in memory, cognition, and behavior. It generally begins as a result of neuron loss in the hippocampus, or the region of the brain responsible for episodic memory. Beta-amyloid plaques and clumps of tau proteins are deposited on nerve cells, which prevents normal nerve impulses from traveling between neurons. This eventually leads to cell death and deterioration of brain tissue. Tau proteins are also present in normal brain tissue to stabilize microtubules and aid transport within neurons. They cause problems when they become damaged and detach from microtubules to form tangles, or clumps.
Parkinson’s Disease (PD) is a movement disorder of the nervous system. Over time, individuals with PD lose their ability to regulate and control fine motor movement, resulting in tremors, rigid muscles, slow movement (bradykinesia), and difficulties with balance and coordination. They can also show cognitive decline. These symptoms are the result of gradual deterioration of dopamine-producing neurons, specifically in an area of the brain called the substantia nigra. The substantia nigra is responsible for motor movement and has two parts - the pars compacta, which produces dopamine, and the pars reticulata, which uses GABA (another type of neurotransmitter).
Multiple Sclerosis (MS) is an immune-mediated neurodegenerative disorder. In MS, the immune system attacks the protective myelin covering around neurons in both the brain and the spinal cord, causing inflammation, demyelination (of loss of the protective myelin sheath), and damage to the underlying neuron. Eventually, these neurons become less efficient at passing electrical impulses to connecting neurons. Individuals with MS experience fatigue, numbness or tingling, difficulties with coordination, blurred or double vision, and memory and concentration issues.
Amyotrophic Lateral Sclerosis (ALS), also known as Lou Gehrig’s Disease, is caused by progressive degeneration of motor neurons in the brain and spinal cord. The exact mechanisms that trigger this degeneration are not known, but research has suggested that oxidative stress and mitochondrial dysfunction might contribute to the development of ALS. Additionally, the genes C9orf72, SOD1, TARDBP, and FUS have been identified as risk factors for developing an inherited form of the disease.
If you’re interested in reading more about the biology of neurodegenerative disorders, Gadhave et al. 2021 is a good place to start!
Refining your research question
Once you’ve picked brain structure, a neurodevelopmental disorder, or a neurodegenerative disorder to focus on, you’ll want to start identifying specific genes for your research project. The following questions could be a good starting point.
Is there a gene or set of genes linked to a biological process you’re particularly interested in? This could include things like “cytokines”, “anti-inflammatory genes”, “genes involved in building neurons”, or any biological process you find interesting!
Are there a known risk genes for your disease or disorder of interest (or genes that are protective against this disorder)? Remember, you don’t have to limit yourself to Autism Spectrum Disorder, either; there are many disorders that have a shared genetic and biological basis! If you decide to start here, check that researchers have linked the symptoms of your disease or disorder of interest to the gut microbiome.
Are you curious about the standard “housekeeping” genes that are broadly expressed across cell types and involved in common biological processes, like DNA replication, metabolism, and cell cycle regulation? These processes are vitally important for proper cellular function, so disruption in them can cause problems for the organism.
Once you’ve picked a disorder or cellular process that interests you, start looking up research papers to find genes! You can use sites like PubMed and Google Scholar to look up studies. Skim through at least 4-5 different papers that contain some type of gene name or a biological process. DO NOT GET TOO LOST INTO THE PAPERS! Make sure you skim through them for important information, but be willing to set a paper aside if you’re struggling to find what you need.
Identifying genes of interest with the Mouse Genome Informatics database
The Mouse Genome Informatics database that tracks mouse genes and expression data. A full introduction to everything available through the MGI can be found here. We’ll reproduce some of it below.
The mouse as a model organism
The mouse is the most commonly-used model organism in laboratory work. In fact, mice and rats make up 95% of the lab animal population, and more than 80% of the research that has been awarded the Nobel Prize for Medicine was done at least in part with mouse models [2], [3].
So what makes mice such good model organisms for biomedical research? Well, first, they’re economical and relatively easy to keep. Since mice are small, they don’t require a huge amount of space or food. They also have fast reproductive cycles, so researchers can study multiple generations within only a few years. Most importantly, though, mice and humans are both mammals and have about 85% of their protein-coding genome in common. As a result, mouse physiology is quite similar to human physiology. The mouse circulatory, reproductive, digestive, hormonal, and nervous systems are frequently used as models to study how humans grow, age, and develop chronic diseases. They are particularly important model organisms for cancer research and neuroscience.
You can find additional information about how the mouse is used in research here!
Mouse gene IDs
In this dataset, genes are identified using their Ensembl Gene ID code. Every gene has an ID that looks something like this:
ENSMUSG00000000001
ENS stands for “Ensembl”. Ensembl is a genome database project managed by the European Bioinformatics Institute. It’s one of several databases like this. Others include NIH’s National Center for Biotechnology Information (NCBI; the organization that manages GenBank and PubMed) and the University of California, Santa Cruz (UCSC) Genome Browser. When a gene code starts with “ENS”, it means you should look up the gene code in the Ensembl database.
MUS stands for “Mus”, which is the genus for the mouse.
G stands for “Gene”. When you see “G” in the ID name, you know you are working with a gene. There are also codes for transcripts (“T”), exons (“E”), and proteins (“P”).
00000000001 is the numerical code associated with the gene.
Picking your genes
The MGI allows you to look up information about genes you’re interested in but don’t know much about. This is particularly helpful if you are interested in genes that were identified in other research organisms or in previous studies. You could also start by looking up a human disease and associated human risk genes, then identify the homologous mouse genes.
Make sure you are recording the name and gene ID for the mouse! Gene names are often similar between mice and humans, but the gene IDs will be very different. If you don’t use the correct gene ID when exploring this dataset, you won’t get very far.
There are multiple ways to use MGI to find information about your mouse genes.
Looking up details on a mouse gene when you have the Ensembl gene ID
Open the Mouse Genome Informatics website. To look up information on a particular gene of interest, choose the “Genes” button.
Next, type the gene ID into the “Search” bar up top. We’ll look up “ENSMUSG00000079516”.
After you type the gene ID into the Search bar and hit enter, you should see a new page with basic information about the gene. Click on the gene symbol (in this example, Reg3a) to get more detailed information.
You might come across some unexpected terms when you search the MGI for your gene ID. In addition to genes, Ensembl gene IDs are also given to “pseudogenes”, “putative genes”, and “lncRNA”.
pseudogene: This is a stretch of DNA that looks like a gene but doesn’t actually code for any protein products. It’s essentially a copy of a gene that contains mutations that prevent translation into a protein product. The mutations can include partial deletions, missing promoters, missing start codons, premature stop codons, frameshift mutations, or missing introns. Any of these are enough to result in a pseudogene.
putative gene: This is a DNA segment that is believed to be a gene, but its function and protein product has not been confirmed. They are frequently identified based on the presence of an Open Reading Frame. Putative genes are not given names until they become confirmed genes.
lncRNA: This stands for “long non-coding RNA”. lncRNA is a type of RNA molecule that is transcribed from DNA but does not code for proteins. These RNA molecules are at least 200-500 nucleotides long and play roles in various biological processes, like gene regulation.
On the new page that you open, details about the gene are organized into familiar categories. Down the left-hand side of the page, you will see sections about the chromosomal location, homology, gene ontology, expression data, and more. Most sections are expanded by default, but you’ll need to expand the “homology” section yourself.
Once this section is expanded, you can find information about possible human homologs to the mouse gene, including alternate names and where the human homolog is located in the human genome.
If you continue scrolling down the page, you can also examine the pathways and processes the gene product is involved in under the “gene ontology” section. Clicking on the blue squares takes you to a page with more information about how that particular gene was assigned to a pathway or molecular process.
Directly underneath the “ontology” section is information about when the gene is expressed during development. You can learn more information by clicking on the blue squares, or by clicking on the links in the upper right-hand corner. These links will take you to other websites.
Looking up genes based on human gene name
You may have identified a human gene that you’re interested in for your project, so you need to figure out what the mouse ortholog is and (more importantly) what the mouse Ensembl ID is. Luckily, the MGI allows you to search based on the human gene name.
Let’s say you’re interested in the human gene REG3A, pancreatic secretory protein that may be involved in cell proliferation or differentiation. It also is upregulated as a result of pancreatic inflammation.
Open the Mouse Genome Informatics website. To look up information on this human gene, type “REG3A” into the “Quick Search” bar and press enter.
You’ll notice that MGI returns 3 possible mouse genes that are orthologs to the human gene REG3A. This is totally normal! Remember, orthologs are gene that shares the same ancestry. In this case, there was likely a gene duplication event in the mouse ancestor after it split off from the human ancestor.
You can pick whichever ortholog you like (or even pick all of them). For this example, we’ll choose the mouse gene “Reg3a”. Click on the gene symbol (in this example, Reg3a) to get more detailed information.
You might come across some unexpected terms when you search the MGI for your gene ID. In addition to genes, Ensembl gene IDs are also given to “pseudogenes”, “putative genes”, and “lncRNA”.
pseudogene: This is a stretch of DNA that looks like a gene but doesn’t actually code for any protein products. It’s essentially a copy of a gene that contains mutations that prevent translation into a protein product. The mutations can include partial deletions, missing promoters, missing start codons, premature stop codons, frameshift mutations, or missing introns. Any of these are enough to result in a pseudogene.
putative gene: This is a DNA segment that is believed to be a gene, but its function and protein product has not been confirmed. They are frequently identified based on the presence of an Open Reading Frame. Putative genes are not given names until they become confirmed genes.
lncRNA: This stands for “long non-coding RNA”. lncRNA is a type of RNA molecule that is transcribed from DNA but does not code for proteins. These RNA molecules are at least 200-500 nucleotides long and play roles in various biological processes, like gene regulation.
On the new page that you open, details about the gene are organized into familiar categories. But first, let’s take a look at the IDs on the right-hand side of the “summary” section. In particular, we’re interested in the NCBI link. Click this link.
We’re now on the NCBI page for the mouse gene Reg3a, which has the Ensembl ID for this gene. Copy this so you can later use it to look through the dataset!
Return back to the MGI page on the Reg3a gene. Down the left-hand side of the page, you will see sections about the chromosomal location, homology, gene ontology, expression data, and more. Most sections are expanded by default, but you’ll need to expand the “homology” section yourself.
Once this section is expanded, you can find information about possible human homologs to the mouse gene, including alternate names and where the human homolog is located in the human genome.
If you continue scrolling down the page, you can also examine the pathways and processes the gene product is involved in under the “gene ontology” section. Clicking on the blue squares takes you to a page with more information about how that particular gene was assigned to a pathway or molecular process.
Directly underneath the “ontology” section is information about when the gene is expressed during development. You can learn more information by clicking on the blue squares, or by clicking on the links in the upper right-hand corner. These links will take you to other websites.
Looking up genes associated with a human disease
You may not know a particular gene that you want to explore in this dataset, but maybe you do have a human disease or condition that you’re interested in. You can look up mouse genes that might be associated with this disease in the MGI as your starting point.
Open the Mouse Genome Informatics website. Choose the tab that says “Human-Mouse: Disease Connection”.
You’ll then want to choose “Disease or Phenotype Name” from the drop-down menu on the left and type the disease of interest into the search bar on the right. For this example, let’s look up “pancreatic inflammation”.
After a brief pause (in which you may see graphics of a human and mouse running on wheels), a new page will open that lists human genes associated with pancreatic inflammation, as well as the mouse orthologs for those genes. This graphic will also indicate the molecular function these genes are associated with. You can click on any of the mouse genes to explore further. Let’s click on the top hit, Abcb1a.
You should now be on a page that contains all sorts of information about the mouse gene Abcb1a. In the “summary” section up top, you can click on the NCBI ID to get the Ensembl mouse ID for this gene.
Specifically, you will find the Ensemble ID for the mouse gene Abcb1a under the “See Related” section. Copy this so you can later use it to look through the dataset!
Return back to the MGI page on the Abcb1a gene. Down the left-hand side of the page, you will see sections about the chromosomal location, homology, gene ontology, expression data, and more. Most sections are expanded by default, but you’ll need to expand the “homology” section yourself.
Once this section is expanded, you can find information about possible human homologs to the mouse gene, including alternate names and where the human homolog is located in the human genome.
If you continue scrolling down the page, you can also examine the pathways and processes the gene product is involved in under the “gene ontology” section. Clicking on the blue squares takes you to a page with more information about how that particular gene was assigned to a pathway or molecular process.
Directly underneath the “ontology” section is information about when the gene is expressed during development. You can learn more information by clicking on the blue squares, or by clicking on the links in the upper right-hand corner. These links will take you to other websites.
Now that you have your genes of interest, you’re ready to start working with the dataset.
Exploring the gene expression count data
Let’s start by looking at a dataset that includes information on how many times each gene was expressed in a sample. This is the gene expression count dataset.
This dataset gives us an idea of the scale of gene expression - for example, which genes are expressed the most? Which genes are expressed the least? This is particularly useful when you want to know if a gene is usually turned on or off in a sample.
Package Install and Load
We will need to install the tidyverse
package for this activity.
Type the following into the SciServer console and press return to run the code.
install.packages("tidyverse")
Next, we will load the package so it’s ready to use:
library(tidyverse)
Packages are collections of R code, data, and documentation that extend the base functionality of R. Think of them like “expansion packs” on top of your basic R software.
Packages are developed by the R community and made available through repositories like CRAN (Comprehensive R Archive Network), Bioconductor, and GitHub. They are especially useful if you want to do a specialized kind of analysis, such as genomic analysis!
We use the library
command to load and attach packages to the R environment. This means links the package you downloaded to your current session of R.
The “tidyverse” package that you loaded is useful for loading, wrangling, and exploring data.
Loading the gene expression count data
The gene expression count data for this project is stored online. R has the ability to load datasets from URLs with the tidyverse’s read_csv
command. Let’s load the gene expression count data for the samples that come from just the control-type mice. This contains all the prefrontal cortex and striatum samples taken from control-type mice (so, two brain regions, only one mouse type).
Remember, R code always follows the same basic format! In this case, you are telling R to open the csv file that exists at a URL and save it as an object called “gutbrain_genes”.
You can load this dataset into R using this code.
<- read_csv("https://genomicseducation.org/data/mouse_gutbrain_de_counts_controls.csv") gutbrain_genes
Rows: 55421 Columns: 59
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): GeneID
dbl (58): SRR6652471, SRR6652507, SRR6652508, SRR6652511, SRR6652512, SRR665...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Remember, R code always follows the same basic format! In this case, you are telling R to open the csv file that exists at a URL and save it as an object called “gutbrain_genes”.
When we load this dataset, we see it has 59 columns and 55,421 rows. Each row represents a gene. Each column is an individual mouse (sample), except for the first column (“GeneID”, which is the Ensembl gene IDs).
Here are the URLs for the other gene expression count datasets for this project:
Only ASD-type mouse samples (contains both prefrontal cortex and striatum; two brain regions, only one mouse type): https://genomicseducation.org/data/mouse_gutbrain_de_counts_asd.csv
Only prefrontal cortex samples (contains both ASD-type and control-type mice; two mice types, only one brain region): https://genomicseducation.org/data/mouse_gutbrain_de_counts_prefrontalcortex.csv
Only striatum samples (contains both ASD-type and control-type mice; two mice types, only one brain region): https://genomicseducation.org/data/mouse_gutbrain_de_counts_striatum.csv
Exploring and summarizing count data
When you first open the gene expression data, you want to verify that the data has loaded correctly. You can do this with the head
and tail
commands. Remember, head
prints the first 6 rows of a dataset, while tail
prints the last 6 rows.
head(gutbrain_genes)
# A tibble: 6 × 59
GeneID SRR6652471 SRR6652507 SRR6652508 SRR6652511 SRR6652512 SRR6652513
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ENSMUSG0000… 0 0 0 0 0 138
2 ENSMUSG0000… 0 0 0 0 0 0
3 ENSMUSG0000… 0 0 0 0 0 0
4 ENSMUSG0000… 0 0 0 0 0 0
5 ENSMUSG0000… 0 0 0 0 0 0
6 ENSMUSG0000… 0 0 0 0 0 0
# ℹ 52 more variables: SRR6652514 <dbl>, SRR6652457 <dbl>, SRR6652458 <dbl>,
# SRR6652459 <dbl>, SRR6652460 <dbl>, SRR6652465 <dbl>, SRR6652466 <dbl>,
# SRR6652472 <dbl>, SRR6652475 <dbl>, SRR6652476 <dbl>, SRR6652477 <dbl>,
# SRR6652478 <dbl>, SRR6652483 <dbl>, SRR6652484 <dbl>, SRR6652489 <dbl>,
# SRR6652490 <dbl>, SRR6652493 <dbl>, SRR6652494 <dbl>, SRR6652495 <dbl>,
# SRR6652496 <dbl>, SRR6652501 <dbl>, SRR6652502 <dbl>, SRR6652519 <dbl>,
# SRR6652520 <dbl>, SRR6652443 <dbl>, SRR6652389 <dbl>, SRR6652390 <dbl>, …
tail(gutbrain_genes)
# A tibble: 6 × 59
GeneID SRR6652471 SRR6652507 SRR6652508 SRR6652511 SRR6652512 SRR6652513
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ENSMUSG0000… 0 0 0 0 0 0
2 ENSMUSG0000… 17809 4859 10099 9495 10105 11322
3 ENSMUSG0000… 5384 1952 4737 4702 4385 3985
4 ENSMUSG0000… 62645 30726 49800 57108 58010 35727
5 ENSMUSG0000… 100 239 34 131 0 454
6 ENSMUSG0000… 1510 537 2655 1769 2278 1810
# ℹ 52 more variables: SRR6652514 <dbl>, SRR6652457 <dbl>, SRR6652458 <dbl>,
# SRR6652459 <dbl>, SRR6652460 <dbl>, SRR6652465 <dbl>, SRR6652466 <dbl>,
# SRR6652472 <dbl>, SRR6652475 <dbl>, SRR6652476 <dbl>, SRR6652477 <dbl>,
# SRR6652478 <dbl>, SRR6652483 <dbl>, SRR6652484 <dbl>, SRR6652489 <dbl>,
# SRR6652490 <dbl>, SRR6652493 <dbl>, SRR6652494 <dbl>, SRR6652495 <dbl>,
# SRR6652496 <dbl>, SRR6652501 <dbl>, SRR6652502 <dbl>, SRR6652519 <dbl>,
# SRR6652520 <dbl>, SRR6652443 <dbl>, SRR6652389 <dbl>, SRR6652390 <dbl>, …
You may want to explore the distribution of your count data using summary
. This command lets you see the mean, median, minimum, and maximum for each numeric column. You can specify which column you’d like to get summary data for, or look at summary data for all the columns.
summary(gutbrain_genes)
GeneID SRR6652471 SRR6652507 SRR6652508
Length:55421 Min. : 0 Min. : 0 Min. : 0
Class :character 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0
Mode :character Median : 0 Median : 0 Median : 0
Mean : 33443 Mean : 20337 Mean : 37955
3rd Qu.: 6070 3rd Qu.: 3741 3rd Qu.: 6925
Max. :81600128 Max. :36770209 Max. :67440774
SRR6652511 SRR6652512 SRR6652513 SRR6652514
Min. : 0 Min. : 0 Min. : 0 Min. : 0
1st Qu.: 0 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0
Median : 0 Median : 0 Median : 0 Median : 0
Mean : 30073 Mean : 30832 Mean : 28011 Mean : 29367
3rd Qu.: 5563 3rd Qu.: 5682 3rd Qu.: 5237 3rd Qu.: 5438
Max. :55736007 Max. :57074021 Max. :72009900 Max. :74965569
SRR6652457 SRR6652458 SRR6652459 SRR6652460
Min. : 0 Min. : 0 Min. : 0 Min. : 0
1st Qu.: 0 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0
Median : 0 Median : 11 Median : 0 Median : 0
Mean : 32355 Mean : 33191 Mean : 29103 Mean : 30476
3rd Qu.: 6193 3rd Qu.: 6355 3rd Qu.: 5262 3rd Qu.: 5582
Max. :49593013 Max. :51032545 Max. :71526446 Max. :74430264
SRR6652465 SRR6652466 SRR6652472 SRR6652475
Min. : 0 Min. : 0 Min. : 0 Min. : 0
1st Qu.: 0 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0
Median : 0 Median : 0 Median : 0 Median : 0
Mean : 29997 Mean : 31385 Mean : 34443 Mean : 34381
3rd Qu.: 5546 3rd Qu.: 5852 3rd Qu.: 6252 3rd Qu.: 6475
Max. :57531013 Max. :59972077 Max. :84369893 Max. :77956051
SRR6652476 SRR6652477 SRR6652478 SRR6652483
Min. : 0 Min. : 0 Min. : 0 Min. : 0
1st Qu.: 0 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0
Median : 0 Median : 12 Median : 14 Median : 0
Mean : 35388 Mean : 33333 Mean : 35008 Mean : 31335
3rd Qu.: 6603 3rd Qu.: 6433 3rd Qu.: 6701 3rd Qu.: 5665
Max. :80243487 Max. :65202237 Max. :68172612 Max. :52622462
SRR6652484 SRR6652489 SRR6652490 SRR6652493
Min. : 0 Min. : 0 Min. : 0 Min. : 0
1st Qu.: 0 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0
Median : 0 Median : 0 Median : 44 Median : 0
Mean : 32953 Mean : 19139 Mean : 37176 Mean : 26957
3rd Qu.: 5923 3rd Qu.: 3847 3rd Qu.: 7516 3rd Qu.: 4681
Max. :55138668 Max. :28214915 Max. :53848429 Max. :63604032
SRR6652494 SRR6652495 SRR6652496 SRR6652501
Min. : 0 Min. : 0 Min. : 0 Min. : 0
1st Qu.: 0 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0
Median : 0 Median : 0 Median : 0 Median : 0
Mean : 28186 Mean : 17638 Mean : 33163 Mean : 29664
3rd Qu.: 4972 3rd Qu.: 3301 3rd Qu.: 6217 3rd Qu.: 5642
Max. :66172940 Max. :39294302 Max. :72993683 Max. :58327992
SRR6652502 SRR6652519 SRR6652520 SRR6652443
Min. : 0 Min. : 0 Min. : 0 Min. : 0
1st Qu.: 0 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0
Median : 0 Median : 23 Median : 39 Median : 0
Mean : 30371 Mean : 32359 Mean : 33115 Mean : 29330
3rd Qu.: 5727 3rd Qu.: 6437 3rd Qu.: 6604 3rd Qu.: 4840
Max. :59655071 Max. :36338269 Max. :37222137 Max. :67082796
SRR6652389 SRR6652390 SRR6652391 SRR6652392
Min. : 0 Min. : 0 Min. : 0 Min. : 0
1st Qu.: 0 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0
Median : 0 Median : 50 Median : 0 Median : 0
Mean : 25002 Mean : 48996 Mean : 28239 Mean : 29750
3rd Qu.: 5037 3rd Qu.: 9909 3rd Qu.: 5075 3rd Qu.: 5427
Max. :40000312 Max. :77074660 Max. :43300682 Max. :45091266
SRR6652397 SRR6652398 SRR6652403 SRR6652404
Min. : 0 Min. : 0 Min. : 0 Min. : 0
1st Qu.: 0 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0
Median : 0 Median : 0 Median : 14 Median : 31
Mean : 37571 Mean : 39631 Mean : 34742 Mean : 35708
3rd Qu.: 6859 3rd Qu.: 7236 3rd Qu.: 6473 3rd Qu.: 6678
Max. :69988736 Max. :73339385 Max. :55633881 Max. :57265938
SRR6652407 SRR6652408 SRR6652413 SRR6652414
Min. : 0 Min. : 0 Min. : 0 Min. : 0
1st Qu.: 0 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0
Median : 0 Median : 0 Median : 0 Median : 0
Mean : 29228 Mean : 30630 Mean : 27754 Mean : 28472
3rd Qu.: 5057 3rd Qu.: 5340 3rd Qu.: 5083 3rd Qu.: 5211
Max. :50765707 Max. :53077539 Max. :40980289 Max. :42351272
SRR6652419 SRR6652420 SRR6652423 SRR6652424
Min. : 0 Min. : 0 Min. : 0 Min. : 0
1st Qu.: 0 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0
Median : 5 Median : 9 Median : 0 Median : 0
Mean : 33108 Mean : 34896 Mean : 34150 Mean : 35784
3rd Qu.: 6065 3rd Qu.: 6364 3rd Qu.: 6432 3rd Qu.: 6716
Max. :54676908 Max. :57408036 Max. :63232567 Max. :65888146
SRR6652425 SRR6652426 SRR6652431 SRR6652432
Min. : 0 Min. : 0 Min. : 0 Min. : 0
1st Qu.: 0 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0
Median : 0 Median : 0 Median : 0 Median : 0
Mean : 18147 Mean : 33256 Mean : 26368 Mean : 27580
3rd Qu.: 3160 3rd Qu.: 5892 3rd Qu.: 4718 3rd Qu.: 4987
Max. :37997986 Max. :68476058 Max. :42687318 Max. :44102839
SRR6652437 SRR6652438 SRR6652441 SRR6652442
Min. : 0 Min. : 0 Min. : 0 Min. : 0
1st Qu.: 0 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0
Median : 0 Median : 0 Median : 0 Median : 0
Mean : 19335 Mean : 36501 Mean : 30391 Mean : 32071
3rd Qu.: 3431 3rd Qu.: 6379 3rd Qu.: 5834 3rd Qu.: 6181
Max. :33456292 Max. :62315596 Max. :49428022 Max. :52047884
SRR6652444 SRR6652449 SRR6652450
Min. : 0 Min. : 0 Min. : 0
1st Qu.: 0 1st Qu.: 0 1st Qu.: 0
Median : 0 Median : 0 Median : 0
Mean : 30124 Mean : 20413 Mean : 39897
3rd Qu.: 5053 3rd Qu.: 3670 3rd Qu.: 7289
Max. :68948371 Max. :32615585 Max. :63030046
Plotting histograms and dealing with zero counts
A histogram is a good way to visually explore the distribution of the read counts, as well as get an idea of how many genes have few or no reads.
Before we make the histogram, we actually need to sum up the total reads for each gene (row). We can do this in R using the command rowSums
. Remember, our dataset has 59 columns, but only columns 2 - 59 contain count data for each gene. The first column is the “GeneID”. So we need to tell R to do the row sum for columns 2-59, then save it as a new column called “total_counts”.
$total_counts <- rowSums(gutbrain_genes[2:59]) gutbrain_genes
Now we can make the histogram using the histogram
command. Use a log10 scale on the x-axis to plot “total_counts”.
hist(log10(gutbrain_genes$total_counts), main="Log10 Read Counts for Mouse RNA-seq sample")
You may choose to deal with zero-count genes by adding 0.1 to each count before the log10 transformation.
$total_counts_trans <- gutbrain_genes$total_counts + 0.1
gutbrain_genes
hist(log10(gutbrain_genes$total_counts_trans), main="Transformed Log10 Read Counts for Mouse RNA-seq sample")
Looking at read count for a single gene
You may want to look at the read count for a specific gene in your dataset. This can be done with the filter
command. Let’s look at the read counts for the Reg3a gene, across all mice. Remember, this is the gene we looked up earlier in the MGI database. The Ensembl gene ID is “ENSMUSG00000079516”.
filter(gutbrain_genes, GeneID == "ENSMUSG00000079516")
Visualize gene counts of multiple genes across groups
We can also look at the differential gene expression of a set of genes at once! This is usually done by creating a heatmap. In RNA-seq analyses, we usually put individual samples on the x-axis and normalized gene counts on the y-axis.
When creating heatmaps, we want to work with a dataset that includes normalized gene expression data for each gene. We loaded a dataset like this above that we called gutbrain_genes
. If you don’t have an object in your environment called this, scroll back up to the “Loading the gene expression count data” section and run the code again.
The dataset you loaded has expression data for 55,421 different genes, which is too many to visualize. (You can try, but R will usually throw an error if you do.) In general, though, we are mostly interested in looking at the visualization of a particular group of genes, or a gene set.
Let’s say we have identified 6 mouse genes that have been linked to the development of motor neurons. We’re interested in whether these genes show differential gene expression between the striatum and the prefrontal cortex. We can use the filter
command to pull out normalized gene expression data for just these six genes.
These are the genes (with gene IDs) that we’re including in our gene set:
- Sema3a: ENSMUSG00000028883
- Mapk8: ENSMUSG00000021936
- Nrcam: ENSMUSG00000020598
- Dlg4: ENSMUSG00000020886
- Slit1: ENSMUSG00000025020
- Creb1: ENSMUSG00000025958
First, create a vector that contains these gene IDs. (You can replace the gene IDs with the ones for your genes of interest, but make sure you use Ensembl gene IDs! They should all start with “ENSMUSG”.)
Each gene ID should have quotation marks around it, and the different IDs should be separated using a comma.
<- c("ENSMUSG00000028883", "ENSMUSG00000021936", "ENSMUSG00000020598", "ENSMUSG00000020886", "ENSMUSG00000025020", "ENSMUSG00000025958") gene_set
Now we simply filter the gutbrain_genes
dataset so that we only keep the rows for the gene IDs in our gene set. Remember, we want to use the filter
command. The combination of filter
and %in%
tells R to only keep those rows in the gutbrain_genes
dataset if the gene ID is also found in our gene list we made earlier.
<- filter(gutbrain_genes, gutbrain_genes$GeneID %in% gene_set) gutbrain_motor
Now we have our data subsetted the way we want, but we aren’t quite ready for the heatmap yet. In order for R to make a heatmap from our data, we do have to reformat things a little bit. Right now our data is in “tibble” format, but we want it in a numerical “matrix” format. We can’t switch directly from “tibble” to “matrix” (because the GeneID column is character data, not numerical data). Instead, let’s first convert from “tibble” to a format called “dataframe”.
<- as.data.frame(gutbrain_motor) gutbrain_motor
To our eyes, not much has changed in the format, but it’s very different to R. Now we’re going to tell R to name the rows based on the “GeneID” column, and then delete the current “GeneID” column. We are also going to delete two columns we created earlier, “total_counts” and “total_counts_trans”.
rownames(gutbrain_motor) <- gutbrain_motor$GeneID
$GeneID <- NULL
gutbrain_motor
$total_counts <- NULL
gutbrain_motor
$total_counts_trans <- NULL gutbrain_motor
Great! Now we can change the format of gutbrain_motor_controls
dataset to “matrix” format in order to make our heatmap using the heatmap
function.
<- as.matrix(gutbrain_motor)
gutbrain_motor
heatmap(gutbrain_motor)
Congratulations, you have just created a heatmap! The map itself is a series of stacked boxes. Each square within the heatmap represents the gene count for a particular gene in a particular sample. By default squares with the high gene counts are colored dark red. Squares with the lowest gene counts are colored pale yellow.
By default, R will order the samples (what we have on the x-axis across the bottom) in a way that clusters similar samples together. This is why we see columns of lighter yellow in the center of the heatmap, with darker orange and red columns on the edges. This suggests that there are groups of samples that have higher gene expression, but it’s hard to tell right now if these groups correspond to the prefrontal cortex vs striatum (which is what we’re interested in). It would be helpful if we could tell R to group based on prefrontal cortex vs striatum instead.
Luckily, this is possible! Our dataset is already in the order we want (with all the prefrontal cortex samples first). We can just tell R to use the default ordering of the dataset instead of clustering and rearranging the x-axis.
When these datasets were made, the samples were ordered in a specific way so that you could easily make your heatmaps.
Only control-type mouse samples (contains both prefrontal cortex and striatum; two brain regions, only one mouse type): prefrontal cortex samples first, then the striatum samples
Only ASD-type mouse samples (contains both prefrontal cortex and striatum; two brain regions, only one mouse type): prefrontal cortex samples first, then the striatum samples
Only prefrontal cortex samples (contains both ASD-type and control-type mice; two mice types, only one brain region): ASD-type samples first, then control-type samples
Only striatum samples (contains both ASD-type and control-type mice; two mice types, only one brain region): ASD-type samples first, then control-type samples
But if you ever want to verify the details for a particular sample, you can check out the metadata! This is the data (or information) about each of the samples. The metadata is stored at <mouse_gutbrain_metadata.csv>. You can load it into R using read_csv
.
heatmap(gutbrain_motor, Colv = NA, Rowv = NA)
Perfect! Now we can easily see that four of our six genes seem to be expressed in lower levels by the samples on the right side of the heatmap (the striatum samples), while one gene is expressed in higher levels by the striatum. This is definitely intriguing!
Notice that not all the striatum samples have the same expression for each gene. This is normal variation and could be due to actual difference in gene expression levels or be the result of laboratory methods.
Finally, we can customize the heatmap by changing the color scheme and adding a title. As as example, we’re going to tell R to make a spectrum of colors between white and blue, with white indicating the lowest expression and blue indicating the highest expression, and 16 possible color categories.
You can find an extensive list of all the colors available in R at https://www.datanovia.com/en/blog/awesome-list-of-657-r-color-names/. The general convention in biology is that the lightest colors on a heatmap indicate the lowest values.
heatmap(gutbrain_motor, Colv = NA, Rowv = NA, col = colorRampPalette(c("white", "darkblue"))(16), main = " Motor Neuron Gene List Expression")
Finally, here’s all the code you just ran to create your heatmap in one place, in case you want to run it again.
#define your geneset and make a vector
<- c("ENSMUSG00000028883", "ENSMUSG00000021936", "ENSMUSG00000020598", "ENSMUSG00000020886", "ENSMUSG00000025020", "ENSMUSG00000025958")
gene_set
#create new dataset that only include expression data for your geneset
<- filter(gutbrain_genes, gutbrain_genes$GeneID %in% gene_set)
gutbrain_motor
#change format of dataset from "tibble" to "matrix"
<- as.data.frame(gutbrain_motor)
gutbrain_motor rownames(gutbrain_motor) <- gutbrain_motor$GeneID
$GeneID <- NULL
gutbrain_motor$total_counts <- NULL
gutbrain_motor<- as.matrix(gutbrain_motor)
gutbrain_motor
#create heatmap with custom colors and title
heatmap(gutbrain_motor, Colv = NA, Rowv = NA, col = colorRampPalette(c("white", "darkblue"))(16), main = " Motor Neuron Gene List Expression")
Exploring the differential expression data
Now let’s look a differential expression dataset, which compares gene expression counts between two groups. This dataset gives us an idea how the change in a gene’s expression might be associated with a disorder or a cell type. This is particularly useful for beginning to map the genetic drivers of a particular phenotype.
Package Install and Load
Before you start, make sure to load the tidyverse
package! You should still have it installed from last time.
library(tidyverse)
Loading the differential expression data
Let’s say you want to open the dataset that compares gene expression between ASD and control mice in both brain regions and call it asd_vs_c
. We will do this using read_csv
. Copy and paste this command into your console:
<- read_csv("https://genomicseducation.org/data/mouse_gutbrain_de_autismVcontrol.csv") asd_vs_c
Rows: 55421 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): gene
dbl (6): baseMean, log2FoldChange, lfcSE, stat, pvalue, padj
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Here are the URLs for all the possible comparisons you can examine with this dataset:
Comparing gene expression between ASD and control mice
Both brain regions: https://genomicseducation.org/data/mouse_gutbrain_de_autismVcontrol.csv
Prefrontal cortex only: https://genomicseducation.org/data/mouse_gutbrain_de_autismVcontrol_in_prefrontalcortex.csv
Striatum only: https://genomicseducation.org/data/mouse_gutbrain_de_autismVcontrol_in_striatum.csv
Comparing gene expression between prefrontal cortex and striatum
All mice: https://genomicseducation.org/data/mouse_gutbrain_de_tissuetype.csv
Only ASD mice: https://genomicseducation.org/data/mouse_gutbrain_de_tissuetype_in_ASDmice.csv
Only control mice: https://genomicseducation.org/data/mouse_gutbrain_de_tissuetype_in_controlmice.csv
Even though this dataset is from a single study, there are still lots of options for independent research questions. You have the option to look at gene expression in control-type vs ASD-type mice, as well as gene expression in both striatum and prefrontal cortex. All the mice were male and sacrificed at the same age (45 days).
Additionally, you can look at gene expression in striatum vs prefrontal cortex in only ASD-type mice or only control-type mice. Likewise, you also have the option of looking at gene expression in ASD and control mice, focusing only on striatum or only on prefrontal cortex.
Are you mostly interested in how gene expression differs between the two brain regions? Are you interested in how neurons develop, or how groups of genes involved in movement might be expressed differently in two different regions? These sorts of questions are best to explore using striatum and prefrontal cortex comparisons.
Are you really curious about differential expression in Autism Spectrum Disorder-type mice compared to control-type mice? Are you curious about potential associations between gene expression differences in known neurodevelopmental genes? These questions and others can be explored using the ASD-type mice and control-type mice comparisons.
Ranking the genes by log2FoldChange
The log2FoldChange
value gives us an idea of which genes show the greatest differential expression between the ASD mice and the control mice. It is a log-transformed ratio of how many gene transcripts were found in ASD-type mice compared to control-type mice. We take the log2 of this ratio (the fold change) because it makes interpretation easier. Genes with no difference in expression between groups have log2FoldChange
values close to zero. Negative log2FoldChange values means a decreased expression (or, a downregulation in gene expression) in the experimental group versus the control group. Positive log2FoldChange values indicate genes that are upregulated, or have increased expression in the experimental group versus the control group.
We can arrange the table based on these values (going from smallest to largest) by copying the following code into your console:
<- arrange(asd_vs_c, log2FoldChange)
asd_vs_c_ascending head(asd_vs_c_ascending)
# A tibble: 6 × 7
gene baseMean log2FoldChange lfcSE stat pvalue padj
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ENSMUSG00000089657 16.9 -25.5 1.81 -14.1 2.18e-45 7.29e-41
2 ENSMUSG00000102414 14.9 -25.4 1.89 -13.4 4.00e-41 3.81e-37
3 ENSMUSG00000083812 14.6 -25.4 1.80 -14.1 3.82e-45 7.29e-41
4 ENSMUSG00000074445 11.5 -25.0 1.89 -13.3 3.44e-40 2.62e-36
5 ENSMUSG00000094151 11.2 -25.0 1.79 -14.0 2.97e-44 3.77e-40
6 ENSMUSG00000053773 10.6 -24.9 1.99 -12.5 6.66e-36 4.23e-32
The genes at the top of this table have the most negative log2FoldChange values and are downregulated in ASD-type mice.
We can use a similar command to look at the genes with the largest positive log2FoldChange. These are genes that are upregulated in ASD-type mice.
<- arrange(asd_vs_c, desc(log2FoldChange))
asd_vs_c_descending head(asd_vs_c_descending)
# A tibble: 6 × 7
gene baseMean log2FoldChange lfcSE stat pvalue padj
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ENSMUSG00000102375 7.19 24.1 2.29 10.5 6.88e-26 2.91e-22
2 ENSMUSG00000104583 5.77 24.0 2.31 10.4 2.48e-25 9.47e-22
3 ENSMUSG00000042414 6.81 6.42 2.93 2.19 2.85e- 2 1.00e+ 0
4 ENSMUSG00000048763 6.33 6.32 2.30 2.75 6.04e- 3 1.00e+ 0
5 ENSMUSG00000117911 5.18 6.03 2.77 2.18 2.93e- 2 1.00e+ 0
6 ENSMUSG00000085421 4.74 5.90 2.93 2.01 4.42e- 2 1.00e+ 0
When we’re interpreting the log2FoldChange
, remember that you’ll need to do a little bit of math. A log2FoldChange
value of 1 is a fold change value of 2 (2^1 = 2). Similarly, a log2FoldChange
value of 4 is a fold change value of 8 (2^4 = 8)
If there is a two fold increase (fold change = 2, log2FoldChange
= 1) between groups A and B, then the expression in group A is twice as big as the expression in group B.
If there is a two fold decrease (fold change = 0.5, log2FoldChange
= -1) between groups A and B, then the expression in group A is half as big as the expression in group B (or B is twice as big as A).
Arranging dataset based on padj
Maybe you also want to arrange the genes by whether the estimated differential expression is significant. Instead of log2FoldChange
, you will want to look at padj
. You can use the arrange
command to sort the results to put the smallest padj values first.
<- arrange(asd_vs_c, padj)
asd_vs_c_sorted head(asd_vs_c_sorted)
# A tibble: 6 × 7
gene baseMean log2FoldChange lfcSE stat pvalue padj
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ENSMUSG00000089657 16.9 -25.5 1.81 -14.1 2.18e-45 7.29e-41
2 ENSMUSG00000083812 14.6 -25.4 1.80 -14.1 3.82e-45 7.29e-41
3 ENSMUSG00000094151 11.2 -25.0 1.79 -14.0 2.97e-44 3.77e-40
4 ENSMUSG00000102414 14.9 -25.4 1.89 -13.4 4.00e-41 3.81e-37
5 ENSMUSG00000074445 11.5 -25.0 1.89 -13.3 3.44e-40 2.62e-36
6 ENSMUSG00000053773 10.6 -24.9 1.99 -12.5 6.66e-36 4.23e-32
Compare differential gene expression of a single across groups
It might be interesting to look at the expression of a gene across our two brain regions. To do this, we will first load the region-specific datasets.
<- read_csv("https://genomicseducation.org/data/mouse_gutbrain_de_autismVcontrol_in_prefrontalcortex.csv") asd_vs_c_prefrontal
Rows: 55421 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): gene
dbl (6): baseMean, log2FoldChange, lfcSE, stat, pvalue, padj
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
<- read_csv("https://genomicseducation.org/data/mouse_gutbrain_de_autismVcontrol_in_striatum.csv") asd_vs_c_striatum
Rows: 55421 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): gene
dbl (6): baseMean, log2FoldChange, lfcSE, stat, pvalue, padj
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Then we’ll filter out the gene in which we’re interested from each object. Let’s take a look at gene ENSMUSG00000079516, which is the Reg3a gene we previously looked up on MGI.
<- filter(asd_vs_c_prefrontal, gene == "ENSMUSG00000079516")
reg3a_prefrontal
<- filter(asd_vs_c_striatum, gene == "ENSMUSG00000079516") reg3a_striatum
Finally, take a look at the differential expression of reg3a in each region.
In the prefrontal cortex:
reg3a_prefrontal
# A tibble: 1 × 7
gene baseMean log2FoldChange lfcSE stat pvalue padj
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ENSMUSG00000079516 10.6 -22.6 2.11 -10.7 7.30e-27 2.63e-23
In the striatum:
reg3a_striatum
# A tibble: 1 × 7
gene baseMean log2FoldChange lfcSE stat pvalue padj
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ENSMUSG00000079516 0.371 2.27 2.96 0.765 0.444 NA
Creating a volcano plot to visualize differential expression of multiple genes
A volcano plot is a great way to visualize differential gene expression between two datasets. These types of plots have log2FoldChange
on the x-axis and a log transformed padj
on the y-axis.
We will first need to do a log transformation of the padj
variable. We do this in order to spread our values out along the y-axis - otherwise, all the dots that represent genes will be too scrunched together.
$padj_trans <- -log10(asd_vs_c$padj) asd_vs_c
Now, we just create a scatterplot. The code for this is kind of complicated, so you will want to paste the code below into your console.
Here’s the code for your basic volcano plot. data
tells R what data to use, x
tells R what the x-axis values should be, and y
tells R what the y-axis values should be. Notice that we can just use the column names for x
and y
!
ggplot(data = asd_vs_c, aes(x = log2FoldChange, y = padj_trans)) +
geom_point()
Warning: Removed 17318 rows containing missing values or values outside the scale range
(`geom_point()`).
We can also color code the genes (points) which are significantly upregulated or downregulated. In order to do this, we will need to create a new column in our dataset that identifies whether a gene is upregulated in ASD-type mice, downregulated in ASD-type mice, or has the same regulation in ASD-type mice compared to control-type mice.
In this new column, we can decide the thresholds that a log2FoldChange
value must meet in order to be considered upregulated or downregulated. We will consider genes to be upregulated if the log2fold change is greater than 0.6 and downregulated if it is less than -0.6. You can change these thresholds to whatever makes sense for you.
log2foldchange is just a representation of a ratio, so the sign indicates the direction of regulation. It’s important to remember which group is your comparison group!
For example, if you are look at regulation in group A compared to group B, you might have a log2foldchange value of 1 (indicating the gene is expressed twice as much in A as it is in B).
However, if you are looking at regulation in group B compared to group A, the log2foldchange value will be -1 (because the gene is expressed half as much in B as it is in A).
The comparison group matters with ratios!
We are also setting a significant p-value as less than 0.05 (but you can also change this based on your best judgment).
We start by creating a new column called diffexpressed
and setting everything to “NO” (as in “no expression differences between ASD-type and control-type mice”). We then specify the genes for which this column should be changed to “UP” or “DOWN” (based on log2FoldChange
and padj
).
To recap: A gene is labeled “UP” if its log2FoldChange value is greater than 0.6 and its adjusted p-value is less than 0.05.
A gene is labeled “DOWN” if its log2FoldChange value is less than -0.6 and its adjusted p-value is less than 0.05.
$diffexpressed <- "NO"
asd_vs_c
$diffexpressed[asd_vs_c$log2FoldChange > 0.6 & asd_vs_c$padj < 0.05] <- "UP"
asd_vs_c
$diffexpressed[asd_vs_c$log2FoldChange < -0.6 & asd_vs_c$padj < 0.05] <- "DOWN"
asd_vs_c
head(asd_vs_c)
# A tibble: 6 × 9
gene baseMean log2FoldChange lfcSE stat pvalue padj padj_trans
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ENSMUSG00000… 16.9 -25.5 1.81 -14.1 2.18e-45 7.29e-41 40.1
2 ENSMUSG00000… 14.6 -25.4 1.80 -14.1 3.82e-45 7.29e-41 40.1
3 ENSMUSG00000… 11.2 -25.0 1.79 -14.0 2.97e-44 3.77e-40 39.4
4 ENSMUSG00000… 14.9 -25.4 1.89 -13.4 4.00e-41 3.81e-37 36.4
5 ENSMUSG00000… 11.5 -25.0 1.89 -13.3 3.44e-40 2.62e-36 35.6
6 ENSMUSG00000… 10.6 -24.9 1.99 -12.5 6.66e-36 4.23e-32 31.4
# ℹ 1 more variable: diffexpressed <chr>
Okay! We can now add some color to the volcano plot we made earlier. We do this by adding a line to our code that tells R to color the data points based on the diffexpressed
column (col = diffexpressed
). We also add a line about the colors we want to use (scale_color_manual
). In this case, we have chosen to make the points for downregulated genes turquoise and the points for upregulated genes gold. The additional command labels
just creates the labels on the legend.
You can find an extensive list of all the colors available in R at https://www.datanovia.com/en/blog/awesome-list-of-657-r-color-names/.
ggplot(data = asd_vs_c, aes(x = log2FoldChange, y = padj_trans, col = diffexpressed)) +
geom_point() +
scale_color_manual(values = c("turquoise", "grey", "gold"),
labels = c("Downregulated", "Not significant", "Upregulated"))
Warning: Removed 17318 rows containing missing values or values outside the scale range
(`geom_point()`).
We can also label genes of interest in our plot. Let’s label the Reg3a gene (the same one we looked up in the MGI database), as well as Pcdh12 (or protocadherin 12). The two gene IDs we need to know are ENSMUSG00000079516 and ENSMUSG00000024440.
We will create another new column in our dataset. We will then tell R In this new column, the genes in our list (Reg3a and Pcdh12) are named, which everything else will have a missing name. When we remake the volcano plot, R will only label those points that do not have missing names in the gene_label
column.
This is going to require a little bit of R trickery and something called a “case_when” statement.
In R, an “case_when” statement tells R to do something only when a condition is true. In our command, R is going through the asd_vs_c
row by row. For each row, we are telling R to first check whether the geneID of that row is “ENSMUSG00000079516”. If it is, then R will fill in the gene_label
column with the gene symbol “Reg3a”. If it is not, R moves onto the next part of our command.
When R moves onto this second command, it’s now checking to see if the geneID for the row is “ENSMUSG00000024440”. If it is, then R fills in the gene_label
column with the gene symbol “Pcdh12”. If it is not, R moves to the third command in the case_when
statement. The final line tells R to fill in the gene_label
column with “NA” for any row that didn’t match the previous commands. This will tell R in the future that the information for this column is missing.
$gene_label <- case_when(
asd_vs_c$gene == "ENSMUSG00000079516" ~ "Reg3a",
asd_vs_c$gene == "ENSMUSG00000024440" ~ "Pcdh12",
asd_vs_cTRUE ~ NA_character_
)
Great! You can check to see what the new column gene_label
looks like by using the head
command.
head(asd_vs_c)
# A tibble: 6 × 10
gene baseMean log2FoldChange lfcSE stat pvalue padj padj_trans
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ENSMUSG00000… 16.9 -25.5 1.81 -14.1 2.18e-45 7.29e-41 40.1
2 ENSMUSG00000… 14.6 -25.4 1.80 -14.1 3.82e-45 7.29e-41 40.1
3 ENSMUSG00000… 11.2 -25.0 1.79 -14.0 2.97e-44 3.77e-40 39.4
4 ENSMUSG00000… 14.9 -25.4 1.89 -13.4 4.00e-41 3.81e-37 36.4
5 ENSMUSG00000… 11.5 -25.0 1.89 -13.3 3.44e-40 2.62e-36 35.6
6 ENSMUSG00000… 10.6 -24.9 1.99 -12.5 6.66e-36 4.23e-32 31.4
# ℹ 2 more variables: diffexpressed <chr>, gene_label <chr>
Now we just need to remake our volcano plot and add some to code that tells R to label the points we’re interested in. The command label = gene_label
goes right after the code that tells R how to color the points. We also need to add something at the end of our code that tells R to put the labels on top of the plot that we’ve made. That’s the + geom_text_repel(max.overlaps = Inf)
command.
In order for this command to work, we do need to install and load a new library called ggrepel
.
install.packages('ggrepel')
Installing package into '/usr/local/lib/R/site-library'
(as 'lib' is unspecified)
library(ggrepel)
Excellent. Now that we have installed and loaded this package (which you only need to do once for each R session!), we can run the code to create the volcano plot. Remember, we added label = gene_label
to tell R to add labels to the points, and geom_text_repel(max.overlaps = Inf)
to tell R to put the labels on top of the plot so we can see them.
ggplot(data = asd_vs_c, aes(x = log2FoldChange, y = padj_trans, col = diffexpressed, label = gene_label)) +
geom_point() +
scale_color_manual(values = c("turquoise", "grey", "gold"),
labels = c("Downregulated", "Not significant", "Upregulated")) +
geom_text_repel(max.overlaps = Inf)
Warning: Removed 17318 rows containing missing values or values outside the scale range
(`geom_point()`).
Warning: Removed 55419 rows containing missing values or values outside the scale range
(`geom_text_repel()`).
You may want to add a title to your plot. This can be done with the ggtitle
command.
ggplot(data = asd_vs_c, aes(x = log2FoldChange, y = padj_trans, col = diffexpressed, label = gene_label)) +
geom_point() +
scale_color_manual(values = c("turquoise", "grey", "gold"),
labels = c("Downregulated", "Not significant", "Upregulated")) +
geom_text_repel(max.overlaps = Inf) +
ggtitle("Differential Expression in Mouse Prefrontal Cortex")
Warning: Removed 17318 rows containing missing values or values outside the scale range
(`geom_point()`).
Warning: Removed 55419 rows containing missing values or values outside the scale range
(`geom_text_repel()`).
Here’s all the code we used to build a volcano plot in one place so that you can easily run it again (and make modifications).
#log transform the padj values so the y-axis is formatted for plotting
$padj_trans <- -log10(asd_vs_c$padj)
asd_vs_c
#Create the diffexpressed column so that we can color the upregulated and downregulated genes on the plot
$diffexpressed <- "NO"
asd_vs_c
$diffexpressed[asd_vs_c$log2FoldChange > 0.6 & asd_vs_c$padj < 0.05] <- "UP"
asd_vs_c
$diffexpressed[asd_vs_c$log2FoldChange < -0.6 & asd_vs_c$padj < 0.05] <- "DOWN"
asd_vs_c
#Create the gene_label column so that we can label individual points
$gene_label <- case_when(
asd_vs_c$gene == "ENSMUSG00000079516" ~ "Reg3a",
asd_vs_c$gene == "ENSMUSG00000024440" ~ "Pcdh12",
asd_vs_cTRUE ~ NA_character_
)
#Make the final plot
ggplot(data = asd_vs_c, aes(x = log2FoldChange, y = padj_trans, col = diffexpressed, label = gene_label)) +
geom_point() +
scale_color_manual(values = c("turquoise", "grey", "gold"),
labels = c("Downregulated", "Not significant", "Upregulated")) +
geom_text_repel(max.overlaps = Inf) +
ggtitle("Differential Expression in Mouse Prefrontal Cortex")
Warning: Removed 17318 rows containing missing values or values outside the scale range
(`geom_point()`).
Warning: Removed 55419 rows containing missing values or values outside the scale range
(`geom_text_repel()`).
Running a gene set analysis
You can use the special command runClusterProfiler
to figure out the types of processes genes on your gene list are involved in. You can also create a dotplot to visualize your results. (NOTE: this is only possible if you are using the C-MOOR environment in SciServer.)
You created a gene list in the Visualize gene counts of multiple genes across groups section above. The code from this section is repeated below in case you need to run it again.
<- c("ENSMUSG00000028883", "ENSMUSG00000021936", "ENSMUSG00000020598", "ENSMUSG00000020886", "ENSMUSG00000025020", "ENSMUSG00000025958") gene_set
Then we can create a subset of the asd_vs_c
dataset by filtering out any genes that aren’t on our gene list. Remember, we do this using filter
and %in%
.
<- filter(asd_vs_c, asd_vs_c$gene %in% gene_set) asd_vs_c_motor
Finally, create your cluster profile with the runClusterProfiler
and dotplot
commands.
<- runClusterProfiler(asd_vs_c_sig)
asd_vs_c_clusters
dotplot(asd_vs_c_clusters, showCategory=34, title="asd vs control in prefrontal cortex", font.size=10, label_format = 50)