What it does
Given a counts matrix, or a set of counts files, for example from featureCounts, and optional information about the genes, this tool
produces plots and tables useful in the analysis of differential gene expression.
This tool uses the edgeR quasi-likelihood pipeline (edgeR-quasi) for differential expression analysis. This statistical methodology uses negative binomial generalized linear models, but with F-tests instead of likelihood ratio tests. This method provides stricter error rate control than other negative binomial-based pipelines, including the traditional edgeR pipelines or DESeq2. While the limma pipelines are recommended for large-scale datasets because of their speed and flexibility, the edgeR-quasi pipeline gives better performance in low-count situations. For the data analyzed in the edgeR workflow article, the edgeR-quasi, limma-voom and limma-trend pipelines are all equally suitable and give similar results.
Inputs
Counts Data:
The counts data can either be input as separate counts files (one sample per file) or a single count matrix (one sample per column). The rows correspond to genes, and columns correspond to the counts for the samples. Values must be tab separated, with the first row containing the sample/column labels and the first column containing the row/gene labels. Gene identifiers can be of any type but must be unique and not repeated within a counts file.
Example - Separate Count Files:
GeneID | WT1
11287 | 1699
11298 | 1905
11302 | 6
11303 | 2099
11304 | 356
11305 | 2528
Example - Single Count Matrix:
GeneID | WT1 | WT2 | WT3 | Mut1 | Mut2 | Mut3
11287 | 1699 | 1528 | 1601 | 1463 | 1441 | 1495
11298 | 1905 | 1744 | 1834 | 1345 | 1291 | 1346
11302 | 6 | 8 | 7 | 5 | 6 | 5
11303 | 2099 | 1974 | 2100 | 1574 | 1519 | 1654
11304 | 356 | 312 | 337 | 361 | 397 | 346
11305 | 2528 | 2438 | 2493 | 1762 | 1942 | 2027
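The input rules above (tab-separated values, a header row of sample labels, gene IDs in the first column, no repeated IDs) can be checked before running the tool. The following is a minimal Python sketch of such a check, not the tool's own parser; the function name read_count_matrix is hypothetical and the two data rows are taken from the example matrix.

```python
import csv
import io


def read_count_matrix(handle):
    """Parse a tab-separated count matrix.

    The first row holds the sample labels and the first column holds the
    gene IDs. Returns (sample labels, {gene ID: list of counts}).
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]
    counts = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        gene_id, values = row[0], row[1:]
        if gene_id in counts:
            raise ValueError(f"duplicate gene ID: {gene_id}")
        if len(values) != len(samples):
            raise ValueError(
                f"row {gene_id} has {len(values)} values, expected {len(samples)}"
            )
        counts[gene_id] = [int(v) for v in values]
    return samples, counts


# Two rows of the example matrix, written as tab-separated text.
example = (
    "GeneID\tWT1\tWT2\tWT3\tMut1\tMut2\tMut3\n"
    "11287\t1699\t1528\t1601\t1463\t1441\t1495\n"
    "11302\t6\t8\t7\t5\t6\t5\n"
)
samples, counts = read_count_matrix(io.StringIO(example))
```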
Factor Information:
Enter factor names and groups in the tool form, or provide a tab-separated file that has the samples in the same order as listed in the columns of the counts matrix. The second column should contain the primary factor levels (e.g. WT, Mut) with optional additional columns for any secondary factors.
Example:
Sample | Genotype | Batch
WT1 | WT | b1
WT2 | WT | b2
WT3 | WT | b3
Mut1 | Mut | b1
Mut2 | Mut | b2
Mut3 | Mut | b3
Factor Name: The name of the experimental factor being investigated, e.g. Genotype, Treatment. One factor must be entered and spaces must not be used. Optionally, additional factors can be included. These are variables that might influence your experiment, e.g. Batch, Gender, Subject. If additional factors are entered, edgeR will fit an additive linear model.
Groups: The names of the groups for the factor. These must be entered in the same order as the samples (to which the groups correspond) are listed in the columns of the counts matrix. Spaces must not be used and if entered into the tool form above, the values should be separated by commas.
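The ordering and naming rules for factors can be verified with a few lines of code. The sketch below is illustrative only and is not part of the tool; check_factor_table is a hypothetical helper that takes the already-split rows of a factor file and the sample labels from the counts matrix.

```python
def check_factor_table(count_samples, factor_header, factor_rows):
    """Validate a factor table against the count matrix columns.

    Returns {factor name: [level for each sample, in column order]}.
    """
    factor_samples = [row[0] for row in factor_rows]
    if factor_samples != list(count_samples):
        raise ValueError(
            "samples in the factor file must be in the same order "
            "as the columns of the counts matrix"
        )
    factors = {}
    # Column 1 is the primary factor; later columns are secondary factors.
    for col, name in enumerate(factor_header[1:], start=1):
        levels = [row[col] for row in factor_rows]
        for value in [name] + levels:
            if " " in value:
                raise ValueError(f"spaces are not allowed: {value!r}")
        factors[name] = levels
    return factors


header = ["Sample", "Genotype", "Batch"]
rows = [
    ["WT1", "WT", "b1"], ["WT2", "WT", "b2"], ["WT3", "WT", "b3"],
    ["Mut1", "Mut", "b1"], ["Mut2", "Mut", "b2"], ["Mut3", "Mut", "b3"],
]
factors = check_factor_table(
    ["WT1", "WT2", "WT3", "Mut1", "Mut2", "Mut3"], header, rows
)
```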
Gene Annotations:
Optional input for gene annotations. This file can contain more information about the genes than just an ID number. The annotations will be available in the differential expression results table and the optional normalised counts table. The file must contain a header row and have the gene IDs in the first column. The number of rows should match that of the counts files; add NA for any gene IDs with no annotation. The Galaxy tool annotateMyIDs can be used to obtain annotations for human, mouse, fly and zebrafish.
Example:
GeneID | Symbol | GeneName
11287 | Pzp | pregnancy zone protein
11298 | Aanat | arylalkylamine N-acetyltransferase
11302 | Aatk | apoptosis-associated tyrosine kinase
11303 | Abca1 | ATP-binding cassette, sub-family A (ABC1), member 1
11304 | Abca4 | ATP-binding cassette, sub-family A (ABC1), member 4
11305 | Abca2 | ATP-binding cassette, sub-family A (ABC1), member 2
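If an annotation source does not cover every gene, the file can be padded so that it has one row per gene in the counts data. The following Python sketch shows one way to do this; align_annotations is a hypothetical helper, and the gene ID 99999 is invented to show the NA padding.

```python
def align_annotations(gene_ids, annotations, n_fields):
    """Return one annotation row per gene ID, in the order of the counts data.

    annotations maps gene ID -> list of annotation fields. Genes without
    an entry get "NA" in every field.
    """
    return [[g] + annotations.get(g, ["NA"] * n_fields) for g in gene_ids]


annotations = {
    "11287": ["Pzp", "pregnancy zone protein"],
    "11298": ["Aanat", "arylalkylamine N-acetyltransferase"],
}
# "99999" is a made-up ID with no annotation.
table = align_annotations(["11287", "99999", "11298"], annotations, n_fields=2)
```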
Contrasts of Interest:
The contrasts you wish to make between levels.
A common contrast would be a simple difference between two levels: "Mut-WT"
represents the difference between the mutant and wild type genotypes.
Multiple contrasts must be entered separately using the Insert Contrast button. Spaces must not be used.
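To make the meaning of a contrast such as "Mut-WT" concrete, the sketch below turns it into one coefficient per group level: +1 for the first level and -1 for the second. This is a simplified Python illustration, not how the tool parses contrasts; contrast_vector is a hypothetical name, and the sketch assumes a simple two-level difference with no hyphens inside level names.

```python
def contrast_vector(contrast, levels):
    """Turn a simple 'A-B' contrast into {level: coefficient}."""
    if " " in contrast:
        raise ValueError("contrasts must not contain spaces")
    parts = contrast.split("-")
    if len(parts) != 2:
        raise ValueError("this sketch only handles contrasts of the form A-B")
    plus, minus = parts
    for level in (plus, minus):
        if level not in levels:
            raise ValueError(f"unknown level: {level}")
    # +1 for the first level, -1 for the second, 0 for everything else.
    return {lv: (1 if lv == plus else -1 if lv == minus else 0) for lv in levels}


coefficients = contrast_vector("Mut-WT", ["WT", "Mut"])
```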
Filter Low Counts:
Genes with very low counts across all libraries provide little evidence for differential expression.
From a biological point of view, a gene must be expressed at some minimal level before
it is likely to be translated into a protein or to be biologically important. In addition, the
pronounced discreteness of these counts interferes with some of the statistical approximations
that are used later in the pipeline. These genes should be filtered out prior to further
analysis.
As a rule of thumb, genes are dropped if they can’t possibly be expressed in all the samples
for any of the conditions. Users can set their own definition of genes being expressed. Usually
a gene is required to have a count of 5-10 in a library to be considered expressed in that
library. Users should also filter on counts-per-million (CPM) rather than on the
counts directly, as the latter does not account for differences in library sizes between samples.
This option ignores genes that do not show significant levels of expression. The filtering depends on two criteria: CPM/count and number of samples. You can filter on CPM (Minimum CPM) or count (Minimum Count) values:
- Minimum CPM: This is the minimum count per million that a gene must have in at
least the number of samples specified under Minimum Samples.
- Minimum Count: This is the minimum count that a gene must have. It can be combined with either Filter
on Total Count or Minimum Samples.
- Filter on Total Count: This can be used with the Minimum Count filter to keep genes
with a minimum total read count.
- Minimum Samples: This is the number of samples in which the Minimum CPM/Count
requirement must be met in order for that gene to be kept.
If the Minimum Samples filter is applied, only genes that exhibit a CPM/count greater than the required amount in at least the number of samples specified will be used for analysis. Care should be taken to
ensure that the sample requirement is appropriate. In the case of an experiment
with two experimental groups each with two members, if there is a change from
insignificant CPM/count to significant CPM/count but the sample requirement is set to 3,
then this will cause that gene to fail the criteria. When in doubt simply do not
filter or consult the edgeR workflow article for filtering recommendations.
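The combination of Minimum CPM and Minimum Samples can be illustrated with a short calculation. The Python sketch below is not the tool's code: the function names, library sizes and counts are made up, and it assumes that a sample passes when its CPM is at least the minimum.

```python
def cpm(counts, lib_sizes):
    """Counts per million for one gene, given each sample's library size."""
    return [c * 1e6 / n for c, n in zip(counts, lib_sizes)]


def keep_gene(counts, lib_sizes, min_cpm, min_samples):
    """Keep a gene if its CPM reaches min_cpm in at least min_samples samples."""
    n_pass = sum(1 for value in cpm(counts, lib_sizes) if value >= min_cpm)
    return n_pass >= min_samples


# Hypothetical library sizes for two groups of two samples each.
lib_sizes = [10_000_000, 20_000_000, 10_000_000, 20_000_000]

# Expressed only in the first group (CPM 1.5, 2.0, 0.3, 0.1).
gene_a = [15, 40, 3, 2]

kept_with_2 = keep_gene(gene_a, lib_sizes, min_cpm=1, min_samples=2)
kept_with_3 = keep_gene(gene_a, lib_sizes, min_cpm=1, min_samples=3)
```

With a sample requirement of 2 the gene is kept, but with a requirement of 3 it is dropped even though it is clearly expressed in one group. This is the situation the warning above describes.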
Advanced Options:
By default, the error rate for multiple testing is controlled using Benjamini and
Hochberg's false discovery rate method at a threshold value of 0.05. However,
there are options to change this to custom values.
- Minimum log2-fold-change Required:
In addition to meeting the requirement for the adjusted statistic for
multiple testing, the observation must have an absolute log2-fold-change
greater than this threshold to be considered significant, thus highlighted
in the MD plot.
- Adjusted Threshold:
Set the threshold for the adjusted p-value produced by the multiple testing
control method. Only observations whose adjusted p-value falls below this value
are considered significant, and thus highlighted in the MD plot.
- P-Value Adjustment Method:
Change the multiple testing control method, the options are BH(1995) and
BY(2001) which are both false discovery rate controls. There is also
Holm(1979) which is a method for family-wise error rate control.
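The default settings can be sketched as follows: Benjamini-Hochberg adjustment of the p-values, followed by the two significance criteria described above. This is a plain Python illustration of the standard BH procedure, not the tool's implementation; the function names and the example p-values are hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def is_significant(adj_p, log2_fc, threshold=0.05, min_lfc=0.0):
    """Adjusted p-value below the threshold and |log2FC| above the minimum."""
    return adj_p < threshold and abs(log2_fc) > min_lfc


adjusted = bh_adjust([0.01, 0.04, 0.03, 0.005])
```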
Normalisation Method:
The most obvious technical factor that affects the read counts, other than gene expression
levels, is the sequencing depth of each RNA sample. edgeR adjusts any differential expression
analysis for varying sequencing depths as represented by differing library sizes. This is
part of the basic modeling procedure and flows automatically into fold-change or p-value
calculations. It is always present, and doesn’t require any user intervention.
The second most important technical influence on differential expression is one that is less
obvious. RNA-seq provides a measure of the relative abundance of each gene in each RNA
sample, but does not provide any measure of the total RNA output on a per-cell basis.
This commonly becomes important when a small number of genes are very highly expressed
in one sample, but not in another. The highly expressed genes can consume a substantial
proportion of the total library size, causing the remaining genes to be under-sampled in that
sample. Unless this RNA composition effect is adjusted for, the remaining genes may falsely
appear to be down-regulated in that sample. The edgeR calcNormFactors function normalizes for RNA composition by finding a set of scaling factors for the library sizes that minimize the log-fold changes between the samples for most genes. The default method for computing these scale factors uses a trimmed mean of M values (TMM) between each pair of samples. We call the product of the original library size and the scaling factor the effective library size. The effective library size replaces the original library size in all downstream analyses. TMM is the recommended method for most RNA-Seq data where the majority (more than half) of the genes are believed not to be differentially expressed between any pair of the samples. You can change the normalisation method under Advanced Options above. For more information, see the calcNormFactors section in the edgeR User's Guide.
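A toy calculation shows the composition effect and how an effective library size corrects it. The Python sketch below does not implement TMM. The counts and the scaling factors are invented for illustration, and the factor of 0.5 is simply chosen to match the example rather than estimated from data.

```python
def effective_library_sizes(lib_sizes, norm_factors):
    """Effective library size = original library size x scaling factor."""
    return [n * f for n, f in zip(lib_sizes, norm_factors)]


def cpm(count, lib_size):
    """Counts per million for a single count."""
    return count * 1e6 / lib_size


# Two samples sequenced to the same depth (400 reads in this toy example).
# In sample B one gene is so highly expressed that an unchanged gene
# drops from 100 reads to 50.
lib_sizes = [400, 400]
unchanged_gene = [100, 50]

# Using the raw library sizes, the unchanged gene appears halved in sample B.
naive = [cpm(c, n) for c, n in zip(unchanged_gene, lib_sizes)]

# Hypothetical scaling factors; calcNormFactors would estimate these.
eff = effective_library_sizes(lib_sizes, [1.0, 0.5])
corrected = [cpm(c, n) for c, n in zip(unchanged_gene, eff)]
```

With the effective library sizes the unchanged gene has the same CPM in both samples, so it no longer appears down-regulated.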
Robust Settings
Option to use robust settings. Using robust settings (robust=TRUE) with the edgeR estimateDisp and glmQLFit functions is usually recommended to protect against outlier genes. This is turned on by default. Note that it is only used with the quasi-likelihood F test method. For more information, see the edgeR workflow article.
Test Method
Option to use the likelihood ratio test instead of the quasi-likelihood F test. For more information, see the edgeR workflow article.
Outputs
This tool outputs
- a table of differentially expressed genes for each contrast of interest
- an HTML report with plots and additional information
Optionally, under Output Options you can choose to output
- a normalised counts table
- the R script used by this tool
- an RData file
Citations
Please try to cite the appropriate articles when you publish results obtained using software, as such citation is the main means by which the authors receive credit for their work. For the edgeR method itself, please cite Robinson et al., 2010, and for this tool (which was developed from the Galaxy limma-voom tool) please cite Liu et al., 2015.