Python API
In this tutorial we will detail the usage of the Python API with real-world data, showing how to integrate tspex with popular packages such as
pandas
and seaborn
.
Introduction
In this tutorial we are going to use data from the Genotype-Tissue Expression (GTEx) project, which provides a large catalogue of gene expression across 54 human tissues. In order to speed up the calculations and make the figures clearer, we are going to use only five tissues: Bladder, Liver, Lung, Pancreas and Stomach.
import tspex
import pandas as pd
import seaborn as sns
gtex_link = 'https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data/GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_median_tpm.gct.gz'
expression_data = pd.read_csv(gtex_link, sep='\t', index_col='gene_id', skiprows=2)
expression_data = expression_data.loc[:, ['Bladder', 'Liver', 'Lung', 'Pancreas', 'Stomach']]
We'll remove genes that are not expressed in any of the five tissues.
expression_data = expression_data.loc[(expression_data > 0).any(axis=1)]
expression_data.shape
expression_data.head()
The TissueSpecificity
class
The usage of the tspex module is centered on the creation and analysis of
TissueSpecificity
objects. In order to create these objects, two parameters are required: (1) a pandas
DataFrame containing gene expression data and (2) a tissue-specificity metric.
print(tspex.TissueSpecificity.__doc__)
The nonobligatory parameters (
log
, transform
and threshold
) and the object attributes (expression_data
and tissue_specificity
) will be explained throughout the tutorial.
Tissue-specificity metrics
tspex provides twelve different tissue-specificity metrics that can be divided into two groups:
- General scoring metrics: Describe in a single value how tissue-specific or ubiquitous is a gene across all tissues (
counts
, tau
, gini
, simpson
, shannon_specificity
, roku_specificity
, spm_dpm
and js_specificity_dpm
).
- Individualized scoring metrics: Quantify how specific is the expression of each gene to each tissue (tsi
, zscore
, spm
and js_specificity
).
Example 1: General scoring metric
In this example we are going to quantify the tissue-specificity of each gene using the Tau metric, which is a general scoring metric. Therefore, we will get a single tissue-specificity score per gene.
When creating a
TissueSpecificity
object, you can choose to log-transform the expression values before quantifying tissue-specificity.
tso = tspex.TissueSpecificity(expression_data, 'tau', log=True)
The log-transformed expression matrix is stored in the
expression_data
attribute, which is a regular pandas
DataFrame.
tso.expression_data.head()
The computed tissue-specificity values are stored in the
tissue_specificity
attribute that, in this case, is an one-dimensional pandas
Series.
tso.tissue_specificity.head()
Any of the Series' methods can be used to inspect
tissue_specificity
.
tso.tissue_specificity.quantile([0.5, 0.75, 0.95])
Two visualization methods are available in tspex:
plot_histogram
and plot_heatmap
.
plot_histogram
can be used to verify the distribution of the tissue-specificity values.
tso.plot_histogram()
Whereas
plot_heatmap
may be used to visualize the amount of genes that are specific for each tissue.
tso.plot_heatmap(threshold=0.8, sort_genes=True, use_zscore=True, gene_names=False)
Example 2: Individualized scoring metric
In some cases we may be interested in discovering genes that are specifically expressed in one or more tissues of interest. In this situation, general scoring metrics are not be the best choice, as they would only tell us how tissue-specific is each gene, regardless of the tissue.
Accordingly, in this example we are going to quantify the tissue-specificity of each gene using the SPM metric, an individualized scoring metric.
tso = tspex.TissueSpecificity(expression_data, 'spm', log=True)
As the SPM metric computes how specific each gene is in each tissue, the
tissue_specificity
attribute is a two-dimensional pandas
DataFrame.
tso.tissue_specificity.head()
To inspect the distribution of the tissue-specificity values separately for each tissue, we can use the
violinplot
function from the seaborn
package.
sns.violinplot(x="variable", y="value", inner=None, scale="count", color='C0',
data=tso.tissue_specificity.melt());
The
query
method can be used to select genes that satisfy certain criteria. In this case, we are selecting genes that are preferentially expressed in the bladder and lung.
selected_genes_specificity = tso.tissue_specificity.query('Bladder >= 0.6 & Lung >= 0.6')
selected_genes_specificity.shape
selected_genes_expression = tso.expression_data.reindex(selected_genes_specificity.index)
selected_genes_specificity.head()
sns.clustermap(selected_genes_expression, figsize=(6,6), z_score=0);
Example 3: The transform
parameter
When a
TissueSpecificity
object is created, the computed tissue-specificity values are transformed so that they fall within 0 (perfectly ubiquitous expression) and 1 (perfectly tissue-specific expression). If the user wants to obtain the untransformed values given by the chosen tissue-specificity metric, he can set the transform
argument to False
. The following metrics are affected by changes in this parameter: gini
, simpson
, shannon_specificity
, roku_specificity
and zscore
.
For instance, untransformed values obtained with the
roku_specificity
method fall within 0 and log2(number of tissues).
tso = tspex.TissueSpecificity(expression_data, 'roku_specificity', log=True)
tso.tissue_specificity.max()
tso.plot_histogram()
tso = tspex.TissueSpecificity(expression_data, 'roku_specificity', log=True, transform=False)
tso.tissue_specificity.max()
tso.plot_histogram()
Example 4: The threshold
parameter of the counts
metric
The
counts
metric quantifies tissue-specificity as the proportion of tissues above a given expression threshold. Consequently, the threshold
parameter directly influences the results obtained when using this method.
In this example we will first set
threshold
to 5. Therefore, genes will be considered expressed in any given tissue if their expression is above this value.
tso = tspex.TissueSpecificity(expression_data, 'counts', threshold=5)
sum(tso.tissue_specificity >= 0.8)
tso.plot_heatmap(0.8, sort_genes=True, use_zscore=True, gene_names=False)
If we increase
threshold
to 50 fewer genes will be selected.
tso = tspex.TissueSpecificity(expression_data, 'counts', threshold=50)
sum(tso.tissue_specificity >= 0.8)
tso.plot_heatmap(0.8, sort_genes=True, use_zscore=True, gene_names=False)