Python API
In this tutorial we will detail the usage of the Python API with real-world data, showing how to integrate tspex with popular packages such as
pandas and seaborn.
Introduction
In this tutorial we are going to use data from the Genotype-Tissue Expression (GTEx) project, which provides a large catalogue of gene expression across 54 human tissues. In order to speed up the calculations and make the figures clearer, we are going to use only five tissues: Bladder, Liver, Lung, Pancreas and Stomach.
import tspex
import pandas as pd
import seaborn as sns
gtex_link = 'https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data/GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_median_tpm.gct.gz'
expression_data = pd.read_csv(gtex_link, sep='\t', index_col='gene_id', skiprows=2)
expression_data = expression_data.loc[:, ['Bladder', 'Liver', 'Lung', 'Pancreas', 'Stomach']]
We'll remove genes that are not expressed in any of the five tissues.
expression_data = expression_data.loc[(expression_data > 0).any(axis=1)]
expression_data.shape
expression_data.head()
The TissueSpecificity class
The usage of the tspex module is centered on the creation and analysis of
TissueSpecificity objects. In order to create these objects, two parameters are required: (1) a pandas DataFrame containing gene expression data and (2) a tissue-specificity metric.
print(tspex.TissueSpecificity.__doc__)
The nonobligatory parameters (
log, transform and threshold) and the object attributes (expression_data and tissue_specificity) will be explained throughout the tutorial.
Tissue-specificity metrics
tspex provides twelve different tissue-specificity metrics that can be divided into two groups:
- General scoring metrics: Describe in a single value how tissue-specific or ubiquitous is a gene across all tissues (
counts, tau, gini, simpson, shannon_specificity, roku_specificity, spm_dpm and js_specificity_dpm).
- Individualized scoring metrics: Quantify how specific is the expression of each gene to each tissue (tsi, zscore, spm and js_specificity).
Example 1: General scoring metric
In this example we are going to quantify the tissue-specificity of each gene using the Tau metric, which is a general scoring metric. Therefore, we will get a single tissue-specificity score per gene.
When creating a
TissueSpecificity object, you can choose to log-transform the expression values before quantifying tissue-specificity.
tso = tspex.TissueSpecificity(expression_data, 'tau', log=True)
The log-transformed expression matrix is stored in the
expression_data attribute, which is a regular pandas DataFrame.
tso.expression_data.head()
The computed tissue-specificity values are stored in the
tissue_specificity attribute that, in this case, is an one-dimensional pandas Series.
tso.tissue_specificity.head()
Any of the Series' methods can be used to inspect
tissue_specificity.
tso.tissue_specificity.quantile([0.5, 0.75, 0.95])
Two visualization methods are available in tspex:
plot_histogram and plot_heatmap.
plot_histogram can be used to verify the distribution of the tissue-specificity values.
tso.plot_histogram()
Whereas
plot_heatmap may be used to visualize the amount of genes that are specific for each tissue.
tso.plot_heatmap(threshold=0.8, sort_genes=True, use_zscore=True, gene_names=False)
Example 2: Individualized scoring metric
In some cases we may be interested in discovering genes that are specifically expressed in one or more tissues of interest. In this situation, general scoring metrics are not be the best choice, as they would only tell us how tissue-specific is each gene, regardless of the tissue.
Accordingly, in this example we are going to quantify the tissue-specificity of each gene using the SPM metric, an individualized scoring metric.
tso = tspex.TissueSpecificity(expression_data, 'spm', log=True)
As the SPM metric computes how specific each gene is in each tissue, the
tissue_specificity attribute is a two-dimensional pandas DataFrame.
tso.tissue_specificity.head()
To inspect the distribution of the tissue-specificity values separately for each tissue, we can use the
violinplot function from the seaborn package.
sns.violinplot(x="variable", y="value", inner=None, scale="count", color='C0',
data=tso.tissue_specificity.melt());
The
query method can be used to select genes that satisfy certain criteria. In this case, we are selecting genes that are preferentially expressed in the bladder and lung.
selected_genes_specificity = tso.tissue_specificity.query('Bladder >= 0.6 & Lung >= 0.6')
selected_genes_specificity.shape
selected_genes_expression = tso.expression_data.reindex(selected_genes_specificity.index)
selected_genes_specificity.head()
sns.clustermap(selected_genes_expression, figsize=(6,6), z_score=0);
Example 3: The transform parameter
When a
TissueSpecificity object is created, the computed tissue-specificity values are transformed so that they fall within 0 (perfectly ubiquitous expression) and 1 (perfectly tissue-specific expression). If the user wants to obtain the untransformed values given by the chosen tissue-specificity metric, he can set the transform argument to False. The following metrics are affected by changes in this parameter: gini, simpson, shannon_specificity, roku_specificity and zscore.
For instance, untransformed values obtained with the
roku_specificity method fall within 0 and log2(number of tissues).
tso = tspex.TissueSpecificity(expression_data, 'roku_specificity', log=True)
tso.tissue_specificity.max()
tso.plot_histogram()
tso = tspex.TissueSpecificity(expression_data, 'roku_specificity', log=True, transform=False)
tso.tissue_specificity.max()
tso.plot_histogram()
Example 4: The threshold parameter of the counts metric
The
counts metric quantifies tissue-specificity as the proportion of tissues above a given expression threshold. Consequently, the threshold parameter directly influences the results obtained when using this method.
In this example we will first set
threshold to 5. Therefore, genes will be considered expressed in any given tissue if their expression is above this value.
tso = tspex.TissueSpecificity(expression_data, 'counts', threshold=5)
sum(tso.tissue_specificity >= 0.8)
tso.plot_heatmap(0.8, sort_genes=True, use_zscore=True, gene_names=False)
If we increase
threshold to 50 fewer genes will be selected.
tso = tspex.TissueSpecificity(expression_data, 'counts', threshold=50)
sum(tso.tissue_specificity >= 0.8)
tso.plot_heatmap(0.8, sort_genes=True, use_zscore=True, gene_names=False)