Author: Hui Ma, Yiming Yang, Rimte Rocher
Date: 2022-03-09
Notebook Source: plotting_tutorial.ipynb
import pegasus as pg
For this plotting tutorial, we provide an analysis result of a gene-count matrix dataset on human bone marrow from 8 donors. You can download the data from https://storage.googleapis.com/terra-featured-workspaces/Cumulus/MantonBM_result.zarr.zip, or use gsutil to copy it from its Google bucket URL (gs://terra-featured-workspaces/Cumulus/MantonBM_result.zarr.zip).
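If you prefer to stay in Python, the download can be sketched with the standard library alone. The helper name `download_if_missing` is hypothetical (it is not part of Pegasus); it skips the transfer when the target file already exists. The URL is the one given above.

```python
import os
import shutil
import urllib.request

# URL taken from the tutorial text above.
DATA_URL = (
    "https://storage.googleapis.com/terra-featured-workspaces/"
    "Cumulus/MantonBM_result.zarr.zip"
)


def download_if_missing(url, dest):
    """Download `url` to `dest` unless `dest` already exists; return `dest`.

    Hypothetical helper for illustration, not a Pegasus function.
    """
    if os.path.exists(dest):
        return dest
    # Write to a temporary name first so an interrupted transfer
    # never leaves a truncated file under the final name.
    tmp = dest + ".part"
    with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out)
    os.replace(tmp, dest)
    return dest


# Example call (commented out so that nothing is downloaded by accident):
# download_if_missing(DATA_URL, "MantonBM_result.zarr.zip")
```

Writing to a `.part` file and renaming it afterwards means a later run will not mistake a partial download for the complete archive.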
After downloading, load the file using the Pegasus read_input function:
data = pg.read_input("MantonBM_result.zarr.zip")
data
2022-03-09 00:07:00,912 - pegasusio.readwrite - INFO - zarr file 'MantonBM_result.zarr.zip' is loaded.
2022-03-09 00:07:00,913 - pegasusio.readwrite - INFO - Function 'read_input' finished in 0.85s.
MultimodalData object with 1 UnimodalData: 'GRCh38-rna'
    It currently binds to UnimodalData object GRCh38-rna

UnimodalData object with n_obs x n_vars = 35465 x 25653
    Genome: GRCh38; Modality: rna
    It contains 2 matrices: 'X', 'raw.X'
    It currently binds to matrix 'X' as X

    obs: 'n_genes', 'Channel', 'gender', 'n_counts', 'percent_mito', 'scale', 'louvain_labels'(cluster), 'anno'
    var: 'featureid', 'n_cells', 'percent_cells', 'robust', 'highly_variable_features', 'mean', 'var', 'hvf_loess', 'hvf_rank'
    obsm: 'X_diffmap', 'X_fle'(basis), 'X_pca', 'X_pca_harmony', 'X_phi', 'X_tsne'(basis), 'X_umap'(basis), 'diffmap_knn_distances'(knn), 'diffmap_knn_indices'(knn), 'pca_harmony_knn_distances'(knn), 'pca_harmony_knn_indices'(knn)
    varm: 'means', 'partial_sum', 'de_res'
    obsp: 'W_diffmap', 'W_pca_harmony'
    uns: 'genome', 'louvain_resolution', 'modality', 'norm_count', 'pca_features', 'stdzn_max_value', 'PCs', 'diffmap_evals', 'ncells', 'stdzn_mean', 'stdzn_std', '_attr2type', 'df_qcplot', 'pca'
In the following sections, we'll cover Pegasus plotting functions using this dataset. Moreover, for gene plots, the canonical gene markers below will be used:
marker_genes = ['CD38', 'JCHAIN', 'FCGR3A', 'HLA-DPA1', 'CD14', 'CD79A', 'MS4A1', 'CD34', 'TRAC', 'CD3D', 'CD8A',
'CD8B', 'GYPA', 'NKG7', 'CD4', 'SELL', 'CCR7']
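Before plotting, it can help to confirm that every marker is actually present among the genes of the dataset. The sketch below uses a hypothetical helper, `split_markers`, and a toy gene list; with the real data one would presumably pass `marker_genes` and `data.var_names` instead.

```python
def split_markers(markers, available):
    """Return (present, missing) lists, preserving the order of `markers`.

    Hypothetical helper for illustration, not a Pegasus function.
    """
    available = set(available)
    present = [g for g in markers if g in available]
    missing = [g for g in markers if g not in available]
    return present, missing


# Toy gene list standing in for the gene names of the dataset.
toy_genes = ["CD14", "CD34", "CD3D", "ACTB"]
present, missing = split_markers(["CD14", "CD3D", "GYPA"], toy_genes)
print(present)  # ['CD14', 'CD3D']
print(missing)  # ['GYPA']
```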
The first step in preprocessing is to perform quality control analysis and remove low-quality cells and genes.
pg.qcviolin shows the effect of quality control by presenting violin plots of the cell distribution before and after filtration.
plot_type='gene' shows the number of expressed genes per cell before and after filtration.
pg.qcviolin(data, plot_type='gene', dpi=100)
Quality control statistics on the percentage of UMIs from mitochondrial genes:
pg.qcviolin(data, plot_type='mito', dpi=100)
The number of UMIs before and after filtration is also an important aspect of quality control.
pg.qcviolin(data, plot_type='count', dpi=100)
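To make the three plotted quantities concrete, here is a minimal plain-Python sketch of how per-cell QC statistics could be derived from a small dense count matrix. It is not the Pegasus implementation: the function names, the toy matrix, and the filtering thresholds are all assumptions chosen for illustration. The field names mirror the `n_genes`, `n_counts`, and `percent_mito` columns listed in `obs` above.

```python
def qc_metrics(counts, gene_names, mito_prefix="MT-"):
    """Compute per-cell QC statistics from a dense cells x genes count matrix.

    Returns one dict per cell with n_genes, n_counts and percent_mito.
    """
    # Column indices of mitochondrial genes, identified by name prefix.
    mito_idx = [j for j, g in enumerate(gene_names) if g.startswith(mito_prefix)]
    stats = []
    for row in counts:
        n_counts = sum(row)                      # total UMIs in the cell
        n_genes = sum(1 for c in row if c > 0)   # genes with nonzero count
        mito = sum(row[j] for j in mito_idx)
        percent_mito = 100.0 * mito / n_counts if n_counts > 0 else 0.0
        stats.append(
            {"n_genes": n_genes, "n_counts": n_counts, "percent_mito": percent_mito}
        )
    return stats


def passes_qc(cell, min_genes, max_genes, max_percent_mito):
    """Apply hypothetical thresholds; real values depend on the dataset."""
    return (
        min_genes <= cell["n_genes"] < max_genes
        and cell["percent_mito"] < max_percent_mito
    )


genes = ["CD14", "CD3D", "MT-CO1", "MT-ND1"]
counts = [
    [5, 0, 1, 0],  # mostly nuclear-gene UMIs
    [1, 0, 6, 3],  # dominated by mitochondrial UMIs
    [0, 0, 0, 0],  # no UMIs at all
]
stats = qc_metrics(counts, genes)
kept = [
    i for i, s in enumerate(stats)
    if passes_qc(s, min_genes=1, max_genes=4, max_percent_mito=20.0)
]
print(kept)  # [0]
```

The violin plots compare the distribution of each statistic over all cells with its distribution over the cells that survive filtering, which corresponds to `kept` in this toy example.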
Highly Variable Genes (HVG) are more likely to convey information discriminating different cell types and states. Thus, rather than considering all genes, people usually focus on selected HVGs for downstream analyses.
Pegasus provides the hvfplot function to generate a scatter plot of genes after HVG selection. This plot only works for Pegasus-flavor HVGs (i.e. flavor='pegasus' in the Pegasus highly_variable_features function).
After selecting 2000 HVGs using the Pegasus selection method, the plot below is generated. Each point represents one gene. Blue points are the selected HVGs, which account for the majority of variation in the dataset. By default, the plot labels the top 20 HVGs. You can change this number with the top_n parameter.
pg.hvfplot(data, dpi=200)
A composition plot is a bar plot showing the cell composition (under different conditions) of each cluster. The example below shows the composition of different samples in each Louvain cluster:
fig = pg.compo_plot(data, 'louvain_labels', 'Channel', style = 'frequency')
Composition plots are useful for quickly assessing library quality and batch effects.
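The numbers behind such a plot can be sketched in a few lines of plain Python. This is an illustration only, under the assumption that the 'frequency' style reports, within each cluster, the percentage of cells coming from each condition; the helper `composition` and the toy labels are hypothetical.

```python
from collections import Counter, defaultdict


def composition(clusters, conditions):
    """Return {cluster: {condition: percent of that cluster's cells}}."""
    counts = defaultdict(Counter)
    for cluster, condition in zip(clusters, conditions):
        counts[cluster][condition] += 1
    result = {}
    for cluster, counter in counts.items():
        total = sum(counter.values())
        result[cluster] = {cond: 100.0 * n / total for cond, n in counter.items()}
    return result


# Toy labels standing in for the 'louvain_labels' and 'Channel' columns.
clusters = ["1", "1", "1", "1", "2", "2"]
channels = ["A", "A", "A", "B", "A", "B"]
table = composition(clusters, channels)
print(table["1"])  # {'A': 75.0, 'B': 25.0}
print(table["2"])  # {'A': 50.0, 'B': 50.0}
```

A cluster made up almost entirely of cells from one channel, as cluster "1" leans toward here, is the kind of pattern that may point to a batch effect or a library of poor quality.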
A scatter plot requires at least 2 parameters: the data attributes to plot and the basis (embedding) on which to plot them.
For this demonstration, we select the annotation ('anno') and channel ('Channel') as data attributes, and tsne as the basis.