Author: Bo Li
Date: 2022-03-09
Notebook Source: spatial_analysis.ipynb
This tutorial runs analysis on a 10x Visium mouse brain section dataset with Pegasus.
import numpy as np
import pandas as pd
import pegasusio as io
import pegasus as pg
You can download the data at https://storage.googleapis.com/terra-featured-workspaces/Cumulus/mouse_brain_10x.tar.gz.
After downloading, unzip the tar ball file and load the data into memory:
data = io.read_input('mouse_brain_10x', file_type='visium')
data
2022-03-09 14:44:15,815 - pegasusio.readwrite - INFO - visium file 'mouse_brain_10x' is loaded. 2022-03-09 14:44:15,816 - pegasusio.readwrite - INFO - Function 'read_input' finished in 1.03s.
MultimodalData object with 1 UnimodalData: 'mm10-visium' It currently binds to SpatialData object mm10-visium SpatialData object with n_obs x n_vars = 2702 x 32285 Genome: mm10; Modality: visium It contains 1 matrix: 'X' It currently binds to matrix 'X' as X obs: 'in_tissue', 'array_row', 'array_col' var: 'featureid' obsm: 'X_spatial' uns: 'genome', 'modality' img: 'sample_id', 'image_id', 'data', 'scale_factor', 'spot_diameter'
pg.qc_metrics(data, mito_prefix='mt-')
Then we can view the metrics. For example, the code below shows the 2.5% and 97.5% quantiles for number of genes:
np.percentile(data.obs['n_genes'], [2.5, 97.5])
array([2917.675, 8664.475])
Based on the quantiles above, we filter the data by number of genes between 2.5% and 97.5%, and percent of mito gene expression below 20%:
pg.qc_metrics(data, min_genes = 2917, max_genes = 8664, mito_prefix='mt-', percent_mito=20)
pg.qcviolin(data, plot_type='gene', dpi=100)
pg.qcviolin(data, plot_type='count', dpi=100)
pg.qcviolin(data, plot_type='mito', dpi=100)
Now do the actual filteration below:
pg.filter_data(data)
2022-03-09 14:44:16,785 - pegasusio.qc_utils - INFO - After filtration, 2229 out of 2702 cell barcodes are kept in UnimodalData object mm10-visium. 2022-03-09 14:44:16,785 - pegasus.tools.preprocessing - INFO - Function 'filter_data' finished in 0.16s.
And identify robust genes:
pg.identify_robust_genes(data)
2022-03-09 14:44:17,051 - pegasus.tools.preprocessing - INFO - After filtration, 21680/32285 genes are kept. Among 21680 genes, 20184 genes are robust. 2022-03-09 14:44:17,052 - pegasus.tools.preprocessing - INFO - Function 'identify_robust_genes' finished in 0.26s.
pg.log_norm(data)
pg.highly_variable_features(data)
pg.pca(data)
pg.neighbors(data)
pg.leiden(data)
pg.umap(data)
2022-03-09 14:44:17,243 - pegasus.tools.preprocessing - INFO - Function 'log_norm' finished in 0.19s. 2022-03-09 14:44:17,272 - pegasus.tools.hvf_selection - INFO - Function 'estimate_feature_statistics' finished in 0.03s. 2022-03-09 14:44:17,312 - pegasus.tools.hvf_selection - INFO - 2000 highly variable features have been selected. 2022-03-09 14:44:17,312 - pegasus.tools.hvf_selection - INFO - Function 'highly_variable_features' finished in 0.07s. 2022-03-09 14:44:17,781 - pegasus.tools.preprocessing - INFO - Function 'pca' finished in 0.47s. 2022-03-09 14:44:17,990 - pegasus.tools.nearest_neighbors - INFO - Function 'get_neighbors' finished in 0.21s. 2022-03-09 14:44:18,061 - pegasus.tools.nearest_neighbors - INFO - Function 'calculate_affinity_matrix' finished in 0.07s. 2022-03-09 14:44:18,125 - pegasus.tools.graph_operations - INFO - Function 'construct_graph' finished in 0.05s. 2022-03-09 14:44:18,348 - pegasus.tools.clustering - INFO - Leiden clustering is done. Get 11 clusters. 2022-03-09 14:44:18,351 - pegasus.tools.clustering - INFO - Function 'leiden' finished in 0.29s. 2022-03-09 14:44:18,352 - pegasus.tools.nearest_neighbors - INFO - Found cached kNN results, no calculation is required. 2022-03-09 14:44:18,352 - pegasus.tools.nearest_neighbors - INFO - Function 'get_neighbors' finished in 0.00s. 2022-03-09 14:44:18,353 - pegasus.tools.visualization - INFO - Using umap kNN graph because number of cells 2229 is smaller than 4096 or knn_indices is not provided. UMAP(min_dist=0.5, random_state=0, verbose=True) Wed Mar 9 14:44:18 2022 Construct fuzzy simplicial set Wed Mar 9 14:44:20 2022 Finding Nearest Neighbors Wed Mar 9 14:44:22 2022 Finished Nearest Neighbor Search Wed Mar 9 14:44:24 2022 Construct embedding
Wed Mar 9 14:44:27 2022 Finished embedding 2022-03-09 14:44:27,755 - pegasus.tools.visualization - INFO - Function 'umap' finished in 9.40s.
Run the code below to show the UMAP plot of cells colored by cluster labels generated by Leiden algorithm on their PCA embedding:
pg.scatter(data, attrs='leiden_labels')
Run the function below to perform Differential Expression analysis:
pg.de_analysis(data, cluster='leiden_labels')
2022-03-09 14:44:28,443 - pegasus.tools.diff_expr - INFO - CSR matrix is converted to CSC matrix. Time spent = 0.2862s. 2022-03-09 14:44:45,506 - pegasus.tools.diff_expr - INFO - MWU test and AUROC calculation are finished. Time spent = 17.0626s. 2022-03-09 14:44:45,560 - pegasus.tools.diff_expr - INFO - Sufficient statistics are collected. Time spent = 0.0545s. 2022-03-09 14:44:45,620 - pegasus.tools.diff_expr - INFO - Differential expression analysis is finished. 2022-03-09 14:44:45,621 - pegasus.tools.diff_expr - INFO - Function 'de_analysis' finished in 17.47s.
This is to annotate cluster-specific cell types with preset mouse brain markers:
celltype_dict = pg.infer_cell_types(data, markers = 'mouse_brain')
celltype_dict
{'1': [name: GABAergic neuron; score: 0.99; average marker percentage: 87.16%; strong support: (Gad1+,94.82%),(Gad2+,89.00%); weak support: (Slc32a1+,77.67%)], '2': [name: Glutamatergic neuron; score: 0.75; average marker percentage: 90.04%; strong support: (Slc17a7+,100.00%),(Neurod6+,82.21%),(Neurod2+,87.90%), name: GABAergic neuron; score: 0.59; average marker percentage: 85.05%; strong support: (Gad1+,95.02%); weak support: (Slc32a1+,75.09%)], '3': [name: GABAergic neuron; score: 1.00; average marker percentage: 89.38%; strong support: (Gad1+,90.27%),(Gad2+,94.25%),(Slc32a1+,83.63%), name: Oligodendrocyte; score: 1.00; average marker percentage: 89.60%; strong support: (Mbp+,100.00%),(Plp1+,100.00%),(Mog+,96.46%),(Olig1+,99.56%),(Olig2+,65.04%),(Sox10+,76.55%)], '4': [name: Glutamatergic neuron; score: 0.75; average marker percentage: 95.61%; strong support: (Slc17a7+,100.00%),(Neurod6+,95.91%),(Neurod2+,90.91%)], '5': [name: Oligodendrocyte; score: 1.00; average marker percentage: 96.65%; strong support: (Mbp+,100.00%),(Plp1+,100.00%),(Mog+,100.00%),(Olig1+,99.53%),(Olig2+,85.51%),(Sox10+,94.86%), name: Microglia; score: 0.94; average marker percentage: 72.78%; strong support: (C1qb+,81.78%),(Ctss+,80.37%),(Csf1r+,78.04%); weak support: (P2ry12+,50.93%), name: Choroid Coch; score: 0.73; average marker percentage: 18.69%; weak support: (Tgfbi+,18.69%)], '6': [name: GABAergic neuron; score: 1.00; average marker percentage: 93.33%; strong support: (Gad1+,94.29%),(Gad2+,95.71%),(Slc32a1+,90.00%), name: Ependyma; score: 0.68; average marker percentage: 24.76%; weak support: (Ccdc153+,24.76%)], '7': [name: Oligodendrocyte; score: 0.99; average marker percentage: 86.17%; strong support: (Mbp+,100.00%),(Plp1+,100.00%),(Mog+,94.66%),(Olig1+,98.54%); weak support: (Olig2+,56.80%),(Sox10+,66.99%)], '8': [name: Glutamatergic neuron; score: 0.50; average marker percentage: 91.88%; strong support: (Slc17a7+,100.00%),(Neurod2+,83.77%)], '9': [name: Mural; score: 1.00; average marker percentage: 73.76%; strong support: (Rgs5+,80.85%),(Acta2+,66.67%), name: Choroid Coch; score: 1.00; average marker percentage: 41.13%; strong support: (Tgfbi+,41.13%), name: Ependyma; score: 1.00; average marker percentage: 51.06%; strong support: (Ccdc153+,51.06%), name: Smooth muscle cell; score: 1.00; average marker percentage: 66.67%; strong support: (Vtn+,86.52%),(Colec12+,46.81%), name: Astrocyte; score: 0.95; average marker percentage: 79.43%; strong support: (Aqp4+,90.07%),(Gja1+,95.04%); weak support: (F3+,68.79%),(Prex2+,63.83%), name: Endothelial; score: 0.84; average marker percentage: 58.33%; strong support: (Dcn+,69.50%),(Id1+,88.65%); weak support: (Flt1+,60.99%),(Xdh+,14.18%), name: Microglia; score: 0.75; average marker percentage: 88.89%; strong support: (C1qb+,95.04%),(Ctss+,88.65%); weak support: (Csf1r+,82.98%), name: OPC; score: 0.50; average marker percentage: 58.16%; strong support: (Pdgfra+,58.16%), name: Fibroblast; score: 0.50; average marker percentage: 69.50%; strong support: (Dcn+,69.50%)], '10': [name: Astrocyte; score: 0.63; average marker percentage: 80.43%; strong support: (Gja1+,97.83%); weak support: (Aqp4+,84.06%),(F3+,59.42%)], '11': [name: Glutamatergic neuron; score: 0.75; average marker percentage: 92.05%; strong support: (Slc17a7+,100.00%),(Neurod6+,85.38%),(Neurod2+,90.77%)]}
cluster_names = pg.infer_cluster_names(celltype_dict)
pg.annotate(data, name='anno', based_on='leiden_labels', anno_dict=cluster_names)
pg.scatter(data, 'anno', legend_loc='on data')
Besides the UMAP plot, Pegasus also provides spatial function to generate spatial plots. Below is the spatial plot of cells colored by their cell types:
pg.spatial(data, 'anno')
pg.write_output(data, "spatial_results.zarr.zip")
2022-03-09 14:44:48,260 - pegasusio.readwrite - INFO - zarr.zip file 'spatial_results.zarr.zip' is written. 2022-03-09 14:44:48,261 - pegasusio.readwrite - INFO - Function 'write_output' finished in 0.50s.