Authors: Hui Ma, Yiming Yang, Rimte Rocher
Date: 2022-04-10
Notebook Source: batch_correction.ipynb
import pegasus as pg
Some batch correction methods allow use GPU for acceleration, which we will enable below. Below is a summary on the test environment when compiling this notebook:
CPU Model | CPU(s) | GPU Model | GPU Memory | GPU Driver version | CUDA version | Operating System |
---|---|---|---|---|---|---|
AMD Ryzen 7 2700X | 16 | NVIDIA GeForce 1050 Ti | 4GB | 470.103.01 | 11.4 | Ubuntu Linux 21.10 |
In this tutorial, we'll use a gene-count matrix dataset on human bone marrow from 8 donors, and show how to use the batch correction methods in Pegasus to tackle the batch effects in the data.
The dataset is stored at https://storage.googleapis.com/terra-featured-workspaces/Cumulus/MantonBM_nonmix_subset.zarr.zip. You can also use gsutil to download it via its Google bucket URL (gs://terra-featured-workspaces/Cumulus/MantonBM_nonmix_subset.zarr.zip
).
Now load the count matrix:
data = pg.read_input("MantonBM_nonmix_subset.zarr.zip")
data
2022-04-10 17:53:45,594 - pegasusio.readwrite - INFO - zarr file 'MantonBM_nonmix_subset.zarr.zip' is loaded. 2022-04-10 17:53:45,595 - pegasusio.readwrite - INFO - Function 'read_input' finished in 0.29s.
MultimodalData object with 1 UnimodalData: 'GRCh38-rna' It currently binds to UnimodalData object GRCh38-rna UnimodalData object with n_obs x n_vars = 48219 x 36601 Genome: GRCh38; Modality: rna It contains 1 matrix: 'X' It currently binds to matrix 'X' as X obs: 'n_genes', 'Channel', 'gender' var: 'featureid' uns: 'genome', 'modality'
'Channel'
is the batch key. Each batch is associated with one donor, so there are 8 batches in total.
First, preprocess the data. This includes filtration, selecting robust genes, and log-normalization:
pg.qc_metrics(data, min_genes=500, max_genes=6000, mito_prefix='MT-', percent_mito=10)
pg.filter_data(data)
pg.identify_robust_genes(data)
pg.log_norm(data)
2022-04-10 17:53:45,917 - pegasusio.qc_utils - INFO - After filtration, 35465 out of 48219 cell barcodes are kept in UnimodalData object GRCh38-rna. 2022-04-10 17:53:45,917 - pegasus.tools.preprocessing - INFO - Function 'filter_data' finished in 0.16s. 2022-04-10 17:53:46,303 - pegasus.tools.preprocessing - INFO - After filtration, 25653/36601 genes are kept. Among 25653 genes, 17516 genes are robust. 2022-04-10 17:53:46,304 - pegasus.tools.preprocessing - INFO - Function 'identify_robust_genes' finished in 0.39s. 2022-04-10 17:53:46,821 - pegasus.tools.preprocessing - INFO - Function 'log_norm' finished in 0.52s.
After quality-control, distribution of cells in all the 8 batches is:
data.obs['Channel'].value_counts()
MantonBM2_HiSeq_1 4935 MantonBM6_HiSeq_1 4665 MantonBM8_HiSeq_1 4511 MantonBM7_HiSeq_1 4452 MantonBM1_HiSeq_1 4415 MantonBM3_HiSeq_1 4225 MantonBM4_HiSeq_1 4172 MantonBM5_HiSeq_1 4090 Name: Channel, dtype: int64
We first perform downstream steps without considering batch effects. In this way, you can see where the batch effects exist, and moreover, we'll use this result as the baseline when comparing different batch correction methods.
data_baseline = data.copy()
pg.highly_variable_features(data_baseline)
2022-04-10 17:53:47,037 - pegasus.tools.hvf_selection - INFO - Function 'estimate_feature_statistics' finished in 0.11s. 2022-04-10 17:53:47,063 - pegasus.tools.hvf_selection - INFO - 2000 highly variable features have been selected. 2022-04-10 17:53:47,063 - pegasus.tools.hvf_selection - INFO - Function 'highly_variable_features' finished in 0.14s.
In this tutorial, the downstream steps consists of:
pg.pca(data_baseline)
pg.neighbors(data_baseline)
pg.louvain(data_baseline)
pg.umap(data_baseline)
pg.scatter(data_baseline, attrs=['louvain_labels', 'Channel'], basis='umap')
2022-04-10 17:53:51,618 - pegasus.tools.preprocessing - INFO - Function 'pca' finished in 4.55s. 2022-04-10 17:53:55,909 - pegasus.tools.nearest_neighbors - INFO - Function 'get_neighbors' finished in 4.29s. 2022-04-10 17:53:56,823 - pegasus.tools.nearest_neighbors - INFO - Function 'calculate_affinity_matrix' finished in 0.91s. 2022-04-10 17:53:57,954 - pegasus.tools.graph_operations - INFO - Function 'construct_graph' finished in 1.13s. 2022-04-10 17:54:10,621 - pegasus.tools.clustering - INFO - Louvain clustering is done. Get 19 clusters. 2022-04-10 17:54:10,621 - pegasus.tools.clustering - INFO - Function 'louvain' finished in 13.80s. 2022-04-10 17:54:10,622 - pegasus.tools.nearest_neighbors - INFO - Found cached kNN results, no calculation is required. 2022-04-10 17:54:10,622 - pegasus.tools.nearest_neighbors - INFO - Function 'get_neighbors' finished in 0.00s. UMAP(min_dist=0.5, precomputed_knn=(array([[ 0, 30526, 27514, ..., 26911, 222, 25411], [ 1, 29723, 2651, ..., 4946, 1177, 30829], [ 2, 35262, 20170, ..., 2478, 34953, 19206], ..., [35462, 6096, 30824, ..., 15433, 19601, 30227], [35463, 34252, 34709, ..., 5765, 32603, 31084], [35464, 5379, 35401, ..., 10528, 24551, 27891]]), array([[ 0. , 4.9841695, 5.4866176, ..., 5.8861804, 5.899298 , 5.9012275], [ 0. , 8.27516 , 8.679712 , ..., 9.440161 , 9.484825 , 9.490481 ], [ 0. , 4.7937765, 5.125005 , ..., 5.489494 , 5.4935 , 5.495704 ], ..., [ 0. , 7.368082 , 7.414498 , ..., 8.2814455, 8.314083 , 8.34855 ], [ 0. , 8.759402 , 8.7638645, ..., 10.250344 , 10.255535 , 10.261412 ], [ 0. , 7.1296797, 7.2653217, ..., 8.02741 , 8.158878 , 8.207317 ]], dtype=float32), <pegasus.tools.visualization.DummyNNDescent object at 0x7fe5d373cf70>), random_state=0, verbose=True) Sun Apr 10 17:54:10 2022 Construct fuzzy simplicial set Sun Apr 10 17:54:12 2022 Construct embedding
Epochs completed: 0%| 0/200 [00:00]
Sun Apr 10 17:54:28 2022 Finished embedding 2022-04-10 17:54:28,276 - pegasus.tools.visualization - INFO - Function 'umap' finished in 17.65s.