Batch Correction Tutorial¶

Authors: Hui Ma, Yiming Yang, Rimte Rocher
Date: 2022-04-10
Notebook Source: batch_correction.ipynb

In [1]:
import pegasus as pg

Test Environment¶

Some batch correction methods allow use GPU for acceleration, which we will enable below. Below is a summary on the test environment when compiling this notebook:

CPU Model CPU(s) GPU Model GPU Memory GPU Driver version CUDA version Operating System
AMD Ryzen 7 2700X 16 NVIDIA GeForce 1050 Ti 4GB 470.103.01 11.4 Ubuntu Linux 21.10

Dataset¶

In this tutorial, we'll use a gene-count matrix dataset on human bone marrow from 8 donors, and show how to use the batch correction methods in Pegasus to tackle the batch effects in the data.

The dataset is stored at https://storage.googleapis.com/terra-featured-workspaces/Cumulus/MantonBM_nonmix_subset.zarr.zip. You can also use gsutil to download it via its Google bucket URL (gs://terra-featured-workspaces/Cumulus/MantonBM_nonmix_subset.zarr.zip).

Now load the count matrix:

In [2]:
data = pg.read_input("MantonBM_nonmix_subset.zarr.zip")
data
2022-04-10 17:53:45,594 - pegasusio.readwrite - INFO - zarr file 'MantonBM_nonmix_subset.zarr.zip' is loaded.
2022-04-10 17:53:45,595 - pegasusio.readwrite - INFO - Function 'read_input' finished in 0.29s.
Out[2]:
MultimodalData object with 1 UnimodalData: 'GRCh38-rna'
    It currently binds to UnimodalData object GRCh38-rna

UnimodalData object with n_obs x n_vars = 48219 x 36601
    Genome: GRCh38; Modality: rna
    It contains 1 matrix: 'X'
    It currently binds to matrix 'X' as X

    obs: 'n_genes', 'Channel', 'gender'
    var: 'featureid'
    uns: 'genome', 'modality'

'Channel' is the batch key. Each batch is associated with one donor, so there are 8 batches in total.

Sections¶

  • Preprocessing
  • Clustering without Correcting Batch Effects
  • Batch Correction Methods
    • Harmony
    • Integrative-NMF
    • scVI
    • Scanorama
  • Comparing Batch CorrectionMethods

Preprocessing¶

First, preprocess the data. This includes filtration, selecting robust genes, and log-normalization:

In [3]:
pg.qc_metrics(data, min_genes=500, max_genes=6000, mito_prefix='MT-', percent_mito=10)
pg.filter_data(data)
pg.identify_robust_genes(data)
pg.log_norm(data)
2022-04-10 17:53:45,917 - pegasusio.qc_utils - INFO - After filtration, 35465 out of 48219 cell barcodes are kept in UnimodalData object GRCh38-rna.
2022-04-10 17:53:45,917 - pegasus.tools.preprocessing - INFO - Function 'filter_data' finished in 0.16s.
2022-04-10 17:53:46,303 - pegasus.tools.preprocessing - INFO - After filtration, 25653/36601 genes are kept. Among 25653 genes, 17516 genes are robust.
2022-04-10 17:53:46,304 - pegasus.tools.preprocessing - INFO - Function 'identify_robust_genes' finished in 0.39s.
2022-04-10 17:53:46,821 - pegasus.tools.preprocessing - INFO - Function 'log_norm' finished in 0.52s.

After quality-control, distribution of cells in all the 8 batches is:

In [4]:
data.obs['Channel'].value_counts()
Out[4]:
MantonBM2_HiSeq_1    4935
MantonBM6_HiSeq_1    4665
MantonBM8_HiSeq_1    4511
MantonBM7_HiSeq_1    4452
MantonBM1_HiSeq_1    4415
MantonBM3_HiSeq_1    4225
MantonBM4_HiSeq_1    4172
MantonBM5_HiSeq_1    4090
Name: Channel, dtype: int64

Clustering without Correcting Batch Effects¶

We first perform downstream steps without considering batch effects. In this way, you can see where the batch effects exist, and moreover, we'll use this result as the baseline when comparing different batch correction methods.

In [5]:
data_baseline = data.copy()
pg.highly_variable_features(data_baseline)
2022-04-10 17:53:47,037 - pegasus.tools.hvf_selection - INFO - Function 'estimate_feature_statistics' finished in 0.11s.
2022-04-10 17:53:47,063 - pegasus.tools.hvf_selection - INFO - 2000 highly variable features have been selected.
2022-04-10 17:53:47,063 - pegasus.tools.hvf_selection - INFO - Function 'highly_variable_features' finished in 0.14s.

In this tutorial, the downstream steps consists of:

  • PCA calculation: by default calculate 50 PCs;
  • kNN graph construction, and Louvain clustering based on it;
  • Calculate 2-dimensinoal UMAP embedding, and show UMAP plot.
In [6]:
pg.pca(data_baseline)
pg.neighbors(data_baseline)
pg.louvain(data_baseline)
pg.umap(data_baseline)
pg.scatter(data_baseline, attrs=['louvain_labels', 'Channel'], basis='umap')
2022-04-10 17:53:51,618 - pegasus.tools.preprocessing - INFO - Function 'pca' finished in 4.55s.
2022-04-10 17:53:55,909 - pegasus.tools.nearest_neighbors - INFO - Function 'get_neighbors' finished in 4.29s.
2022-04-10 17:53:56,823 - pegasus.tools.nearest_neighbors - INFO - Function 'calculate_affinity_matrix' finished in 0.91s.
2022-04-10 17:53:57,954 - pegasus.tools.graph_operations - INFO - Function 'construct_graph' finished in 1.13s.
2022-04-10 17:54:10,621 - pegasus.tools.clustering - INFO - Louvain clustering is done. Get 19 clusters.
2022-04-10 17:54:10,621 - pegasus.tools.clustering - INFO - Function 'louvain' finished in 13.80s.
2022-04-10 17:54:10,622 - pegasus.tools.nearest_neighbors - INFO - Found cached kNN results, no calculation is required.
2022-04-10 17:54:10,622 - pegasus.tools.nearest_neighbors - INFO - Function 'get_neighbors' finished in 0.00s.
UMAP(min_dist=0.5, precomputed_knn=(array([[    0, 30526, 27514, ..., 26911, 222, 25411],
       [    1, 29723, 2651, ..., 4946, 1177, 30829],
       [    2, 35262, 20170, ..., 2478, 34953, 19206],
       ...,
       [35462, 6096, 30824, ..., 15433, 19601, 30227],
       [35463, 34252, 34709, ..., 5765, 32603, 31084],
       [35464, 5379, 35401, ..., 10528, 24551, 27891]]), array([[ 0.       , 4.9841695, 5.4866176, ..., 5.8861804, 5.899298 ,
         5.9012275],
       [ 0.       , 8.27516  , 8.679712 , ..., 9.440161 , 9.484825 ,
         9.490481 ],
       [ 0.       , 4.7937765, 5.125005 , ..., 5.489494 , 5.4935   ,
         5.495704 ],
       ...,
       [ 0.       , 7.368082 , 7.414498 , ..., 8.2814455, 8.314083 ,
         8.34855  ],
       [ 0.       , 8.759402 , 8.7638645, ..., 10.250344 , 10.255535 ,
        10.261412 ],
       [ 0.       , 7.1296797, 7.2653217, ..., 8.02741  , 8.158878 ,
         8.207317 ]], dtype=float32), <pegasus.tools.visualization.DummyNNDescent object at 0x7fe5d373cf70>), random_state=0, verbose=True)
Sun Apr 10 17:54:10 2022 Construct fuzzy simplicial set
Sun Apr 10 17:54:12 2022 Construct embedding
Epochs completed:   0%|            0/200 [00:00]
Sun Apr 10 17:54:28 2022 Finished embedding
2022-04-10 17:54:28,276 - pegasus.tools.visualization - INFO - Function 'umap' finished in 17.65s.