import pegasus as pg


data = pg.read_input("MantonBM_result.zarr.zip")
data

2022-03-09 00:02:09,948 - pegasusio.readwrite - INFO - zarr file 'MantonBM_result.zarr.zip' is loaded.
2022-03-09 00:02:09,949 - pegasusio.readwrite - INFO - Function 'read_input' finished in 0.77s.

MultimodalData object with 1 UnimodalData: 'GRCh38-rna'
    It currently binds to UnimodalData object GRCh38-rna

UnimodalData object with n_obs x n_vars = 35465 x 25653
    Genome: GRCh38; Modality: rna
    It contains 2 matrices: 'X', 'raw.X'
    It currently binds to matrix 'X' as X

    obs: 'n_genes', 'Channel', 'gender', 'n_counts', 'percent_mito', 'scale', 'louvain_labels'(cluster), 'anno'
    var: 'featureid', 'n_cells', 'percent_cells', 'robust', 'highly_variable_features', 'mean', 'var', 'hvf_loess', 'hvf_rank'
    obsm: 'X_diffmap', 'X_fle'(basis), 'X_pca', 'X_pca_harmony', 'X_phi', 'X_tsne'(basis), 'X_umap'(basis), 'diffmap_knn_distances'(knn), 'diffmap_knn_indices'(knn), 'pca_harmony_knn_distances'(knn), 'pca_harmony_knn_indices'(knn)
    varm: 'means', 'partial_sum', 'de_res'
    obsp: 'W_diffmap', 'W_pca_harmony'
    uns: 'genome', 'louvain_resolution', 'modality', 'norm_count', 'pca_features', 'stdzn_max_value', 'PCs', 'diffmap_evals', 'ncells', 'stdzn_mean', 'stdzn_std', '_attr2type', 'df_qcplot', 'pca'


# Infer doublets separately for each channel; the log below shows one '_run_scrublet' pass per sample.
pg.infer_doublets(data, channel_attr='Channel', clust_attr='anno')

2022-03-09 00:02:10,122 - pegasus.tools.preprocessing - INFO - After filtration, 21108/25653 genes are kept. Among 21108 genes, 17375 genes are robust.
2022-03-09 00:02:10,123 - pegasus.tools.preprocessing - INFO - Function 'identify_robust_genes' finished in 0.14s.
2022-03-09 00:02:10,187 - pegasus.tools.preprocessing - INFO - Function 'log_norm' finished in 0.06s.
2022-03-09 00:02:10,200 - pegasus.tools.hvf_selection - INFO - Function 'estimate_feature_statistics' finished in 0.01s.
2022-03-09 00:02:10,231 - pegasus.tools.hvf_selection - INFO - 2000 highly variable features have been selected.
2022-03-09 00:02:10,231 - pegasus.tools.hvf_selection - INFO - Function 'highly_variable_features' finished in 0.04s.
2022-03-09 00:02:12,346 - pegasus.tools.doublet_detection - INFO - Sample MantonBM1_HiSeq_1: doublet threshold = 0.1606; total cells = 4415; neotypic doublet rate in simulation = 44.52%; neotypic doublet rate = 2.22%.
2022-03-09 00:02:12,929 - pegasus.tools.doublet_detection - INFO - Function '_run_scrublet' finished in 2.70s.
2022-03-09 00:02:13,699 - pegasus.tools.preprocessing - INFO - After filtration, 20329/25653 genes are kept. Among 20329 genes, 16739 genes are robust.
2022-03-09 00:02:13,700 - pegasus.tools.preprocessing - INFO - Function 'identify_robust_genes' finished in 0.17s.
2022-03-09 00:02:13,771 - pegasus.tools.preprocessing - INFO - Function 'log_norm' finished in 0.07s.
2022-03-09 00:02:13,784 - pegasus.tools.hvf_selection - INFO - Function 'estimate_feature_statistics' finished in 0.01s.
2022-03-09 00:02:13,819 - pegasus.tools.hvf_selection - INFO - 2000 highly variable features have been selected.
2022-03-09 00:02:13,819 - pegasus.tools.hvf_selection - INFO - Function 'highly_variable_features' finished in 0.05s.
2022-03-09 00:02:16,194 - pegasus.tools.doublet_detection - INFO - Sample MantonBM2_HiSeq_1: doublet threshold = 0.1401; total cells = 4935; neotypic doublet rate in simulation = 57.44%; neotypic doublet rate = 3.12%.
2022-03-09 00:02:16,607 - pegasus.tools.doublet_detection - INFO - Function '_run_scrublet' finished in 2.79s.
2022-03-09 00:02:17,384 - pegasus.tools.preprocessing - INFO - After filtration, 20231/25653 genes are kept. Among 20231 genes, 16424 genes are robust.
2022-03-09 00:02:17,385 - pegasus.tools.preprocessing - INFO - Function 'identify_robust_genes' finished in 0.18s.
2022-03-09 00:02:17,452 - pegasus.tools.preprocessing - INFO - Function 'log_norm' finished in 0.07s.
2022-03-09 00:02:17,464 - pegasus.tools.hvf_selection - INFO - Function 'estimate_feature_statistics' finished in 0.01s.
2022-03-09 00:02:17,501 - pegasus.tools.hvf_selection - INFO - 2000 highly variable features have been selected.
2022-03-09 00:02:17,502 - pegasus.tools.hvf_selection - INFO - Function 'highly_variable_features' finished in 0.05s.
2022-03-09 00:02:19,696 - pegasus.tools.doublet_detection - INFO - Sample MantonBM3_HiSeq_1: doublet threshold = 0.1495; total cells = 4225; neotypic doublet rate in simulation = 41.70%; neotypic doublet rate = 1.78%.
2022-03-09 00:02:20,093 - pegasus.tools.doublet_detection - INFO - Function '_run_scrublet' finished in 2.59s.
2022-03-09 00:02:20,891 - pegasus.tools.preprocessing - INFO - After filtration, 20593/25653 genes are kept. Among 20593 genes, 16941 genes are robust.
2022-03-09 00:02:20,891 - pegasus.tools.preprocessing - INFO - Function 'identify_robust_genes' finished in 0.19s.
2022-03-09 00:02:20,960 - pegasus.tools.preprocessing - INFO - Function 'log_norm' finished in 0.07s.
2022-03-09 00:02:20,973 - pegasus.tools.hvf_selection - INFO - Function 'estimate_feature_statistics' finished in 0.01s.
2022-03-09 00:02:21,005 - pegasus.tools.hvf_selection - INFO - 2000 highly variable features have been selected.
2022-03-09 00:02:21,006 - pegasus.tools.hvf_selection - INFO - Function 'highly_variable_features' finished in 0.04s.
2022-03-09 00:02:23,170 - pegasus.tools.doublet_detection - INFO - Sample MantonBM4_HiSeq_1: doublet threshold = 0.1453; total cells = 4172; neotypic doublet rate in simulation = 40.96%; neotypic doublet rate = 1.73%.
2022-03-09 00:02:23,604 - pegasus.tools.doublet_detection - INFO - Function '_run_scrublet' finished in 2.60s.
2022-03-09 00:02:24,411 - pegasus.tools.preprocessing - INFO - After filtration, 20955/25653 genes are kept. Among 20955 genes, 17374 genes are robust.
2022-03-09 00:02:24,411 - pegasus.tools.preprocessing - INFO - Function 'identify_robust_genes' finished in 0.20s.
2022-03-09 00:02:24,490 - pegasus.tools.preprocessing - INFO - Function 'log_norm' finished in 0.08s.
2022-03-09 00:02:24,503 - pegasus.tools.hvf_selection - INFO - Function 'estimate_feature_statistics' finished in 0.01s.
2022-03-09 00:02:24,534 - pegasus.tools.hvf_selection - INFO - 2000 highly variable features have been selected.
2022-03-09 00:02:24,535 - pegasus.tools.hvf_selection - INFO - Function 'highly_variable_features' finished in 0.04s.
2022-03-09 00:02:26,448 - pegasus.tools.doublet_detection - INFO - Sample MantonBM5_HiSeq_1: doublet threshold = 0.1651; total cells = 4090; neotypic doublet rate in simulation = 33.70%; neotypic doublet rate = 1.59%.
2022-03-09 00:02:26,874 - pegasus.tools.doublet_detection - INFO - Function '_run_scrublet' finished in 2.34s.
2022-03-09 00:02:27,708 - pegasus.tools.preprocessing - INFO - After filtration, 21035/25653 genes are kept. Among 21035 genes, 17344 genes are robust.
2022-03-09 00:02:27,709 - pegasus.tools.preprocessing - INFO - Function 'identify_robust_genes' finished in 0.22s.
2022-03-09 00:02:27,788 - pegasus.tools.preprocessing - INFO - Function 'log_norm' finished in 0.08s.
2022-03-09 00:02:27,802 - pegasus.tools.hvf_selection - INFO - Function 'estimate_feature_statistics' finished in 0.01s.
2022-03-09 00:02:27,836 - pegasus.tools.hvf_selection - INFO - 2000 highly variable features have been selected.
2022-03-09 00:02:27,837 - pegasus.tools.hvf_selection - INFO - Function 'highly_variable_features' finished in 0.05s.
2022-03-09 00:02:30,166 - pegasus.tools.doublet_detection - INFO - Sample MantonBM6_HiSeq_1: doublet threshold = 0.1871; total cells = 4665; neotypic doublet rate in simulation = 41.07%; neotypic doublet rate = 1.59%.
2022-03-09 00:02:30,560 - pegasus.tools.doublet_detection - INFO - Function '_run_scrublet' finished in 2.72s.
2022-03-09 00:02:31,331 - pegasus.tools.preprocessing - INFO - After filtration, 20449/25653 genes are kept. Among 20449 genes, 16754 genes are robust.
2022-03-09 00:02:31,332 - pegasus.tools.preprocessing - INFO - Function 'identify_robust_genes' finished in 0.22s.
2022-03-09 00:02:31,402 - pegasus.tools.preprocessing - INFO - Function 'log_norm' finished in 0.07s.
2022-03-09 00:02:31,413 - pegasus.tools.hvf_selection - INFO - Function 'estimate_feature_statistics' finished in 0.01s.
2022-03-09 00:02:31,448 - pegasus.tools.hvf_selection - INFO - 2000 highly variable features have been selected.
2022-03-09 00:02:31,448 - pegasus.tools.hvf_selection - INFO - Function 'highly_variable_features' finished in 0.05s.
2022-03-09 00:02:33,621 - pegasus.tools.doublet_detection - INFO - Sample MantonBM7_HiSeq_1: doublet threshold = 0.1420; total cells = 4452; neotypic doublet rate in simulation = 47.79%; neotypic doublet rate = 3.01%.
2022-03-09 00:02:34,022 - pegasus.tools.doublet_detection - INFO - Function '_run_scrublet' finished in 2.57s.
2022-03-09 00:02:34,829 - pegasus.tools.preprocessing - INFO - After filtration, 20070/25653 genes are kept. Among 20070 genes, 16434 genes are robust.
2022-03-09 00:02:34,830 - pegasus.tools.preprocessing - INFO - Function 'identify_robust_genes' finished in 0.21s.
2022-03-09 00:02:34,902 - pegasus.tools.preprocessing - INFO - Function 'log_norm' finished in 0.07s.
2022-03-09 00:02:34,914 - pegasus.tools.hvf_selection - INFO - Function 'estimate_feature_statistics' finished in 0.01s.
2022-03-09 00:02:34,947 - pegasus.tools.hvf_selection - INFO - 2000 highly variable features have been selected.
2022-03-09 00:02:34,948 - pegasus.tools.hvf_selection - INFO - Function 'highly_variable_features' finished in 0.05s.
2022-03-09 00:02:37,237 - pegasus.tools.doublet_detection - INFO - Sample MantonBM8_HiSeq_1: doublet threshold = 0.1923; total cells = 4511; neotypic doublet rate in simulation = 34.21%; neotypic doublet rate = 1.15%.
2022-03-09 00:02:37,656 - pegasus.tools.doublet_detection - INFO - Function '_run_scrublet' finished in 2.71s.
2022-03-09 00:02:38,252 - pegasus.tools.doublet_detection - INFO - Doublets are predicted!
2022-03-09 00:02:38,253 - pegasus.tools.doublet_detection - INFO - Function 'infer_doublets' finished in 28.30s.
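
The per-sample lines in the log above can be condensed into a small summary. The following is a minimal sketch that does not call Pegasus at all: it only re-parses the logged text with a regular expression. The names `LOG_LINES`, `PATTERN`, and `parse_doublet_log` are hypothetical helpers introduced here for illustration, and the cell-weighted average is just one plausible way to aggregate the per-sample rates.

```python
import re

# Per-sample summary lines copied from the infer_doublets log above
# (timestamps and logger names trimmed).
LOG_LINES = [
    "Sample MantonBM1_HiSeq_1: doublet threshold = 0.1606; total cells = 4415; neotypic doublet rate in simulation = 44.52%; neotypic doublet rate = 2.22%.",
    "Sample MantonBM2_HiSeq_1: doublet threshold = 0.1401; total cells = 4935; neotypic doublet rate in simulation = 57.44%; neotypic doublet rate = 3.12%.",
    "Sample MantonBM3_HiSeq_1: doublet threshold = 0.1495; total cells = 4225; neotypic doublet rate in simulation = 41.70%; neotypic doublet rate = 1.78%.",
    "Sample MantonBM4_HiSeq_1: doublet threshold = 0.1453; total cells = 4172; neotypic doublet rate in simulation = 40.96%; neotypic doublet rate = 1.73%.",
    "Sample MantonBM5_HiSeq_1: doublet threshold = 0.1651; total cells = 4090; neotypic doublet rate in simulation = 33.70%; neotypic doublet rate = 1.59%.",
    "Sample MantonBM6_HiSeq_1: doublet threshold = 0.1871; total cells = 4665; neotypic doublet rate in simulation = 41.07%; neotypic doublet rate = 1.59%.",
    "Sample MantonBM7_HiSeq_1: doublet threshold = 0.1420; total cells = 4452; neotypic doublet rate in simulation = 47.79%; neotypic doublet rate = 3.01%.",
    "Sample MantonBM8_HiSeq_1: doublet threshold = 0.1923; total cells = 4511; neotypic doublet rate in simulation = 34.21%; neotypic doublet rate = 1.15%.",
]

# Hypothetical pattern matching the log format shown above.
PATTERN = re.compile(
    r"Sample (?P<sample>\S+): doublet threshold = (?P<threshold>[\d.]+); "
    r"total cells = (?P<cells>\d+); "
    r"neotypic doublet rate in simulation = (?P<sim_rate>[\d.]+)%; "
    r"neotypic doublet rate = (?P<rate>[\d.]+)%"
)


def parse_doublet_log(lines):
    """Return one dict per sample line; lines that do not match are skipped."""
    rows = []
    for line in lines:
        match = PATTERN.search(line)
        if match is None:
            continue
        rows.append(
            {
                "sample": match["sample"],
                "threshold": float(match["threshold"]),
                "cells": int(match["cells"]),
                "sim_rate": float(match["sim_rate"]),
                "rate": float(match["rate"]),
            }
        )
    return rows


rows = parse_doublet_log(LOG_LINES)
total_cells = sum(r["cells"] for r in rows)
# Average of the per-sample neotypic doublet rates, weighted by cell count.
weighted_rate = sum(r["cells"] * r["rate"] for r in rows) / total_cells

for r in rows:
    print(f"{r['sample']}: threshold={r['threshold']:.4f}, cells={r['cells']}, rate={r['rate']:.2f}%")
print(f"total cells = {total_cells}; cell-weighted neotypic doublet rate = {weighted_rate:.2f}%")
```

The per-sample cell counts should add up to the 35465 observations reported for the loaded object, which gives a quick consistency check on the parsed values.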


pg.scatter(data, attrs=['anno', 'doublet_score'], basis='umap', wspace=1.2)


data.uns['pred_dbl_cluster']


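# Record the singlet/doublet call for each cell in data.obs['demux_type'], then count both classes.
pg.mark_doublets(data)
data.obs['demux_type'].value_counts()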

singlet    34741
doublet      724
Name: demux_type, dtype: int64
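
As a quick sanity check, the counts above translate into an overall doublet fraction. This is a minimal sketch using only the two numbers printed by `value_counts()`; the `counts` dictionary is typed in by hand here rather than read from `data`.

```python
# Counts copied from the value_counts() output above.
counts = {"singlet": 34741, "doublet": 724}

total = sum(counts.values())
doublet_fraction = counts["doublet"] / total

print(f"{counts['doublet']} of {total} cells marked as doublets ({doublet_fraction:.2%})")
# prints: 724 of 35465 cells marked as doublets (2.04%)
```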


pg.scatter(data, attrs=['anno', 'demux_type'], legend_loc=['on data', 'right margin'],
           wspace=0.1, alpha=[1.0, 0.8], palettes='demux_type:gainsboro,red')


# Keep only cells marked as singlets, then apply the filter to the data.
pg.qc_metrics(data, select_singlets=True)
pg.filter_data(data)

2022-03-09 00:02:48,358 - pegasusio.qc_utils - INFO - After filtration, 34741 out of 35465 cell barcodes are kept in UnimodalData object GRCh38-rna.
2022-03-09 00:02:48,359 - pegasus.tools.preprocessing - INFO - Function 'filter_data' finished in 0.75s.


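# Re-run the downstream analysis on the remaining singlets.
pg.highly_variable_features(data, batch='Channel')
pg.pca(data)
pca_key = pg.run_harmony(data)  # key of the Harmony-corrected PCA embedding
pg.neighbors(data, rep=pca_key)
pg.louvain(data, rep=pca_key)
pg.umap(data, rep=pca_key)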

2022-03-09 00:02:48,558 - pegasus.tools.hvf_selection - INFO - Function 'estimate_feature_statistics' finished in 0.19s.
2022-03-09 00:02:48,593 - pegasus.tools.hvf_selection - INFO - 2000 highly variable features have been selected.
2022-03-09 00:02:48,594 - pegasus.tools.hvf_selection - INFO - Function 'highly_variable_features' finished in 0.23s.
2022-03-09 00:02:53,712 - pegasus.tools.preprocessing - INFO - Function 'pca' finished in 5.12s.
2022-03-09 00:02:54,115 - pegasus.tools.batch_correction - INFO - Start integration using Harmony.
	Initialization is completed.
	Completed 1 / 10 iteration(s).
	Completed 2 / 10 iteration(s).
	Completed 3 / 10 iteration(s).
	Completed 4 / 10 iteration(s).
	Completed 5 / 10 iteration(s).
	Completed 6 / 10 iteration(s).
Reach convergence after 6 iteration(s).
2022-03-09 00:03:11,262 - pegasus.tools.batch_correction - INFO - Function 'run_harmony' finished in 17.54s.
2022-03-09 00:03:15,674 - pegasus.tools.nearest_neighbors - INFO - Function 'get_neighbors' finished in 4.41s.
2022-03-09 00:03:16,702 - pegasus.tools.nearest_neighbors - INFO - Function 'calculate_affinity_matrix' finished in 1.03s.
2022-03-09 00:03:17,786 - pegasus.tools.graph_operations - INFO - Function 'construct_graph' finished in 1.07s.
2022-03-09 00:03:43,612 - pegasus.tools.clustering - INFO - Louvain clustering is done. Get 16 clusters.
2022-03-09 00:03:43,644 - pegasus.tools.clustering - INFO - Function 'louvain' finished in 26.93s.
2022-03-09 00:03:43,645 - pegasus.tools.nearest_neighbors - INFO - Found cached kNN results, no calculation is required.
2022-03-09 00:03:43,645 - pegasus.tools.nearest_neighbors - INFO - Function 'get_neighbors' finished in 0.00s.
UMAP(min_dist=0.5, precomputed_knn=(<kNN indices array>, <kNN distances array>, <pegasus.tools.visualization.DummyNNDescent object>), random_state=0, verbose=True)
Wed Mar  9 00:03:43 2022 Construct fuzzy simplicial set
Wed Mar  9 00:03:45 2022 Construct embedding

Wed Mar  9 00:04:07 2022 Finished embedding
2022-03-09 00:04:07,911 - pegasus.tools.visualization - INFO - Function 'umap' finished in 24.27s.


# Re-annotate the new Louvain clusters with the 'human_immune' marker set.
pg.de_analysis(data, cluster='louvain_labels')
celltype_dict = pg.infer_cell_types(data, markers='human_immune', de_test='mwu', output_file='BM_celltype_re_dict.txt')
cluster_names = pg.infer_cluster_names(celltype_dict)
pg.annotate(data, name='anno', based_on='louvain_labels', anno_dict=cluster_names)

2022-03-09 00:04:08,788 - pegasus.tools.diff_expr - INFO - CSR matrix is converted to CSC matrix. Time spent = 0.8645s.
2022-03-09 00:04:26,835 - pegasus.tools.diff_expr - INFO - MWU test and AUROC calculation are finished. Time spent = 18.0463s.
2022-03-09 00:04:26,942 - pegasus.tools.diff_expr - INFO - Sufficient statistics are collected. Time spent = 0.1065s.
2022-03-09 00:04:27,032 - pegasus.tools.diff_expr - INFO - Differential expression analysis is finished.
2022-03-09 00:04:27,033 - pegasus.tools.diff_expr - INFO - Function 'de_analysis' finished in 19.11s.


pg.scatter(data, attrs='anno', legend_loc='on data', basis='umap')

               cluster  percentage      pval      qval
0       CD14+ Monocyte    4.462044  0.000003  0.000018
1  Natural killer cell    3.307692  0.000013  0.000036
2               B cell    3.277932  0.000004  0.000018

Doublet Detection Tutorial

Dataset

Sections

Detect Doublets

Remove Doublets and Recluster