Author: Bo Li, Yiming Yang
Date: 2022-04-10
Notebook Source: pseudobulk.ipynb
import pegasus as pg
import pandas as pd
In this tutorial, we'll use a gene-count matrix dataset on human bone marrow from 8 donors, create pseudobulk matrix regarding donors, and perform pseudobulk analysis on the data.
The dataset is stored at https://storage.googleapis.com/terra-featured-workspaces/Cumulus/MantonBM_nonmix_subset.zarr.zip. You can also use gsutil to download it via its Google bucket URL (gs://terra-featured-workspaces/Cumulus/MantonBM_nonmix_subset.zarr.zip).
Now load the count matrix:
data = pg.read_input('MantonBM_nonmix_subset.zarr.zip')
2022-04-10 21:52:30,898 - pegasusio.readwrite - INFO - zarr file 'MantonBM_nonmix_subset.zarr.zip' is loaded. 2022-04-10 21:52:30,899 - pegasusio.readwrite - INFO - Function 'read_input' finished in 0.38s.
Pegasus provides pseudobulk
function to generate pseudobulk matrix. The code below generate a pseudobulk count matrix regarding donors (Channel
), and transfer the gender attribute to the resulting pseudobulks:
pseudo = pg.pseudobulk(data, 'Channel', 'gender')
pseudo
2022-04-10 21:52:31,157 - pegasus.tools.pseudobulk - INFO - Function 'pseudobulk' finished in 0.25s.
UnimodalData object with n_obs x n_vars = 8 x 36601 Genome: Channel; Modality: pseudobulk It contains 1 matrix: 'X' It currently binds to matrix 'X' as X obs: 'gender' var: 'featureid' uns: 'genome', 'modality'
For details on pseudobulk
function, please see its documentation.
Pegasus has deseq2
function to perform DE analysis on Pseudobulk data, which is a Python wrapper of DESeq2 package in R (You need to first install the original R package). The code below analyzes based on a regression model considering the gender attribute, and estimates the contrast between female
and male
:
pg.deseq2(pseudo, '~gender', ('gender', 'female', 'male'))
R[write to console]: estimating size factors R[write to console]: estimating dispersions R[write to console]: gene-wise dispersion estimates R[write to console]: mean-dispersion relationship R[write to console]: final dispersion estimates R[write to console]: fitting model and testing
2022-04-10 21:52:47,487 - pegasus.tools.pseudobulk - INFO - Function 'deseq2' finished in 16.32s.
The DE result is a Pandas DataFrame object stored in pseudo.varm["deseq2"]
, which could be viewed as the following (ranked by log2 fold-change):
pd.DataFrame(pseudo.varm["deseq2"], index=pseudo.var_names)
baseMean | log2FoldChange | lfcSE | stat | pvalue | padj | |
---|---|---|---|---|---|---|
featurekey | ||||||
MIR1302-2HG | 0.113807 | 0.696538 | 3.533812 | 0.197107 | 0.843744 | 1.00000 |
FAM138A | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.00000 |
OR4F5 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.00000 |
AL627309.1 | 3.131268 | 0.239777 | 0.913821 | 0.262389 | 0.793022 | 0.99985 |
AL627309.3 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.00000 |
... | ... | ... | ... | ... | ... | ... |
AC141272.1 | 0.113807 | 0.696538 | 3.533812 | 0.197107 | 0.843744 | 1.00000 |
AC023491.2 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.00000 |
AC007325.1 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.00000 |
AC007325.4 | 68.286678 | -0.911313 | 0.724284 | -1.258226 | 0.208310 | 0.99985 |
AC007325.2 | 0.121696 | -0.746135 | 3.533812 | -0.211142 | 0.832777 | 1.00000 |
36601 rows × 6 columns
Notice that the fold-change calculated has female
be the numerator, while male
be the denominator.
For details on deseq2
function, please see its documentation.
markers = pg.pseudo.markers(pseudo)
print(markers)
{'up': baseMean log2FoldChange lfcSE stat pvalue \ featurekey XIST 1910.572444 13.615371 1.043313 13.050131 6.343849e-39 PPP1R2C 14.765863 7.317893 1.165723 6.277559 3.439292e-10 LINC00221 9.121555 5.188543 1.230549 4.216444 2.481844e-05 TSIX 13.654597 3.889226 0.725102 5.363692 8.153793e-08 KANTR 88.919116 1.248766 0.289454 4.314216 1.601703e-05 BX890604.2 297.379563 1.233275 0.260909 4.726846 2.280338e-06 JPX 1129.585486 0.888090 0.198601 4.471729 7.758989e-06 padj featurekey XIST 5.397347e-35 PPP1R2C 3.657687e-07 LINC00221 1.759627e-02 TSIX 7.708052e-05 KANTR 1.184982e-02 BX890604.2 1.940112e-03 JPX 6.001225e-03 , 'down': baseMean log2FoldChange lfcSE stat pvalue \ featurekey RPS4Y1 6547.060706 -16.138498 1.050588 -15.361401 2.970823e-53 EIF1AY 1316.973725 -13.825299 1.087960 -12.707548 5.369171e-37 DDX3Y 451.442937 -12.280059 1.083214 -11.336691 8.634834e-30 UTY 195.114474 -11.068404 1.074641 -10.299634 7.073023e-25 PRKY 152.182688 -10.709941 1.084543 -9.875073 5.339433e-23 KDM5D 132.117709 -10.504537 1.069489 -9.822019 9.051192e-23 USP9Y 208.307283 -10.440421 1.086045 -9.613253 7.029269e-22 ZFY 87.153393 -9.905218 1.071056 -9.248081 2.285583e-20 AC244213.1 67.131335 -9.527208 1.097804 -8.678421 4.013006e-18 TTTY14 61.185802 -9.402545 1.084608 -8.669075 4.356434e-18 TMSB4Y 39.206868 -8.755393 1.075498 -8.140781 3.927369e-16 LINC00278 27.826253 -8.258319 1.099821 -7.508787 5.967764e-14 AC006157.1 19.899699 -7.768185 1.090473 -7.123686 1.050784e-12 TTTY10 12.079865 -6.316182 1.116495 -5.657152 1.539057e-08 AC010889.1 5.644311 -5.964417 1.227109 -4.860542 1.170648e-06 SCGB3A1 110.430089 -3.424882 0.544527 -6.289648 3.181862e-10 TPSB2 63.396259 -2.604417 0.560631 -4.645512 3.392345e-06 padj featurekey RPS4Y1 5.055152e-49 EIF1AY 3.045394e-33 DDX3Y 3.673258e-26 UTY 2.407091e-21 PRKY 1.514263e-19 KDM5D 2.200215e-19 USP9Y 1.495125e-18 ZFY 4.321275e-17 AC244213.1 6.739007e-15 TTTY14 6.739007e-15 TMSB4Y 5.569009e-13 LINC00278 7.811343e-11 AC006157.1 1.277153e-09 TTTY10 1.540506e-05 AC010889.1 1.048408e-03 SCGB3A1 3.609505e-07 TPSB2 2.748769e-03 }
pg.pseudo.write_results_to_excel(markers, 'test.de.xlsx')
2022-04-10 21:52:47,568 - pegasus.pseudo.convenient - INFO - Excel spreadsheet is written. 2022-04-10 21:52:47,569 - pegasus.pseudo.convenient - INFO - Function 'write_results_to_excel' finished in 0.04s.
pg.pseudo.volcano(pseudo)
pg.fgsea(pseudo, 'log2FoldChange', 'canonical_pathways', 'deseq2', fgsea_key = 'fgsea_deseq2')
2022-04-10 21:52:49,623 - pegasus.tools.utils - INFO - Loaded signatures from GMT file /home/yiming/GitHub/pegasus/pegasus/data_files/c2.cp.v7.5.1.symbols.gmt. 2022-04-10 21:52:53,051 - pegasus.tools.fgsea - INFO - Function 'fgsea' finished in 3.63s.
The code above runs GSEA analysis based on log2 fold-change values, and use one of the preset gene sets that Pegasus provides:
canonical_pathways
: The MsigDB C2/CP gene set.hallmark
: The MsigDB H gene set.
Notice that this function is applicable to DE results from both single-cell and pseudobulk data.For details on fgsea
function, please see its documentation.
Finally, generate the GSEA histograms by plot_gsea
function. In general, there will be 2 panels:
pg.plot_gsea(pseudo, 'fgsea_deseq2')
As shown in the plot above, for this pseudobulk data, there is no up-regulated (i.e. female significant) pathways, while some down-regulated (i.e. male significant) pathways are observed.