Author: Bo Li, Yiming Yang
Date: 2022-04-10
Notebook Source: pseudobulk.ipynb
import pegasus as pg
import pandas as pd
In this tutorial, we'll use a gene-count matrix dataset on human bone marrow from 8 donors, create pseudobulk matrix regarding donors, and perform pseudobulk analysis on the data.
The dataset is stored at https://storage.googleapis.com/terra-featured-workspaces/Cumulus/MantonBM_nonmix_subset.zarr.zip. You can also use gsutil to download it via its Google bucket URL (gs://terra-featured-workspaces/Cumulus/MantonBM_nonmix_subset.zarr.zip).
Now load the count matrix:
data = pg.read_input('MantonBM_nonmix_subset.zarr.zip')
2022-04-10 21:52:30,898 - pegasusio.readwrite - INFO - zarr file 'MantonBM_nonmix_subset.zarr.zip' is loaded. 2022-04-10 21:52:30,899 - pegasusio.readwrite - INFO - Function 'read_input' finished in 0.38s.
Pegasus provides pseudobulk function to generate pseudobulk matrix. The code below generate a pseudobulk count matrix regarding donors (Channel), and transfer the gender attribute to the resulting pseudobulks:
pseudo = pg.pseudobulk(data, 'Channel', 'gender')
pseudo
2022-04-10 21:52:31,157 - pegasus.tools.pseudobulk - INFO - Function 'pseudobulk' finished in 0.25s.
UnimodalData object with n_obs x n_vars = 8 x 36601
Genome: Channel; Modality: pseudobulk
It contains 1 matrix: 'X'
It currently binds to matrix 'X' as X
obs: 'gender'
var: 'featureid'
uns: 'genome', 'modality'
For details on pseudobulk function, please see its documentation.
Pegasus has deseq2 function to perform DE analysis on Pseudobulk data, which is a Python wrapper of DESeq2 package in R (You need to first install the original R package). The code below analyzes based on a regression model considering the gender attribute, and estimates the contrast between female and male:
pg.deseq2(pseudo, '~gender', ('gender', 'female', 'male'))
R[write to console]: estimating size factors R[write to console]: estimating dispersions R[write to console]: gene-wise dispersion estimates R[write to console]: mean-dispersion relationship R[write to console]: final dispersion estimates R[write to console]: fitting model and testing
2022-04-10 21:52:47,487 - pegasus.tools.pseudobulk - INFO - Function 'deseq2' finished in 16.32s.
The DE result is a Pandas DataFrame object stored in pseudo.varm["deseq2"], which could be viewed as the following (ranked by log2 fold-change):
pd.DataFrame(pseudo.varm["deseq2"], index=pseudo.var_names)
| baseMean | log2FoldChange | lfcSE | stat | pvalue | padj | |
|---|---|---|---|---|---|---|
| featurekey | ||||||
| MIR1302-2HG | 0.113807 | 0.696538 | 3.533812 | 0.197107 | 0.843744 | 1.00000 |
| FAM138A | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.00000 |
| OR4F5 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.00000 |
| AL627309.1 | 3.131268 | 0.239777 | 0.913821 | 0.262389 | 0.793022 | 0.99985 |
| AL627309.3 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.00000 |
| ... | ... | ... | ... | ... | ... | ... |
| AC141272.1 | 0.113807 | 0.696538 | 3.533812 | 0.197107 | 0.843744 | 1.00000 |
| AC023491.2 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.00000 |
| AC007325.1 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.00000 |
| AC007325.4 | 68.286678 | -0.911313 | 0.724284 | -1.258226 | 0.208310 | 0.99985 |
| AC007325.2 | 0.121696 | -0.746135 | 3.533812 | -0.211142 | 0.832777 | 1.00000 |
36601 rows × 6 columns
Notice that the fold-change calculated has female be the numerator, while male be the denominator.
For details on deseq2 function, please see its documentation.
markers = pg.pseudo.markers(pseudo)
print(markers)
{'up': baseMean log2FoldChange lfcSE stat pvalue \
featurekey
XIST 1910.572444 13.615371 1.043313 13.050131 6.343849e-39
PPP1R2C 14.765863 7.317893 1.165723 6.277559 3.439292e-10
LINC00221 9.121555 5.188543 1.230549 4.216444 2.481844e-05
TSIX 13.654597 3.889226 0.725102 5.363692 8.153793e-08
KANTR 88.919116 1.248766 0.289454 4.314216 1.601703e-05
BX890604.2 297.379563 1.233275 0.260909 4.726846 2.280338e-06
JPX 1129.585486 0.888090 0.198601 4.471729 7.758989e-06
padj
featurekey
XIST 5.397347e-35
PPP1R2C 3.657687e-07
LINC00221 1.759627e-02
TSIX 7.708052e-05
KANTR 1.184982e-02
BX890604.2 1.940112e-03
JPX 6.001225e-03 , 'down': baseMean log2FoldChange lfcSE stat pvalue \
featurekey
RPS4Y1 6547.060706 -16.138498 1.050588 -15.361401 2.970823e-53
EIF1AY 1316.973725 -13.825299 1.087960 -12.707548 5.369171e-37
DDX3Y 451.442937 -12.280059 1.083214 -11.336691 8.634834e-30
UTY 195.114474 -11.068404 1.074641 -10.299634 7.073023e-25
PRKY 152.182688 -10.709941 1.084543 -9.875073 5.339433e-23
KDM5D 132.117709 -10.504537 1.069489 -9.822019 9.051192e-23
USP9Y 208.307283 -10.440421 1.086045 -9.613253 7.029269e-22
ZFY 87.153393 -9.905218 1.071056 -9.248081 2.285583e-20
AC244213.1 67.131335 -9.527208 1.097804 -8.678421 4.013006e-18
TTTY14 61.185802 -9.402545 1.084608 -8.669075 4.356434e-18
TMSB4Y 39.206868 -8.755393 1.075498 -8.140781 3.927369e-16
LINC00278 27.826253 -8.258319 1.099821 -7.508787 5.967764e-14
AC006157.1 19.899699 -7.768185 1.090473 -7.123686 1.050784e-12
TTTY10 12.079865 -6.316182 1.116495 -5.657152 1.539057e-08
AC010889.1 5.644311 -5.964417 1.227109 -4.860542 1.170648e-06
SCGB3A1 110.430089 -3.424882 0.544527 -6.289648 3.181862e-10
TPSB2 63.396259 -2.604417 0.560631 -4.645512 3.392345e-06
padj
featurekey
RPS4Y1 5.055152e-49
EIF1AY 3.045394e-33
DDX3Y 3.673258e-26
UTY 2.407091e-21
PRKY 1.514263e-19
KDM5D 2.200215e-19
USP9Y 1.495125e-18
ZFY 4.321275e-17
AC244213.1 6.739007e-15
TTTY14 6.739007e-15
TMSB4Y 5.569009e-13
LINC00278 7.811343e-11
AC006157.1 1.277153e-09
TTTY10 1.540506e-05
AC010889.1 1.048408e-03
SCGB3A1 3.609505e-07
TPSB2 2.748769e-03 }
pg.pseudo.write_results_to_excel(markers, 'test.de.xlsx')
2022-04-10 21:52:47,568 - pegasus.pseudo.convenient - INFO - Excel spreadsheet is written. 2022-04-10 21:52:47,569 - pegasus.pseudo.convenient - INFO - Function 'write_results_to_excel' finished in 0.04s.
pg.pseudo.volcano(pseudo)
pg.fgsea(pseudo, 'log2FoldChange', 'canonical_pathways', 'deseq2', fgsea_key = 'fgsea_deseq2')
2022-04-10 21:52:49,623 - pegasus.tools.utils - INFO - Loaded signatures from GMT file /home/yiming/GitHub/pegasus/pegasus/data_files/c2.cp.v7.5.1.symbols.gmt. 2022-04-10 21:52:53,051 - pegasus.tools.fgsea - INFO - Function 'fgsea' finished in 3.63s.
The code above runs GSEA analysis based on log2 fold-change values, and use one of the preset gene sets that Pegasus provides:
canonical_pathways: The MsigDB C2/CP gene set.hallmark: The MsigDB H gene set.
Notice that this function is applicable to DE results from both single-cell and pseudobulk data.For details on fgsea function, please see its documentation.
Finally, generate the GSEA histograms by plot_gsea function. In general, there will be 2 panels:
pg.plot_gsea(pseudo, 'fgsea_deseq2')
As shown in the plot above, for this pseudobulk data, there is no up-regulated (i.e. female significant) pathways, while some down-regulated (i.e. male significant) pathways are observed.