Pegasus Pseudobulk Analysis¶

Author: Bo Li, Yiming Yang
Date: 2022-04-10
Notebook Source: pseudobulk.ipynb

In [1]:
import pegasus as pg
import pandas as pd

Dataset¶

In this tutorial, we'll use a gene-count matrix dataset on human bone marrow from 8 donors, create pseudobulk matrix regarding donors, and perform pseudobulk analysis on the data.

The dataset is stored at https://storage.googleapis.com/terra-featured-workspaces/Cumulus/MantonBM_nonmix_subset.zarr.zip. You can also use gsutil to download it via its Google bucket URL (gs://terra-featured-workspaces/Cumulus/MantonBM_nonmix_subset.zarr.zip).

Now load the count matrix:

In [2]:
data = pg.read_input('MantonBM_nonmix_subset.zarr.zip')
2022-04-10 21:52:30,898 - pegasusio.readwrite - INFO - zarr file 'MantonBM_nonmix_subset.zarr.zip' is loaded.
2022-04-10 21:52:30,899 - pegasusio.readwrite - INFO - Function 'read_input' finished in 0.38s.

Generate pseudobulk matrix¶

Pegasus provides pseudobulk function to generate pseudobulk matrix. The code below generate a pseudobulk count matrix regarding donors (Channel), and transfer the gender attribute to the resulting pseudobulks:

In [3]:
pseudo = pg.pseudobulk(data, 'Channel', 'gender')
pseudo
2022-04-10 21:52:31,157 - pegasus.tools.pseudobulk - INFO - Function 'pseudobulk' finished in 0.25s.
Out[3]:
UnimodalData object with n_obs x n_vars = 8 x 36601
    Genome: Channel; Modality: pseudobulk
    It contains 1 matrix: 'X'
    It currently binds to matrix 'X' as X

    obs: 'gender'
    var: 'featureid'
    uns: 'genome', 'modality'

For details on pseudobulk function, please see its documentation.

Differential Expression (DE) Analysis on Pseudobulk Matrix¶

Pegasus has deseq2 function to perform DE analysis on Pseudobulk data, which is a Python wrapper of DESeq2 package in R (You need to first install the original R package). The code below analyzes based on a regression model considering the gender attribute, and estimates the contrast between female and male:

In [4]:
pg.deseq2(pseudo, '~gender', ('gender', 'female', 'male'))
R[write to console]: estimating size factors

R[write to console]: estimating dispersions

R[write to console]: gene-wise dispersion estimates

R[write to console]: mean-dispersion relationship

R[write to console]: final dispersion estimates

R[write to console]: fitting model and testing

2022-04-10 21:52:47,487 - pegasus.tools.pseudobulk - INFO - Function 'deseq2' finished in 16.32s.

The DE result is a Pandas DataFrame object stored in pseudo.varm["deseq2"], which could be viewed as the following (ranked by log2 fold-change):

In [5]:
pd.DataFrame(pseudo.varm["deseq2"], index=pseudo.var_names)
Out[5]:
baseMean log2FoldChange lfcSE stat pvalue padj
featurekey
MIR1302-2HG 0.113807 0.696538 3.533812 0.197107 0.843744 1.00000
FAM138A 0.000000 0.000000 0.000000 0.000000 1.000000 1.00000
OR4F5 0.000000 0.000000 0.000000 0.000000 1.000000 1.00000
AL627309.1 3.131268 0.239777 0.913821 0.262389 0.793022 0.99985
AL627309.3 0.000000 0.000000 0.000000 0.000000 1.000000 1.00000
... ... ... ... ... ... ...
AC141272.1 0.113807 0.696538 3.533812 0.197107 0.843744 1.00000
AC023491.2 0.000000 0.000000 0.000000 0.000000 1.000000 1.00000
AC007325.1 0.000000 0.000000 0.000000 0.000000 1.000000 1.00000
AC007325.4 68.286678 -0.911313 0.724284 -1.258226 0.208310 0.99985
AC007325.2 0.121696 -0.746135 3.533812 -0.211142 0.832777 1.00000

36601 rows × 6 columns

Notice that the fold-change calculated has female be the numerator, while male be the denominator.

For details on deseq2 function, please see its documentation.

Get significant DE genes in human-readable format¶

In [6]:
markers = pg.pseudo.markers(pseudo)
print(markers)
{'up':                baseMean  log2FoldChange     lfcSE       stat        pvalue  \
featurekey                                                                   
XIST        1910.572444       13.615371  1.043313  13.050131  6.343849e-39   
PPP1R2C       14.765863        7.317893  1.165723   6.277559  3.439292e-10   
LINC00221      9.121555        5.188543  1.230549   4.216444  2.481844e-05   
TSIX          13.654597        3.889226  0.725102   5.363692  8.153793e-08   
KANTR         88.919116        1.248766  0.289454   4.314216  1.601703e-05   
BX890604.2   297.379563        1.233275  0.260909   4.726846  2.280338e-06   
JPX         1129.585486        0.888090  0.198601   4.471729  7.758989e-06   

                    padj  
featurekey                
XIST        5.397347e-35  
PPP1R2C     3.657687e-07  
LINC00221   1.759627e-02  
TSIX        7.708052e-05  
KANTR       1.184982e-02  
BX890604.2  1.940112e-03  
JPX         6.001225e-03  , 'down':                baseMean  log2FoldChange     lfcSE       stat        pvalue  \
featurekey                                                                   
RPS4Y1      6547.060706      -16.138498  1.050588 -15.361401  2.970823e-53   
EIF1AY      1316.973725      -13.825299  1.087960 -12.707548  5.369171e-37   
DDX3Y        451.442937      -12.280059  1.083214 -11.336691  8.634834e-30   
UTY          195.114474      -11.068404  1.074641 -10.299634  7.073023e-25   
PRKY         152.182688      -10.709941  1.084543  -9.875073  5.339433e-23   
KDM5D        132.117709      -10.504537  1.069489  -9.822019  9.051192e-23   
USP9Y        208.307283      -10.440421  1.086045  -9.613253  7.029269e-22   
ZFY           87.153393       -9.905218  1.071056  -9.248081  2.285583e-20   
AC244213.1    67.131335       -9.527208  1.097804  -8.678421  4.013006e-18   
TTTY14        61.185802       -9.402545  1.084608  -8.669075  4.356434e-18   
TMSB4Y        39.206868       -8.755393  1.075498  -8.140781  3.927369e-16   
LINC00278     27.826253       -8.258319  1.099821  -7.508787  5.967764e-14   
AC006157.1    19.899699       -7.768185  1.090473  -7.123686  1.050784e-12   
TTTY10        12.079865       -6.316182  1.116495  -5.657152  1.539057e-08   
AC010889.1     5.644311       -5.964417  1.227109  -4.860542  1.170648e-06   
SCGB3A1      110.430089       -3.424882  0.544527  -6.289648  3.181862e-10   
TPSB2         63.396259       -2.604417  0.560631  -4.645512  3.392345e-06   

                    padj  
featurekey                
RPS4Y1      5.055152e-49  
EIF1AY      3.045394e-33  
DDX3Y       3.673258e-26  
UTY         2.407091e-21  
PRKY        1.514263e-19  
KDM5D       2.200215e-19  
USP9Y       1.495125e-18  
ZFY         4.321275e-17  
AC244213.1  6.739007e-15  
TTTY14      6.739007e-15  
TMSB4Y      5.569009e-13  
LINC00278   7.811343e-11  
AC006157.1  1.277153e-09  
TTTY10      1.540506e-05  
AC010889.1  1.048408e-03  
SCGB3A1     3.609505e-07  
TPSB2       2.748769e-03  }

Write DE results to spreadsheet¶

In [7]:
pg.pseudo.write_results_to_excel(markers, 'test.de.xlsx')
2022-04-10 21:52:47,568 - pegasus.pseudo.convenient - INFO - Excel spreadsheet is written.
2022-04-10 21:52:47,569 - pegasus.pseudo.convenient - INFO - Function 'write_results_to_excel' finished in 0.04s.

Generate volcano plot¶

In [8]:
pg.pseudo.volcano(pseudo)

Gene Set Enrichment Analysis (GSEA)¶

Pegasus has fgsea function to perform GSEA analysis, which is a Python wrapper of fgsea package in R (You need to first install the original R package):

In [9]:
pg.fgsea(pseudo, 'log2FoldChange', 'canonical_pathways', 'deseq2', fgsea_key = 'fgsea_deseq2')
2022-04-10 21:52:49,623 - pegasus.tools.utils - INFO - Loaded signatures from GMT file /home/yiming/GitHub/pegasus/pegasus/data_files/c2.cp.v7.5.1.symbols.gmt.
2022-04-10 21:52:53,051 - pegasus.tools.fgsea - INFO - Function 'fgsea' finished in 3.63s.

The code above runs GSEA analysis based on log2 fold-change values, and use one of the preset gene sets that Pegasus provides:

  • canonical_pathways: The MsigDB C2/CP gene set.
  • hallmark: The MsigDB H gene set. Notice that this function is applicable to DE results from both single-cell and pseudobulk data.

For details on fgsea function, please see its documentation.

Generate GSEA plots¶

Finally, generate the GSEA histograms by plot_gsea function. In general, there will be 2 panels:

  • Top panel shows the up-regulated pathways in red color.
  • Bottom panel shows the down-regulated pathways in green color.
In [10]:
pg.plot_gsea(pseudo, 'fgsea_deseq2')

As shown in the plot above, for this pseudobulk data, there is no up-regulated (i.e. female significant) pathways, while some down-regulated (i.e. male significant) pathways are observed.