Analyze RNA-seq dataset by trans-PCO pipeline

Goal

To search for trans-eQTLs associated with gene co-expression modules in a real RNA-seq dataset (Battle et al. 2014) using trans-PCO pipeline.

Main steps

The workflow is written in snakemake (see bottom code). There are a few main steps, including,

Regress covariates out of gene expression profiles.
Cluster genes into gene modules with highly co-expressed genes.
Calculate z-scores of SNPs across the genome for all genes by tensorQTL.
Calculate p-values of each pair of (a SNP, a gene module) across all SNPs and gene modules using PCO test.
Permute sample labels to obtain the empirical null p-value distribution, in order to control FDR (<10%) among p-values calculated in last step.
Identify significant (trans-eQTLs, gene module) pairs.

Snakemake workflow

Steps 1-2,

configfile: "config.yaml"

rule all:
  input: 'result/'+config['file_coexp_module']


rule covariates:
  input:
    file_gene_annotation=config['dir_mappability']+config['file_gene_annotation'],
    file_mappability=config['dir_mappability']+config['file_mappability'],
    file_cross_mappability=config['dir_mappability']+config['file_cross_mappability'],
    file_covariates=config['dir_expression']+config['file_covariates'],
    file_ex=config['dir_expression']+config['file_ex']
  output:
    file_gene_meta='result/'+config['file_gene_meta'],
    file_ex_var_regressed='result/'+config['file_ex_var_regressed']
  script:
    'script/'+config['script_covariates']


rule coexp_module:
  input:
    file_ex_var_regressed='result/'+config['file_ex_var_regressed']
  output:
    file_coexp_module='result/'+config['file_coexp_module'],
    file_Nmodule='result/Nmodule.txt'
  params: minModuleSize=config['minModuleSize']
  script:
    'script/'+config['script_coexp_module']

Steps 3-6,

configfile: "config.yaml"

path='result/Nmodule.txt'
with open(path) as f:
    lines = [x.rstrip() for x in f]
Nmodule=int(lines[0])

MODULE=list(range(1, Nmodule+1))
CHRS=list(range(1, config['Nchr']+1))
PERM=list(range(1, config['Nperm']+1))
fdr_thre_chr_module=config['fdr_level']

rule all:
  input:
    #expand('p/p.module{module}.chr{chr}.rds', module=MODULE, chr=CHRS)
    'postanalysis/indep.signals.chr.module.perm'+str(config['Nperm'])+'.txt'

rule prep_bed:
  input:
    file_covariates=config['dir_expression']+config['file_covariates'],
    file_gene_meta='result/'+config['file_gene_meta'],
    file_coexp_module='result/'+config['file_coexp_module'],
    file_ex=config['dir_expression']+config['file_ex']
  output:
    file_expression=expand('result/expression.module{module}.bed.gz', module=MODULE)
  params:
    data_type='obs'
  script:
    'script/'+config['script_prep_bed']

rule z:
  input:
    expression_bed='result/expression.module{module}.bed.gz',
    file_covariates=config['dir_expression']+config['file_covariates']
  output:
    file_z='z/z.module{module}.chr{chr}.txt.gz'
  params:
    plink_prefix_path=config['dir_geno']+config['geno_prefix']+'{chr}'+config['geno_suffix'],
    prefix='module{module}.chr{chr}',
    dir_script='script/'
  shell: 'bash '+'script/'+config['script_z']+' {params.plink_prefix_path} {input.expression_bed} {input.file_covariates} {params.prefix} {params.dir_script} {output.file_z}'

rule p:
  input:
    file_ex_var_regressed='result/'+config['file_ex_var_regressed'],
    file_gene_meta='result/'+config['file_gene_meta'],
    file_coexp_module='result/'+config['file_coexp_module'],
    file_z='z/z.module{module}.chr{chr}.txt.gz'
  output:
    file_p='p/p.module{module}.chr{chr}.rds'
  params:
    dir_script='script/', chr='{chr}', module='{module}'
  script:
    'script/'+config['script_p']

rule prep_bed_null:
  input:
    file_covariates=config['dir_expression']+config['file_covariates'],
    file_gene_meta='result/'+config['file_gene_meta'],
    file_coexp_module='result/'+config['file_coexp_module'],
    file_ex=config['dir_expression']+config['file_ex']
  output:
    file_expression=temp(expand('result/expression.null.module{module}.perm{perm}.bed.gz', module=MODULE, allow_missing=True)),
    file_covariates_null=temp('result/covariates.null.perm{perm}.txt')
  params:
    data_type='null'
  script:
    'script/'+config['script_prep_bed']


rule z_null:
  input:
    expression_bed='result/expression.null.module{module}.perm{perm}.bed.gz',
    file_covariates='result/covariates.null.perm{perm}.txt'
  output:
    file_z=temp('z/z.null.module{module}.chr{chr}.perm{perm}.txt.gz')
  params:
    plink_prefix_path=config['dir_geno']+config['geno_prefix']+'{chr}'+config['geno_suffix'],
    prefix='module{module}.chr{chr}.perm{perm}.null',
    dir_script='script/'
  shell: 'bash script/'+config['script_z']+' {params.plink_prefix_path} {input.expression_bed} {input.file_covariates} {params.prefix} {params.dir_script} {output.file_z}'


rule p_null:
  input:
    file_ex_var_regressed='result/'+config['file_ex_var_regressed'],
    file_gene_meta='result/'+config['file_gene_meta'],
    file_coexp_module='result/'+config['file_coexp_module'],
    file_z='z/z.null.module{module}.chr{chr}.perm{perm}.txt.gz'
  output:
    file_p='p/p.null.module{module}.chr{chr}.perm{perm}.rds'
  params:
    dir_script='script/', chr='{chr}', module='{module}'
  script:
    'script/'+config['script_p']


rule FDR_chr_module:
  input:
    file_p=expand('p/p.module{module}.chr{chr}.rds',chr=CHRS,module=MODULE),
    file_p_null=expand('p/p.null.module{module}.chr{chr}.perm{perm}.rds',chr=CHRS,module=MODULE, allow_missing=True)
  output:
    file_q='FDR/q.chr.module.perm{perm}.rds'
  script:
    'script/'+config['script_q']

rule average_perm_chr_module:
  input:
    file_q=expand('FDR/q.chr.module.perm{perm}.rds', perm=PERM)
  output:
    file_signals='FDR/signals.chr.module.perm'+str(config['Nperm'])+'.txt'
  params:
    fdr_thre=fdr_thre_chr_module
  script:
    'script/'+config['script_average_perm']


rule post_chr_module:
  input: sig='FDR/signals.chr.module.perm'+str(config['Nperm'])+'.txt'
  output: sig_uniq='postanalysis/LD.prun.in.chr.module.perm'+str(config['Nperm'])+'.txt', sig_indp='postanalysis/indep.signals.chr.module.perm'+str(config['Nperm'])+'.txt'
  params: dir_geno=config['dir_geno'], geno_prefix=config['geno_prefix'], geno_suffix=config['geno_suffix']
  shell: 'bash script/'+config['script_post']+' {input.sig} {output.sig_uniq} {output.sig_indp} {params.dir_geno} {params.geno_prefix} {params.geno_suffix}'

Session info

sessionInfo()

## R version 4.1.2 (2021-11-01)
## Platform: x86_64-conda-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
## 
## Matrix products: default
## BLAS/LAPACK: /scratch/midway2/liliw1/conda_env/rstudio-server/lib/libopenblasp-r0.3.18.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] forcats_0.5.1   stringr_1.4.0   dplyr_1.0.7     purrr_0.3.4    
## [5] readr_2.1.2     tidyr_1.2.0     tibble_3.1.7    ggplot2_3.3.6  
## [9] tidyverse_1.3.1
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.1.1 xfun_0.29        bslib_0.3.1      haven_2.4.3     
##  [5] colorspace_2.0-3 vctrs_0.4.1      generics_0.1.2   htmltools_0.5.2 
##  [9] yaml_2.2.2       utf8_1.2.2       rlang_1.0.3      jquerylib_0.1.4 
## [13] pillar_1.7.0     withr_2.5.0      glue_1.6.2       DBI_1.1.3       
## [17] dbplyr_2.1.1     modelr_0.1.8     readxl_1.3.1     lifecycle_1.0.1 
## [21] cellranger_1.1.0 munsell_0.5.0    gtable_0.3.0     rvest_1.0.2     
## [25] evaluate_0.14    knitr_1.37       tzdb_0.2.0       fastmap_1.1.0   
## [29] fansi_1.0.3      broom_0.7.12     Rcpp_1.0.8.3     backports_1.4.1 
## [33] scales_1.2.0     jsonlite_1.7.3   fs_1.5.2         hms_1.1.1       
## [37] digest_0.6.29    stringi_1.7.6    grid_4.1.2       cli_3.3.0       
## [41] tools_4.1.2      magrittr_2.0.3   sass_0.4.0       crayon_1.5.1    
## [45] pkgconfig_2.0.3  ellipsis_0.3.2   xml2_1.3.3       reprex_2.0.1    
## [49] lubridate_1.8.0  assertthat_0.2.1 rmarkdown_2.11   httr_1.4.2      
## [53] rstudioapi_0.13  R6_2.5.1         compiler_4.1.2

References

Battle, Alexis, Sara Mostafavi, Xiaowei Zhu, James B. Potash, Myrna M. Weissman, Courtney McCormick, Christian D. Haudenschild, et al. 2014. “Characterizing the Genetic Basis of Transcriptome Diversity Through RNA-Sequencing of 922 Individuals.” Genome Research 24 (1): 14–24. https://doi.org/10.1101/gr.155192.113.