Goal
To search for trans-eQTLs associated with gene co-expression modules in a real RNA-seq dataset (Battle et al. 2014) using trans-PCO pipeline.
Main steps
The workflow is written in snakemake (see bottom code). There are a few main steps, including,
Regress covariates out of gene expression profiles.
Cluster genes into gene modules with highly co-expressed genes.
Calculate z-scores of SNPs across the genome for all genes by tensorQTL.
Calculate p-values of each pair of (a SNP, a gene module) across all SNPs and gene modules using PCO test.
Permute sample labels to obtain the empirical null p-value distribution, in order to control FDR (<10%) among p-values calculated in last step.
Identify significant (trans-eQTLs, gene module) pairs.
Snakemake workflow
configfile: "config.yaml"
rule all:
input: 'result/'+config['file_coexp_module']
rule covariates:
input:
file_gene_annotation=config['dir_mappability']+config['file_gene_annotation'],
file_mappability=config['dir_mappability']+config['file_mappability'],
file_cross_mappability=config['dir_mappability']+config['file_cross_mappability'],
file_covariates=config['dir_expression']+config['file_covariates'],
file_ex=config['dir_expression']+config['file_ex']
output:
file_gene_meta='result/'+config['file_gene_meta'],
file_ex_var_regressed='result/'+config['file_ex_var_regressed']
script:
'script/'+config['script_covariates']
rule coexp_module:
input:
file_ex_var_regressed='result/'+config['file_ex_var_regressed']
output:
file_coexp_module='result/'+config['file_coexp_module'],
file_Nmodule='result/Nmodule.txt'
params: minModuleSize=config['minModuleSize']
script:
'script/'+config['script_coexp_module']
configfile: "config.yaml"
path='result/Nmodule.txt'
with open(path) as f:
lines = [x.rstrip() for x in f]
Nmodule=int(lines[0])
MODULE=list(range(1, Nmodule+1))
CHRS=list(range(1, config['Nchr']+1))
PERM=list(range(1, config['Nperm']+1))
fdr_thre_chr_module=config['fdr_level']
rule all:
input:
#expand('p/p.module{module}.chr{chr}.rds', module=MODULE, chr=CHRS)
'postanalysis/indep.signals.chr.module.perm'+str(config['Nperm'])+'.txt'
rule prep_bed:
input:
file_covariates=config['dir_expression']+config['file_covariates'],
file_gene_meta='result/'+config['file_gene_meta'],
file_coexp_module='result/'+config['file_coexp_module'],
file_ex=config['dir_expression']+config['file_ex']
output:
file_expression=expand('result/expression.module{module}.bed.gz', module=MODULE)
params:
data_type='obs'
script:
'script/'+config['script_prep_bed']
rule z:
input:
expression_bed='result/expression.module{module}.bed.gz',
file_covariates=config['dir_expression']+config['file_covariates']
output:
file_z='z/z.module{module}.chr{chr}.txt.gz'
params:
plink_prefix_path=config['dir_geno']+config['geno_prefix']+'{chr}'+config['geno_suffix'],
prefix='module{module}.chr{chr}',
dir_script='script/'
shell: 'bash '+'script/'+config['script_z']+' {params.plink_prefix_path} {input.expression_bed} {input.file_covariates} {params.prefix} {params.dir_script} {output.file_z}'
rule p:
input:
file_ex_var_regressed='result/'+config['file_ex_var_regressed'],
file_gene_meta='result/'+config['file_gene_meta'],
file_coexp_module='result/'+config['file_coexp_module'],
file_z='z/z.module{module}.chr{chr}.txt.gz'
output:
file_p='p/p.module{module}.chr{chr}.rds'
params:
dir_script='script/', chr='{chr}', module='{module}'
script:
'script/'+config['script_p']
rule prep_bed_null:
input:
file_covariates=config['dir_expression']+config['file_covariates'],
file_gene_meta='result/'+config['file_gene_meta'],
file_coexp_module='result/'+config['file_coexp_module'],
file_ex=config['dir_expression']+config['file_ex']
output:
file_expression=temp(expand('result/expression.null.module{module}.perm{perm}.bed.gz', module=MODULE, allow_missing=True)),
file_covariates_null=temp('result/covariates.null.perm{perm}.txt')
params:
data_type='null'
script:
'script/'+config['script_prep_bed']
rule z_null:
input:
expression_bed='result/expression.null.module{module}.perm{perm}.bed.gz',
file_covariates='result/covariates.null.perm{perm}.txt'
output:
file_z=temp('z/z.null.module{module}.chr{chr}.perm{perm}.txt.gz')
params:
plink_prefix_path=config['dir_geno']+config['geno_prefix']+'{chr}'+config['geno_suffix'],
prefix='module{module}.chr{chr}.perm{perm}.null',
dir_script='script/'
shell: 'bash script/'+config['script_z']+' {params.plink_prefix_path} {input.expression_bed} {input.file_covariates} {params.prefix} {params.dir_script} {output.file_z}'
rule p_null:
input:
file_ex_var_regressed='result/'+config['file_ex_var_regressed'],
file_gene_meta='result/'+config['file_gene_meta'],
file_coexp_module='result/'+config['file_coexp_module'],
file_z='z/z.null.module{module}.chr{chr}.perm{perm}.txt.gz'
output:
file_p='p/p.null.module{module}.chr{chr}.perm{perm}.rds'
params:
dir_script='script/', chr='{chr}', module='{module}'
script:
'script/'+config['script_p']
rule FDR_chr_module:
input:
file_p=expand('p/p.module{module}.chr{chr}.rds',chr=CHRS,module=MODULE),
file_p_null=expand('p/p.null.module{module}.chr{chr}.perm{perm}.rds',chr=CHRS,module=MODULE, allow_missing=True)
output:
file_q='FDR/q.chr.module.perm{perm}.rds'
script:
'script/'+config['script_q']
rule average_perm_chr_module:
input:
file_q=expand('FDR/q.chr.module.perm{perm}.rds', perm=PERM)
output:
file_signals='FDR/signals.chr.module.perm'+str(config['Nperm'])+'.txt'
params:
fdr_thre=fdr_thre_chr_module
script:
'script/'+config['script_average_perm']
rule post_chr_module:
input: sig='FDR/signals.chr.module.perm'+str(config['Nperm'])+'.txt'
output: sig_uniq='postanalysis/LD.prun.in.chr.module.perm'+str(config['Nperm'])+'.txt', sig_indp='postanalysis/indep.signals.chr.module.perm'+str(config['Nperm'])+'.txt'
params: dir_geno=config['dir_geno'], geno_prefix=config['geno_prefix'], geno_suffix=config['geno_suffix']
shell: 'bash script/'+config['script_post']+' {input.sig} {output.sig_uniq} {output.sig_indp} {params.dir_geno} {params.geno_prefix} {params.geno_suffix}'
Session info
sessionInfo()
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-conda-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
##
## Matrix products: default
## BLAS/LAPACK: /scratch/midway2/liliw1/conda_env/rstudio-server/lib/libopenblasp-r0.3.18.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.7 purrr_0.3.4
## [5] readr_2.1.2 tidyr_1.2.0 tibble_3.1.7 ggplot2_3.3.6
## [9] tidyverse_1.3.1
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.1.1 xfun_0.29 bslib_0.3.1 haven_2.4.3
## [5] colorspace_2.0-3 vctrs_0.4.1 generics_0.1.2 htmltools_0.5.2
## [9] yaml_2.2.2 utf8_1.2.2 rlang_1.0.3 jquerylib_0.1.4
## [13] pillar_1.7.0 withr_2.5.0 glue_1.6.2 DBI_1.1.3
## [17] dbplyr_2.1.1 modelr_0.1.8 readxl_1.3.1 lifecycle_1.0.1
## [21] cellranger_1.1.0 munsell_0.5.0 gtable_0.3.0 rvest_1.0.2
## [25] evaluate_0.14 knitr_1.37 tzdb_0.2.0 fastmap_1.1.0
## [29] fansi_1.0.3 broom_0.7.12 Rcpp_1.0.8.3 backports_1.4.1
## [33] scales_1.2.0 jsonlite_1.7.3 fs_1.5.2 hms_1.1.1
## [37] digest_0.6.29 stringi_1.7.6 grid_4.1.2 cli_3.3.0
## [41] tools_4.1.2 magrittr_2.0.3 sass_0.4.0 crayon_1.5.1
## [45] pkgconfig_2.0.3 ellipsis_0.3.2 xml2_1.3.3 reprex_2.0.1
## [49] lubridate_1.8.0 assertthat_0.2.1 rmarkdown_2.11 httr_1.4.2
## [53] rstudioapi_0.13 R6_2.5.1 compiler_4.1.2
References
Battle, Alexis, Sara Mostafavi, Xiaowei Zhu, James B. Potash, Myrna M. Weissman, Courtney McCormick, Christian D. Haudenschild, et al. 2014.
“Characterizing the Genetic Basis of Transcriptome Diversity Through RNA-Sequencing of 922 Individuals.” Genome Research 24 (1): 14–24.
https://doi.org/10.1101/gr.155192.113.