Enrichment of trait heritability in gene modules

Goal

To evaluate the enrichment of trait heritability in gene modules by stratified LD score regression (S_LDSC).

Perform S-LDSC

I compiled the workflow to perform S-LDSc into a snakemake file. Specifically, there are a few main steps,

Prepare gene sets for each module.
Make annotations for genes in each module by using SNPs within genes and an additional buffer region (100kb) around genes.
Calcualte LD scores for annotations.
Reformat GWAS sum stats.
Calculate the partitioned heritability enrichment for each pair of (module, trait).

Nmodule=166

module_seq=list(range(1, Nmodule+1))
module_seq.remove(4)
module_seq.remove(66)

MODULE=['M' + str(x) for x in module_seq]
CHR=list(range(1, 23))


path='/project2/xuanyao/llw/coloc/ukbb_coloc_blood_traits/ukbb_blood_traits.csv'
with open(path) as f:
    lines = [x.split(",")[1] for x in f]
trait_seq=lines[1:]


rule all:
  input:
    expand('h2_enrich_par/{gwasPhenocode}_{module}_baseline.results', module=MODULE, gwasPhenocode=trait_seq)


rule prep_gene_set:
  input:
    file_gene_meta='/project2/xuanyao/llw/DGN_no_filter_on_mappability/result/gene.meta.txt'
    file_coexp_module='/project2/xuanyao/llw/DGN_no_filter_on_mappability/result/coexp.module.rds'
  output:
    expand('geneset/{module}.GeneSet', module=MODULE)
  script:
    '/home/liliw1/Trans/ldsc/1_prep_gene_set.R'


rule make_annot:
  input:
    expand('geneset/{module}.GeneSet', module=MODULE)
  output:
    expand('ldsc_annot/{module}.{chr}.annot.gz', chr=CHR, allow_missing=True)
  shell:
    """
    source activate ldsc
    bash /home/liliw1/Trans/ldsc/2_make_annot.sh {wildcards.module}
    """

rule ldsc_annot:
  input:
    expand('ldsc_annot/{module}.{chr}.annot.gz', chr=CHR, allow_missing=True)
  output:
    expand('ldsc_annot/{module}.{chr}.l2.ldscore.gz', chr=CHR, allow_missing=True)
  shell:
    """
    source activate ldsc
    bash /home/liliw1/Trans/ldsc/3_ldsc_annot.sh {wildcards.module}
    """

rule prep_gwas:
  input:
  params:
    gwasPhenocode='{gwasPhenocode}'
  output:
    'gwas/{gwasPhenocode}.tsv.gz'
  script:
    '/home/liliw1/Trans/ldsc/5_1_prep_gwas.R'

rule convert_gwas:
  input:
    'gwas/{gwasPhenocode}.tsv.gz'
  output:
    'gwas/{gwasPhenocode}.sumstats.gz'
  shell:
    """
    source activate ldsc
    bash /home/liliw1/Trans/ldsc/5_2_convert_gwas.sh {wildcards.gwasPhenocode}
    """

rule par_h2:
  input:
    expand('ldsc_annot/{module}.{chr}.annot.gz', chr=CHR, allow_missing=True),
    expand('ldsc_annot/{module}.{chr}.l2.ldscore.gz', chr=CHR, allow_missing=True),
    expand('gwas/{gwasPhenocode}.sumstats.gz', gwasPhenocode=trait_seq)
  output:
    expand('h2_enrich_par/{gwasPhenocode}_{module}_baseline.results', gwasPhenocode=trait_seq, allow_missing=True)
  shell:
    """
    source activate ldsc
    bash /home/liliw1/Trans/ldsc/6_1_ldsc_h2.sh {wildcards.module}
    """

Assemble heritability enrichment in a module across all traits

Next, I compiled the S-LDSC results of a module for all traits into one file. In total, there are 166 modules. Here, I take one module (module 4) for an example.

# paras and I/O -----
module <- 4

files_par_h2 <- list.files('/project2/xuanyao/llw/ldsc/h2_enrich_par', paste0("^\\d+_M", module, "_baseline.results"), full.names = TRUE)
file_gwasTraitInfo <- '/project2/xuanyao/llw/coloc/ukbb_coloc_blood_traits/ukbb_blood_traits.csv'

# read data -----
par_h2 <- map_dfr(
  files_par_h2,
  ~fread(cmd = paste("sed -n -e 1p -e 99p", .x), sep = "\t")
)
if(!all(par_h2$Category == "L2_1")) stop("Not all extracted rows are from custom annotation.")

gwasTraitInfo <- fread(file_gwasTraitInfo, sep = ",", header = TRUE)


# re-arrange data -----
# add annotation name & trait info
par_h2 <- par_h2 %>%
  mutate(Category = str_extract(basename(!!files_par_h2), paste0("^\\d+_M", !!module))) %>%
  separate(Category, c("trait_id", "module"), sep = "_", remove = FALSE, convert = TRUE) %>%
  separate(module, c(NA, "module"), sep = "M", convert = TRUE) %>%
  left_join(gwasTraitInfo, by = c("trait_id" = "GWAS ID")) %>%
  relocate(`GWAS Group`, `GWAS Trait`, `Trait Abbreviation`, .after = trait_id)

The enrichment of module 4 across traits,

filter(par_h2, complete.cases(par_h2)) %>%
  select(trait_id, `GWAS Group`, `GWAS Trait`, Enrichment, Enrichment_std_error, Enrichment_p) %>%
  arrange(Enrichment_p) %>%
  knitr::kable()

trait_id	GWAS Group	GWAS Trait	Enrichment	Enrichment_std_error	Enrichment_p
30100	Platelets	Mean platelet volume	6.6765352	1.2648330	0.0000122
30080	Platelets	Platelet count	3.9722416	0.6876781	0.0000149
30110	Platelets	Platelet distribution width	6.4866017	1.3775005	0.0000703
30000	White blood cells	White blood cell count	1.7044407	0.2371545	0.0031285
30090	Platelets	Platelet crit	2.6319136	0.5896392	0.0054441
30070	Red blood cells	Red blood cell distribution width	2.4668573	0.5518419	0.0086231
30180	White blood cells	Lymphocyte percentage	1.9901747	0.3821117	0.0096199
30140	White blood cells	Neutrophill count	1.7197007	0.2894397	0.0136088
30010	Red blood cells	Red blood cell count	2.0065541	0.4148286	0.0150241
30120	White blood cells	Lymphocyte count	1.8146372	0.3372063	0.0177686
30250	Red blood cells	Reticulocyte count	2.7844407	0.7293367	0.0202140
30200	White blood cells	Neutrophill percentage	1.9836064	0.4284491	0.0209660
30240	Red blood cells	Reticulocyte percentage	2.9140927	0.7922800	0.0219897
30300	Red blood cells	High light scatter reticulocyte count	2.6861645	0.7292037	0.0274268
30290	Red blood cells	High light scatter reticulocyte percentage	2.7459781	0.7675678	0.0300567
30020	Red blood cells	Haemoglobin concentration	1.6980012	0.3276797	0.0317306
30270	Red blood cells	Mean sphered cell volume	1.9486126	0.4529099	0.0363589
30030	Red blood cells	Haematocrit percentage	1.7133299	0.3475986	0.0385278
30260	Red blood cells	Mean reticulocyte volume	2.2668847	0.6506316	0.0490014
30050	Red blood cells	Mean corpuscular haemoglobin	2.7048509	0.8835458	0.0561511
30040	Red blood cells	Mean corpuscular volume	2.5563837	0.8236809	0.0617733
30280	Red blood cells	Immature reticulocyte fraction	2.7713588	0.9290462	0.0666886
30190	White blood cells	Monocyte percentage	1.4886842	0.3374385	0.1352258
30130	White blood cells	Monocyte count	1.3673806	0.2700425	0.1624942
30060	Red blood cells	Mean corpuscular haemoglobin concentration	1.7191593	0.6399688	0.2597777
30210	White blood cells	Eosinophill percentage	0.7715102	0.2304061	0.3313856
30150	White blood cells	Eosinophill count	0.8082552	0.2608841	0.4680572
30220	White blood cells	Basophill percentage	1.1889507	0.4896314	0.6946688
30160	White blood cells	Basophill count	0.8833706	0.4909206	0.8118197

Session info

sessionInfo()

## R version 4.1.2 (2021-11-01)
## Platform: x86_64-conda-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
## 
## Matrix products: default
## BLAS/LAPACK: /scratch/midway2/liliw1/conda_env/rstudio-server/lib/libopenblasp-r0.3.18.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] data.table_1.14.2 forcats_0.5.1     stringr_1.4.0     dplyr_1.0.7      
##  [5] purrr_0.3.4       readr_2.1.2       tidyr_1.2.0       tibble_3.1.7     
##  [9] ggplot2_3.3.6     tidyverse_1.3.1  
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.1.1 xfun_0.29        bslib_0.3.1      haven_2.4.3     
##  [5] colorspace_2.0-3 vctrs_0.4.1      generics_0.1.2   htmltools_0.5.2 
##  [9] yaml_2.2.2       utf8_1.2.2       rlang_1.0.3      jquerylib_0.1.4 
## [13] pillar_1.7.0     withr_2.5.0      glue_1.6.2       DBI_1.1.3       
## [17] dbplyr_2.1.1     modelr_0.1.8     readxl_1.3.1     lifecycle_1.0.1 
## [21] cellranger_1.1.0 munsell_0.5.0    gtable_0.3.0     rvest_1.0.2     
## [25] evaluate_0.14    knitr_1.37       tzdb_0.2.0       fastmap_1.1.0   
## [29] fansi_1.0.3      highr_0.9        broom_0.7.12     Rcpp_1.0.8.3    
## [33] backports_1.4.1  scales_1.2.0     jsonlite_1.7.3   fs_1.5.2        
## [37] hms_1.1.1        digest_0.6.29    stringi_1.7.6    grid_4.1.2      
## [41] cli_3.3.0        tools_4.1.2      magrittr_2.0.3   sass_0.4.0      
## [45] crayon_1.5.1     pkgconfig_2.0.3  ellipsis_0.3.2   xml2_1.3.3      
## [49] reprex_2.0.1     lubridate_1.8.0  assertthat_0.2.1 rmarkdown_2.11  
## [53] httr_1.4.2       rstudioapi_0.13  R6_2.5.1         compiler_4.1.2