Goal

To evaluate the enrichment of trait heritability in gene modules by stratified LD score regression (S_LDSC).

Perform S-LDSC

I compiled the workflow to perform S-LDSc into a snakemake file. Specifically, there are a few main steps,

  1. Prepare gene sets for each module.

  2. Make annotations for genes in each module by using SNPs within genes and an additional buffer region (100kb) around genes.

  3. Calcualte LD scores for annotations.

  4. Reformat GWAS sum stats.

  5. Calculate the partitioned heritability enrichment for each pair of (module, trait).

Nmodule=166

module_seq=list(range(1, Nmodule+1))
module_seq.remove(4)
module_seq.remove(66)

MODULE=['M' + str(x) for x in module_seq]
CHR=list(range(1, 23))


path='/project2/xuanyao/llw/coloc/ukbb_coloc_blood_traits/ukbb_blood_traits.csv'
with open(path) as f:
    lines = [x.split(",")[1] for x in f]
trait_seq=lines[1:]


rule all:
  input:
    expand('h2_enrich_par/{gwasPhenocode}_{module}_baseline.results', module=MODULE, gwasPhenocode=trait_seq)


rule prep_gene_set:
  input:
    file_gene_meta='/project2/xuanyao/llw/DGN_no_filter_on_mappability/result/gene.meta.txt'
    file_coexp_module='/project2/xuanyao/llw/DGN_no_filter_on_mappability/result/coexp.module.rds'
  output:
    expand('geneset/{module}.GeneSet', module=MODULE)
  script:
    '/home/liliw1/Trans/ldsc/1_prep_gene_set.R'


rule make_annot:
  input:
    expand('geneset/{module}.GeneSet', module=MODULE)
  output:
    expand('ldsc_annot/{module}.{chr}.annot.gz', chr=CHR, allow_missing=True)
  shell:
    """
    source activate ldsc
    bash /home/liliw1/Trans/ldsc/2_make_annot.sh {wildcards.module}
    """

rule ldsc_annot:
  input:
    expand('ldsc_annot/{module}.{chr}.annot.gz', chr=CHR, allow_missing=True)
  output:
    expand('ldsc_annot/{module}.{chr}.l2.ldscore.gz', chr=CHR, allow_missing=True)
  shell:
    """
    source activate ldsc
    bash /home/liliw1/Trans/ldsc/3_ldsc_annot.sh {wildcards.module}
    """

rule prep_gwas:
  input:
  params:
    gwasPhenocode='{gwasPhenocode}'
  output:
    'gwas/{gwasPhenocode}.tsv.gz'
  script:
    '/home/liliw1/Trans/ldsc/5_1_prep_gwas.R'

rule convert_gwas:
  input:
    'gwas/{gwasPhenocode}.tsv.gz'
  output:
    'gwas/{gwasPhenocode}.sumstats.gz'
  shell:
    """
    source activate ldsc
    bash /home/liliw1/Trans/ldsc/5_2_convert_gwas.sh {wildcards.gwasPhenocode}
    """

rule par_h2:
  input:
    expand('ldsc_annot/{module}.{chr}.annot.gz', chr=CHR, allow_missing=True),
    expand('ldsc_annot/{module}.{chr}.l2.ldscore.gz', chr=CHR, allow_missing=True),
    expand('gwas/{gwasPhenocode}.sumstats.gz', gwasPhenocode=trait_seq)
  output:
    expand('h2_enrich_par/{gwasPhenocode}_{module}_baseline.results', gwasPhenocode=trait_seq, allow_missing=True)
  shell:
    """
    source activate ldsc
    bash /home/liliw1/Trans/ldsc/6_1_ldsc_h2.sh {wildcards.module}
    """

Assemble heritability enrichment in a module across all traits

Next, I compiled the S-LDSC results of a module for all traits into one file. In total, there are 166 modules. Here, I take one module (module 4) for an example.

# paras and I/O -----
module <- 4

files_par_h2 <- list.files('/project2/xuanyao/llw/ldsc/h2_enrich_par', paste0("^\\d+_M", module, "_baseline.results"), full.names = TRUE)
file_gwasTraitInfo <- '/project2/xuanyao/llw/coloc/ukbb_coloc_blood_traits/ukbb_blood_traits.csv'

# read data -----
par_h2 <- map_dfr(
  files_par_h2,
  ~fread(cmd = paste("sed -n -e 1p -e 99p", .x), sep = "\t")
)
if(!all(par_h2$Category == "L2_1")) stop("Not all extracted rows are from custom annotation.")

gwasTraitInfo <- fread(file_gwasTraitInfo, sep = ",", header = TRUE)


# re-arrange data -----
# add annotation name & trait info
par_h2 <- par_h2 %>%
  mutate(Category = str_extract(basename(!!files_par_h2), paste0("^\\d+_M", !!module))) %>%
  separate(Category, c("trait_id", "module"), sep = "_", remove = FALSE, convert = TRUE) %>%
  separate(module, c(NA, "module"), sep = "M", convert = TRUE) %>%
  left_join(gwasTraitInfo, by = c("trait_id" = "GWAS ID")) %>%
  relocate(`GWAS Group`, `GWAS Trait`, `Trait Abbreviation`, .after = trait_id)

The enrichment of module 4 across traits,

filter(par_h2, complete.cases(par_h2)) %>%
  select(trait_id, `GWAS Group`, `GWAS Trait`, Enrichment, Enrichment_std_error, Enrichment_p) %>%
  arrange(Enrichment_p) %>%
  knitr::kable()
trait_id GWAS Group GWAS Trait Enrichment Enrichment_std_error Enrichment_p
30100 Platelets Mean platelet volume 6.6765352 1.2648330 0.0000122
30080 Platelets Platelet count 3.9722416 0.6876781 0.0000149
30110 Platelets Platelet distribution width 6.4866017 1.3775005 0.0000703
30000 White blood cells White blood cell count 1.7044407 0.2371545 0.0031285
30090 Platelets Platelet crit 2.6319136 0.5896392 0.0054441
30070 Red blood cells Red blood cell distribution width 2.4668573 0.5518419 0.0086231
30180 White blood cells Lymphocyte percentage 1.9901747 0.3821117 0.0096199
30140 White blood cells Neutrophill count 1.7197007 0.2894397 0.0136088
30010 Red blood cells Red blood cell count 2.0065541 0.4148286 0.0150241
30120 White blood cells Lymphocyte count 1.8146372 0.3372063 0.0177686
30250 Red blood cells Reticulocyte count 2.7844407 0.7293367 0.0202140
30200 White blood cells Neutrophill percentage 1.9836064 0.4284491 0.0209660
30240 Red blood cells Reticulocyte percentage 2.9140927 0.7922800 0.0219897
30300 Red blood cells High light scatter reticulocyte count 2.6861645 0.7292037 0.0274268
30290 Red blood cells High light scatter reticulocyte percentage 2.7459781 0.7675678 0.0300567
30020 Red blood cells Haemoglobin concentration 1.6980012 0.3276797 0.0317306
30270 Red blood cells Mean sphered cell volume 1.9486126 0.4529099 0.0363589
30030 Red blood cells Haematocrit percentage 1.7133299 0.3475986 0.0385278
30260 Red blood cells Mean reticulocyte volume 2.2668847 0.6506316 0.0490014
30050 Red blood cells Mean corpuscular haemoglobin 2.7048509 0.8835458 0.0561511
30040 Red blood cells Mean corpuscular volume 2.5563837 0.8236809 0.0617733
30280 Red blood cells Immature reticulocyte fraction 2.7713588 0.9290462 0.0666886
30190 White blood cells Monocyte percentage 1.4886842 0.3374385 0.1352258
30130 White blood cells Monocyte count 1.3673806 0.2700425 0.1624942
30060 Red blood cells Mean corpuscular haemoglobin concentration 1.7191593 0.6399688 0.2597777
30210 White blood cells Eosinophill percentage 0.7715102 0.2304061 0.3313856
30150 White blood cells Eosinophill count 0.8082552 0.2608841 0.4680572
30220 White blood cells Basophill percentage 1.1889507 0.4896314 0.6946688
30160 White blood cells Basophill count 0.8833706 0.4909206 0.8118197

Session info

sessionInfo()
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-conda-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
## 
## Matrix products: default
## BLAS/LAPACK: /scratch/midway2/liliw1/conda_env/rstudio-server/lib/libopenblasp-r0.3.18.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] data.table_1.14.2 forcats_0.5.1     stringr_1.4.0     dplyr_1.0.7      
##  [5] purrr_0.3.4       readr_2.1.2       tidyr_1.2.0       tibble_3.1.7     
##  [9] ggplot2_3.3.6     tidyverse_1.3.1  
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.1.1 xfun_0.29        bslib_0.3.1      haven_2.4.3     
##  [5] colorspace_2.0-3 vctrs_0.4.1      generics_0.1.2   htmltools_0.5.2 
##  [9] yaml_2.2.2       utf8_1.2.2       rlang_1.0.3      jquerylib_0.1.4 
## [13] pillar_1.7.0     withr_2.5.0      glue_1.6.2       DBI_1.1.3       
## [17] dbplyr_2.1.1     modelr_0.1.8     readxl_1.3.1     lifecycle_1.0.1 
## [21] cellranger_1.1.0 munsell_0.5.0    gtable_0.3.0     rvest_1.0.2     
## [25] evaluate_0.14    knitr_1.37       tzdb_0.2.0       fastmap_1.1.0   
## [29] fansi_1.0.3      highr_0.9        broom_0.7.12     Rcpp_1.0.8.3    
## [33] backports_1.4.1  scales_1.2.0     jsonlite_1.7.3   fs_1.5.2        
## [37] hms_1.1.1        digest_0.6.29    stringi_1.7.6    grid_4.1.2      
## [41] cli_3.3.0        tools_4.1.2      magrittr_2.0.3   sass_0.4.0      
## [45] crayon_1.5.1     pkgconfig_2.0.3  ellipsis_0.3.2   xml2_1.3.3      
## [49] reprex_2.0.1     lubridate_1.8.0  assertthat_0.2.1 rmarkdown_2.11  
## [53] httr_1.4.2       rstudioapi_0.13  R6_2.5.1         compiler_4.1.2