Quickstart.Rmd
rCOGS is a package for integrating GWAS summary statistics with the p romoter c apture Hi-C (pcHi-C) contact maps in order to prioritise genes for functional followup. COGS stands for C apture Hi-C O mnibus G ene S core and works best with pcHi-C datasets processed using CHiCAGO.
The hardest part of running rCOGS is making sure source files are available and in the correct format. With all files make sure they are relevant to the population that you are studying and use the correct genome build coordinates. Unless explicitly stated all coords are GRCh37 and non-zero based.
Full support files that are suitable for use with data from Javierre et al. are available in ./inst/extdata/
. Note that this vignette uses cutdown versions of these files filtered to use only chr22 to speed things up.
Depending on usage rCOGS requires the following files:
These could be computed using data from the International HapMap project or you can use more recent alternative regions from Berisa et al.. Note that your LD file should match the population that you are studying! If you are using a more targeted platform such as ImmunoChip then you should use the LD regions used to design the platform. What is key is that these regions are non overlapping, it does not matter if they are not contiguous (you will just lose associations that don’t overlap). The format is non-zero based BED format as follows
chr | start | end |
---|---|---|
1 | 1 | 888659 |
1 | 888660 | 1891263 |
1 | 1891264 | 2299651 |
In order to model whether a given SNP is causal rCOGS needs an estimate of the Minor allele Frequency (MAF) in a control set of samples. If you are using your own data you will have this information to hand. If on the other hand you are using summary statistics downloaded from a public resource such as GWAS Catalog you can use a reference set of genotypes such as 1KGenomes again it is important to make sure that these are relevant to the population that you are studying. The format is as follows and each variant to assessed should have a MAF in this file otherwise it will be excluded. Again position are non-zero based.
chr | pos | maf |
---|---|---|
1 | 713914 | 0.033325 |
1 | 714310 | 0.026845 |
1 | 715265 | 0.037027 |
These are the univariate p-values obtained from GWAS analysis, nowadays the fitting of regression models is mainly performed in PLINK and thus this output format is supported. If however you have used an alternative method to assess association then it should have the following format.
chr | pos | p |
---|---|---|
1 | 768253 | 0.617 |
1 | 781845 | 0.891 |
1 | 787606 | 0.871 |
1 | 787844 | 0.977 |
By far the most complicated data file this should have the following format with data taken from Javierre et al.. It is an enhanced version of the peakMatrix format output by CHiCAGO.
ensg | name | biotype | strand | baitChr | baitStart | baitEnd | baitID | baitName | oeChr | oeStart | oeEnd | oeID | oeName | dist | Monocytes | Macrophages_M0 | Macrophages_M1 | Macrophages_M2 | Neutrophils | Megakaryocytes | Endothelial_precursors | Erythroblasts | Foetal_thymus | Naive_CD4 | Total_CD4_MF | Total_CD4_Activated | Total_CD4_NonActivated | Naive_CD8 | Total_CD8 | Naive_B | Total_B |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ENSG00000000003 | TSPAN6 | protein_coding | - | X | 99894205 | 99902395 | 813869 | SRPX2;TSPAN6 | X | 100023790 | 100032508 | 813903 | . | 129849 | 1.2768325567372 | 3.36925757955278 | 3.50427556762941 | 5.77885715736681 | 0.71946319470011 | 1.57426075233114 | 0.935151981213395 | 2.07176252723988 | 0.518991005066328 | 1.35844284079167 | 1.80051450777455 | 2.14482233107626 | 2.81450087489112 | 1.0646936885784 | 0.329294941260539 | 2.31949322698618 | 2.91843729769059 |
ENSG00000000003 | TSPAN6 | protein_coding | - | X | 99894205 | 99902395 | 813869 | SRPX2;TSPAN6 | X | 100038149 | 100039492 | 813905 | . | 140521 | 2.22145436128498 | 2.34713825641161 | 1.94777355869935 | 1.8326996671119 | 1.19340351282302 | 4.71363557745529 | 1.75066400673428 | 3.10545714323329 | 1.8506228303492 | 1.52337854148901 | 1.92401193124468 | 1.4553080801738 | 0.959541835366562 | 1.47904009071165 | 2.33301812072822 | 5.09250926114187 | 3.5849349597336 |
This file represents the output of an in silico restriction digest for the restriction enzyme employed in the pcHi-C experiment. At the time that this was written this was Hind III. These regions should be zero based and match those in the pcHi-C file above, additionally, the fragid column should refer to the baitID and oeID columns in the pcHi-C dataset above.
chr | start | end | fragid |
---|---|---|---|
1 | 1 | 16007 | 1 |
1 | 16008 | 24571 | 2 |
1 | 24572 | 27981 | 3 |
1 | 27982 | 30429 | 4 |
This file is a list of all the fragments for which capture probes were designed and depends on what annotation you are using. The fragid column again should be compatible with the restriction digest file and pcHi-C file. The ensg should also map to the ensg file in pcHi-C data file. Note that a fragment can contain more than one promoter !
fragid | ensg |
---|---|
218 | ENSG00000272438 |
218 | ENSG00000230699 |
219 | ENSG00000241180 |
220 | ENSG00000268179 |
220 | ENSG00000223764 |
220 | ENSG00000187634 |
223 | ENSG00000187961 |
rCOGS integrates functional information where possible. To do this it needs a catalogue of exonic variants. This can be obtained from a tool such as VEP and needs to have the following format:
chr | pos | ensg |
---|---|---|
1 | 99883719 | ENSG00000000003 |
1 | 99883962 | ENSG00000000003 |
1 | 99884017 | ENSG00000000003 |
1 | 99884166 | ENSG00000000034 |
You will need to have the devtool R package installed
library(devtools) install_github("ollyburren/rCOGS")
library(rCOGS) #> Loading required package: data.table library(GenomicRanges) #> Loading required package: stats4 #> Loading required package: BiocGenerics #> Loading required package: parallel #> #> Attaching package: 'BiocGenerics' #> The following objects are masked from 'package:parallel': #> #> clusterApply, clusterApplyLB, clusterCall, clusterEvalQ, #> clusterExport, clusterMap, parApply, parCapply, parLapply, #> parLapplyLB, parRapply, parSapply, parSapplyLB #> The following objects are masked from 'package:stats': #> #> IQR, mad, sd, var, xtabs #> The following objects are masked from 'package:base': #> #> anyDuplicated, append, as.data.frame, basename, cbind, colnames, #> dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep, #> grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget, #> order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank, #> rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply, #> union, unique, unsplit, which, which.max, which.min #> Loading required package: S4Vectors #> #> Attaching package: 'S4Vectors' #> The following objects are masked from 'package:data.table': #> #> first, second #> The following object is masked from 'package:base': #> #> expand.grid #> Loading required package: IRanges #> #> Attaching package: 'IRanges' #> The following object is masked from 'package:data.table': #> #> shift #> Loading required package: GenomeInfoDb library(magrittr) library(knitr)
Here we load in ld regions - these are checked and output as a GRanges object
This is how to build objects - note due to size you will need to download these supporting objects into ./inst/extdata/
ld.gr <- load_ld_regions('../inst/extdata/hamap_1cM_recomb.bed') ld.gr
#> GRanges object with 69 ranges and 1 metadata column:
#> seqnames ranges strand | ldid
#> <Rle> <IRanges> <Rle> | <integer>
#> [1] 22 1-16252600 * | 3385
#> [2] 22 16252601-17089639 * | 3386
#> [3] 22 17089640-17571604 * | 3387
#> [4] 22 17571605-17718606 * | 3388
#> [5] 22 17718607-17823261 * | 3389
#> ... ... ... ... . ...
#> [65] 22 49349502-49592082 * | 3449
#> [66] 22 49592083-49777002 * | 3450
#> [67] 22 49777003-49910517 * | 3451
#> [68] 22 49910518-50835040 * | 3452
#> [69] 22 50835041-536870911 * | 3453
#> -------
#> seqinfo: 22 sequences from an unspecified genome; no seqlengths
maf.DT <- load_ref_maf('../inst/extdata/uk10k_reference_maf.tab',min.maf=0.05) head(maf.DT)
#> chr pos maf pid
#> 1: 22 16051249 0.107115 22:16051249
#> 2: 22 16053444 0.072997 22:16053444
#> 3: 22 16054667 0.244909 22:16054667
#> 4: 22 16055122 0.085559 22:16055122
#> 5: 22 16057417 0.107908 22:16057417
#> 6: 22 16490618 0.342238 22:16490618
N.B here we have a study with 4036 cases and 6959 controls.
t1d.DT <- load_gwas('../inst/extdata/ncooper_t1d_gwas.tab',maf.DT,ld.gr,n.cases=5913,n.controls=8829) head(t1d.DT)
#> pid chr pos p maf ld ppi
#> 1: 22:16849941 22 16849941 0.798 0.449220 2 1.221142e-05
#> 2: 22:16851640 22 16851640 0.419 0.374769 2 1.675352e-05
#> 3: 22:16851673 22 16851673 0.419 0.374769 2 1.675352e-05
#> 4: 22:16851899 22 16851899 0.419 0.374372 2 1.675700e-05
#> 5: 22:16852303 22 16852303 0.395 0.368024 2 1.740740e-05
#> 6: 22:16852914 22 16852914 0.411 0.368553 2 1.699918e-05
Notice that not all variants are imported, some are ommited as they are below the MAF filter (5%) used in the previous step others don’t overlap an LD regions.
note the biotype.filter argument. This can be a vector of biotypes or left blank if everything is required.
feature.sets <- make_pchic('../inst/extdata/javierre_pchic_ensembl75.tab',biotype.filter='protein_coding') names(feature.sets)
#> [1] "Monocytes" "Macrophages_M0" "Macrophages_M1"
#> [4] "Macrophages_M2" "Neutrophils" "Megakaryocytes"
#> [7] "Endothelial_precursors" "Erythroblasts" "Foetal_thymus"
#> [10] "Naive_CD4" "Total_CD4_MF" "Total_CD4_Activated"
#> [13] "Total_CD4_NonActivated" "Naive_CD8" "Total_CD8"
#> [16] "Naive_B" "Total_B" "VProm"
Note that we obtain a list where each element is a data.table
head(feature.sets[[1]]) #> ensg fragid chic.score #> 1: ENSG00000015475 429638 5.683019 #> 2: ENSG00000015475 429639 6.259106 #> 3: ENSG00000015475 429641 5.181800 #> 4: ENSG00000015475 429651 5.261521 #> 5: ENSG00000015475 429663 5.580373 #> 6: ENSG00000015475 429743 5.604308
All Hi-C technologies struggle when assessing interactions between adjacent restricition fragments as it is impossible to resolve what is due to stochastic Brownian motion and which are short range interactions. To offset this COGS has a routine that creates a `Virtual’ promter region that consists of a baited fragment and the adjacent 5’ and 3’ fragments. It takes the digest and design file as arguments.
We add this to the list of regions.
## get a list of all genes included all.genes <- lapply(feature.sets,function(g) unique(g$ensg)) %>% do.call('c',.) %>% unique feature.sets[['VProm']] <- make_vprom('../inst/extdata/hindIII_digest_grch37.tab','./inst/extdata/javierre_design_ensembl75.tab',all.genes)
returns a genomic ranges object
csnps.gr <- make_csnps('../inst/extdata/coding_snps_vep_ensembl75.tab') head(csnps.gr)
#> GRanges object with 6 ranges and 2 metadata columns:
#> seqnames ranges strand | fragid ensg
#> <Rle> <IRanges> <Rle> | <numeric> <character>
#> [1] 22 17264320 * | 0 ENSG00000172967
#> [2] 22 17264341 * | 0 ENSG00000172967
#> [3] 22 17264369 * | 0 ENSG00000172967
#> [4] 22 17264400 * | 0 ENSG00000172967
#> [5] 22 17264498 * | 0 ENSG00000172967
#> [6] 22 17264565 * | 0 ENSG00000172967
#> -------
#> seqinfo: 24 sequences from an unspecified genome; no seqlengths
The fundamental unit of a pcHi-C map is a restriction fragment this saves us a lot of computation as we can precompute the support for each fragment to be causal once and then combine to get scores.
digest.gr <- load_digest('../inst/extdata/hindIII_digest_grch37.tab') head(digest.gr)
#> GRanges object with 6 ranges and 1 metadata column:
#> seqnames ranges strand | fragid
#> <Rle> <IRanges> <Rle> | <integer>
#> [1] 22 1-16058012 * | 429254
#> [2] 22 16058013-16058591 * | 429255
#> [3] 22 16058592-16060021 * | 429256
#> [4] 22 16060022-16067173 * | 429257
#> [5] 22 16067174-16071272 * | 429258
#> [6] 22 16071273-16073271 * | 429259
#> -------
#> seqinfo: 24 sequences from an unspecified genome; no seqlengths
MHC/HLA has an extremely complex and long LD structure and the genetic finemapping method employed is unable to effectively deal with it, thus it is best to exclude and analyse separately.
overall.scores <- compute_cogs(t1d.DT,csnps.gr,digest.gr,feature.sets) #> rCOGS:compute_cogs feature.names argument missing computing overall score using Monocytes, #> Macrophages_M0, #> Macrophages_M1, #> Macrophages_M2, #> Neutrophils, #> Megakaryocytes, #> Endothelial_precursors, #> Erythroblasts, #> Foetal_thymus, #> Naive_CD4, #> Total_CD4_MF, #> Total_CD4_Activated, #> Total_CD4_NonActivated, #> Naive_CD8, #> Total_CD8, #> Naive_B, #> Total_B, #> VProm, #> coding_snp head(overall.scores[order(cogs,decreasing = TRUE),]) %>% kable
ensg | cogs |
---|---|
ENSG00000187045 | 0.9569800 |
ENSG00000128342 | 0.6052610 |
ENSG00000100385 | 0.2368662 |
ENSG00000185133 | 0.2196913 |
ENSG00000198832 | 0.2192132 |
ENSG00000099954 | 0.2183529 |
target_tissue<-c('Total_CD4_Activated','Total_CD4_NonActivated','Naive_CD4','Total_CD4_MF','Naive_CD8','Total_CD8') tcell.scores <- compute_cogs(t1d.DT,csnps.gr,digest.gr,feature.sets,target_tissue) %>% head(.,) head(tcell.scores[order(cogs,decreasing = TRUE),]) %>% kable
ensg | cogs |
---|---|
ENSG00000099954 | 0.2106387 |
ENSG00000099968 | 0.0401019 |
ENSG00000131100 | 0.0401019 |
ENSG00000069998 | 0.0056926 |
ENSG00000172967 | 0.0007576 |
ENSG00000182902 | 0.0000110 |