rCOGS Quickstart

What is rCOGS ?

rCOGS is a package for integrating GWAS summary statistics with the p romoter c apture Hi-C (pcHi-C) contact maps in order to prioritise genes for functional followup. COGS stands for C apture Hi-C O mnibus G ene S core and works best with pcHi-C datasets processed using CHiCAGO.

What does rCOGS require ?

The hardest part of running rCOGS is making sure source files are available and in the correct format. With all files make sure they are relevant to the population that you are studying and use the correct genome build coordinates. Unless explicitly stated all coords are GRCh37 and non-zero based.

Full support files that are suitable for use with data from Javierre et al. are available in ./inst/extdata/. Note that this vignette uses cutdown versions of these files filtered to use only chr22 to speed things up.

Depending on usage rCOGS requires the following files:

approximate linkage disequilibrium independent region files

These could be computed using data from the International HapMap project or you can use more recent alternative regions from Berisa et al.. Note that your LD file should match the population that you are studying! If you are using a more targeted platform such as ImmunoChip then you should use the LD regions used to design the platform. What is key is that these regions are non overlapping, it does not matter if they are not contiguous (you will just lose associations that don’t overlap). The format is non-zero based BED format as follows

chr	start	end
1	1	888659
1	888660	1891263
1	1891264	2299651

Minor allele frequency estimates in controls

In order to model whether a given SNP is causal rCOGS needs an estimate of the Minor allele Frequency (MAF) in a control set of samples. If you are using your own data you will have this information to hand. If on the other hand you are using summary statistics downloaded from a public resource such as GWAS Catalog you can use a reference set of genotypes such as 1KGenomes again it is important to make sure that these are relevant to the population that you are studying. The format is as follows and each variant to assessed should have a MAF in this file otherwise it will be excluded. Again position are non-zero based.

chr	pos	maf
1	713914	0.033325
1	714310	0.026845
1	715265	0.037027

Trait association statistics

These are the univariate p-values obtained from GWAS analysis, nowadays the fitting of regression models is mainly performed in PLINK and thus this output format is supported. If however you have used an alternative method to assess association then it should have the following format.

chr	pos	p
1	768253	0.617
1	781845	0.891
1	787606	0.871
1	787844	0.977

pcHi-C data

By far the most complicated data file this should have the following format with data taken from Javierre et al.. It is an enhanced version of the peakMatrix format output by CHiCAGO.

ensg	name	biotype	strand	baitChr	baitStart	baitEnd	baitID	baitName	oeChr	oeStart	oeEnd	oeID	oeName	dist	Monocytes	Macrophages_M0	Macrophages_M1	Macrophages_M2	Neutrophils	Megakaryocytes	Endothelial_precursors	Erythroblasts	Foetal_thymus	Naive_CD4	Total_CD4_MF	Total_CD4_Activated	Total_CD4_NonActivated	Naive_CD8	Total_CD8	Naive_B	Total_B
ENSG00000000003	TSPAN6	protein_coding	-	X	99894205	99902395	813869	SRPX2;TSPAN6	X	100023790	100032508	813903	.	129849	1.2768325567372	3.36925757955278	3.50427556762941	5.77885715736681	0.71946319470011	1.57426075233114	0.935151981213395	2.07176252723988	0.518991005066328	1.35844284079167	1.80051450777455	2.14482233107626	2.81450087489112	1.0646936885784	0.329294941260539	2.31949322698618	2.91843729769059
ENSG00000000003	TSPAN6	protein_coding	-	X	99894205	99902395	813869	SRPX2;TSPAN6	X	100038149	100039492	813905	.	140521	2.22145436128498	2.34713825641161	1.94777355869935	1.8326996671119	1.19340351282302	4.71363557745529	1.75066400673428	3.10545714323329	1.8506228303492	1.52337854148901	1.92401193124468	1.4553080801738	0.959541835366562	1.47904009071165	2.33301812072822	5.09250926114187	3.5849349597336

Restriction fragment digest file

This file represents the output of an in silico restriction digest for the restriction enzyme employed in the pcHi-C experiment. At the time that this was written this was Hind III. These regions should be zero based and match those in the pcHi-C file above, additionally, the fragid column should refer to the baitID and oeID columns in the pcHi-C dataset above.

chr	start	end	fragid
1	1	16007	1
1	16008	24571	2
1	24572	27981	3
1	27982	30429	4

pcHi-C design/annotation file

This file is a list of all the fragments for which capture probes were designed and depends on what annotation you are using. The fragid column again should be compatible with the restriction digest file and pcHi-C file. The ensg should also map to the ensg file in pcHi-C data file. Note that a fragment can contain more than one promoter !

fragid	ensg
218	ENSG00000272438
218	ENSG00000230699
219	ENSG00000241180
220	ENSG00000268179
220	ENSG00000223764
220	ENSG00000187634
223	ENSG00000187961

Coding SNPs

rCOGS integrates functional information where possible. To do this it needs a catalogue of exonic variants. This can be obtained from a tool such as VEP and needs to have the following format:

chr	pos	ensg
1	99883719	ENSG00000000003
1	99883962	ENSG00000000003
1	99884017	ENSG00000000003
1	99884166	ENSG00000000034

Installing COGS

You will need to have the devtool R package installed

library(devtools)
install_github("ollyburren/rCOGS")

library(rCOGS)
#> Loading required package: data.table
library(GenomicRanges)
#> Loading required package: stats4
#> Loading required package: BiocGenerics
#> Loading required package: parallel
#> 
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:parallel':
#> 
#>     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
#>     clusterExport, clusterMap, parApply, parCapply, parLapply,
#>     parLapplyLB, parRapply, parSapply, parSapplyLB
#> The following objects are masked from 'package:stats':
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#> 
#>     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
#>     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
#>     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
#>     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
#>     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
#>     union, unique, unsplit, which, which.max, which.min
#> Loading required package: S4Vectors
#> 
#> Attaching package: 'S4Vectors'
#> The following objects are masked from 'package:data.table':
#> 
#>     first, second
#> The following object is masked from 'package:base':
#> 
#>     expand.grid
#> Loading required package: IRanges
#> 
#> Attaching package: 'IRanges'
#> The following object is masked from 'package:data.table':
#> 
#>     shift
#> Loading required package: GenomeInfoDb
library(magrittr)
library(knitr)

loading recombination data

Here we load in ld regions - these are checked and output as a GRanges object

This is how to build objects - note due to size you will need to download these supporting objects into ./inst/extdata/

ld.gr <- load_ld_regions('../inst/extdata/hamap_1cM_recomb.bed')
ld.gr

#> GRanges object with 69 ranges and 1 metadata column:
#>        seqnames             ranges strand |      ldid
#>           <Rle>          <IRanges>  <Rle> | <integer>
#>    [1]       22         1-16252600      * |      3385
#>    [2]       22  16252601-17089639      * |      3386
#>    [3]       22  17089640-17571604      * |      3387
#>    [4]       22  17571605-17718606      * |      3388
#>    [5]       22  17718607-17823261      * |      3389
#>    ...      ...                ...    ... .       ...
#>   [65]       22  49349502-49592082      * |      3449
#>   [66]       22  49592083-49777002      * |      3450
#>   [67]       22  49777003-49910517      * |      3451
#>   [68]       22  49910518-50835040      * |      3452
#>   [69]       22 50835041-536870911      * |      3453
#>   -------
#>   seqinfo: 22 sequences from an unspecified genome; no seqlengths

loading MAF data filter at 5%

maf.DT <- load_ref_maf('../inst/extdata/uk10k_reference_maf.tab',min.maf=0.05)
head(maf.DT)

#>    chr      pos      maf         pid
#> 1:  22 16051249 0.107115 22:16051249
#> 2:  22 16053444 0.072997 22:16053444
#> 3:  22 16054667 0.244909 22:16054667
#> 4:  22 16055122 0.085559 22:16055122
#> 5:  22 16057417 0.107908 22:16057417
#> 6:  22 16490618 0.342238 22:16490618

load association data

N.B here we have a study with 4036 cases and 6959 controls.

t1d.DT <- load_gwas('../inst/extdata/ncooper_t1d_gwas.tab',maf.DT,ld.gr,n.cases=5913,n.controls=8829)
head(t1d.DT)

#>            pid chr      pos     p      maf ld          ppi
#> 1: 22:16849941  22 16849941 0.798 0.449220  2 1.221142e-05
#> 2: 22:16851640  22 16851640 0.419 0.374769  2 1.675352e-05
#> 3: 22:16851673  22 16851673 0.419 0.374769  2 1.675352e-05
#> 4: 22:16851899  22 16851899 0.419 0.374372  2 1.675700e-05
#> 5: 22:16852303  22 16852303 0.395 0.368024  2 1.740740e-05
#> 6: 22:16852914  22 16852914 0.411 0.368553  2 1.699918e-05

Notice that not all variants are imported, some are ommited as they are below the MAF filter (5%) used in the previous step others don’t overlap an LD regions.

load in pcHi-C map

note the biotype.filter argument. This can be a vector of biotypes or left blank if everything is required.

feature.sets <- make_pchic('../inst/extdata/javierre_pchic_ensembl75.tab',biotype.filter='protein_coding')
names(feature.sets)

#>  [1] "Monocytes"              "Macrophages_M0"         "Macrophages_M1"        
#>  [4] "Macrophages_M2"         "Neutrophils"            "Megakaryocytes"        
#>  [7] "Endothelial_precursors" "Erythroblasts"          "Foetal_thymus"         
#> [10] "Naive_CD4"              "Total_CD4_MF"           "Total_CD4_Activated"   
#> [13] "Total_CD4_NonActivated" "Naive_CD8"              "Total_CD8"             
#> [16] "Naive_B"                "Total_B"                "VProm"

Note that we obtain a list where each element is a data.table

head(feature.sets[[1]])
#>               ensg fragid chic.score
#> 1: ENSG00000015475 429638   5.683019
#> 2: ENSG00000015475 429639   6.259106
#> 3: ENSG00000015475 429641   5.181800
#> 4: ENSG00000015475 429651   5.261521
#> 5: ENSG00000015475 429663   5.580373
#> 6: ENSG00000015475 429743   5.604308

create Virtual Promoter regions

All Hi-C technologies struggle when assessing interactions between adjacent restricition fragments as it is impossible to resolve what is due to stochastic Brownian motion and which are short range interactions. To offset this COGS has a routine that creates a `Virtual’ promter region that consists of a baited fragment and the adjacent 5’ and 3’ fragments. It takes the digest and design file as arguments.

We add this to the list of regions.

## get a list of all genes included
all.genes <- lapply(feature.sets,function(g) unique(g$ensg)) %>% do.call('c',.) %>% unique
feature.sets[['VProm']] <- make_vprom('../inst/extdata/hindIII_digest_grch37.tab','./inst/extdata/javierre_design_ensembl75.tab',all.genes)

load in coding SNPS

returns a genomic ranges object

csnps.gr <- make_csnps('../inst/extdata/coding_snps_vep_ensembl75.tab')
head(csnps.gr)

#> GRanges object with 6 ranges and 2 metadata columns:
#>       seqnames    ranges strand |    fragid            ensg
#>          <Rle> <IRanges>  <Rle> | <numeric>     <character>
#>   [1]       22  17264320      * |         0 ENSG00000172967
#>   [2]       22  17264341      * |         0 ENSG00000172967
#>   [3]       22  17264369      * |         0 ENSG00000172967
#>   [4]       22  17264400      * |         0 ENSG00000172967
#>   [5]       22  17264498      * |         0 ENSG00000172967
#>   [6]       22  17264565      * |         0 ENSG00000172967
#>   -------
#>   seqinfo: 24 sequences from an unspecified genome; no seqlengths

load Digest file

The fundamental unit of a pcHi-C map is a restriction fragment this saves us a lot of computation as we can precompute the support for each fragment to be causal once and then combine to get scores.

digest.gr <- load_digest('../inst/extdata/hindIII_digest_grch37.tab')
head(digest.gr)

#> GRanges object with 6 ranges and 1 metadata column:
#>       seqnames            ranges strand |    fragid
#>          <Rle>         <IRanges>  <Rle> | <integer>
#>   [1]       22        1-16058012      * |    429254
#>   [2]       22 16058013-16058591      * |    429255
#>   [3]       22 16058592-16060021      * |    429256
#>   [4]       22 16060022-16067173      * |    429257
#>   [5]       22 16067174-16071272      * |    429258
#>   [6]       22 16071273-16073271      * |    429259
#>   -------
#>   seqinfo: 24 sequences from an unspecified genome; no seqlengths

Optional(Remove MHC)

MHC/HLA has an extremely complex and long LD structure and the genetic finemapping method employed is unable to effectively deal with it, thus it is best to exclude and analyse separately.

nrow(t1d.DT)
mhc.idx <- which(t1d.DT$chr==6 & between(t1d.DT$pos,25e6,35e6))
if(length(mhc.idx)>0)
    t1d.DT <- t1d.DT[-mhc.idx,]
nrow(t1d.DT)

compute overall cogs score

overall.scores <- compute_cogs(t1d.DT,csnps.gr,digest.gr,feature.sets)
#> rCOGS:compute_cogs feature.names argument missing computing overall score using Monocytes,
#> Macrophages_M0,
#> Macrophages_M1,
#> Macrophages_M2,
#> Neutrophils,
#> Megakaryocytes,
#> Endothelial_precursors,
#> Erythroblasts,
#> Foetal_thymus,
#> Naive_CD4,
#> Total_CD4_MF,
#> Total_CD4_Activated,
#> Total_CD4_NonActivated,
#> Naive_CD8,
#> Total_CD8,
#> Naive_B,
#> Total_B,
#> VProm,
#> coding_snp
head(overall.scores[order(cogs,decreasing = TRUE),]) %>% kable

ensg	cogs
ENSG00000187045	0.9569800
ENSG00000128342	0.6052610
ENSG00000100385	0.2368662
ENSG00000185133	0.2196913
ENSG00000198832	0.2192132
ENSG00000099954	0.2183529

Compute a T cell specific score

target_tissue<-c('Total_CD4_Activated','Total_CD4_NonActivated','Naive_CD4','Total_CD4_MF','Naive_CD8','Total_CD8')
tcell.scores <- compute_cogs(t1d.DT,csnps.gr,digest.gr,feature.sets,target_tissue) %>% head(.,)
head(tcell.scores[order(cogs,decreasing = TRUE),]) %>% kable

ensg	cogs
ENSG00000099954	0.2106387
ENSG00000099968	0.0401019
ENSG00000131100	0.0401019
ENSG00000069998	0.0056926
ENSG00000172967	0.0007576
ENSG00000182902	0.0000110

Olly Burren

2020-07-31