In this tutorial, we apply the KnockoffZoom method to toy dataset and guide you through each step of the analysis.
First, download KnockoffZoom from this (repository). The repository includes the core of KnockoffZoom and a copy of this notebook that can be easily executed.
Second, ensure that the following software is available and executable. Note that KnockoffZoom is designed for Linux systems.
You will also need the following R packages to run KnockoffZoom.
This script will check whether the are available and, if they are not, install them for you.
system("Rscript --vanilla ../knockoffzoom/utils/check_packages.R")
Finally, this notebook requires a few packages in addition to the KnockoffZoom core.
library(tidyverse)
library(gridExtra)
library(latex2exp)
library(kableExtra)
library(ggdendro)
The toy dataset provided with KnockoffZoom contains 1000 artificial samples typed at 2000 loci. These loci are divided equally between chromosomes 21 and 22.
chr.min <- 21
chr.max <- 22
chr.list <- seq(chr.min, chr.max)
The genetic data are stored in the standard “BED/BIM/FAM” format, while the phased haplotypes are encoded in a BGEN 1.3 file.
For completeness, we also provide you with a list of individuals and variants that passed quality control. However, all individuals and variants passed quality control in this toy dataset.
# List of individuals that passed QC
qc.samples <- "../data/qc/samples_qc.txt"
# List of variants that passed QC
qc.variants <- "../data/qc/variants_qc.txt"
The phenotypes are saved in a tab-separated text file that also contains information on the sex of the subjects.
# Phenotype file
pheno.file <- "../data/phenotypes/phenotypes.tab"
# Phenotype name
pheno.name <- "y"
Out of curiosity, let’s take a quick look at the distribution of the phenotypes.
phenotypes <- read_tsv(pheno.file, col_types=cols())
phenotypes %>% mutate(sex=as.factor(sex)) %>%
ggplot(aes(x=sex, y=y)) + geom_boxplot() + theme_bw()
The complete list of genetic variants can be inspected as follows.
# Load list of variants in the dataset
Variants <- lapply(chr.list, function(chr) {
bim.file <- sprintf("../data/genotypes/example_chr%d.bim", chr)
bim.colnames <- c("CHR", "SNP", "X0", "BP", "X1", "X2")
read_delim(bim.file, delim="\t", col_names=bim.colnames, col_types=cols()) %>%
select(CHR, SNP, BP)
})
Variants <- do.call("rbind", Variants)
# Keep only variants that passed QC
variants.qc <- read_delim(qc.variants, delim=" ", col_names=c("SNP"), col_types=cols())
Variants <- inner_join(Variants, variants.qc, by = "SNP")
## # A tibble: 2,000 x 3
## CHR SNP BP
## <dbl> <chr> <dbl>
## 1 21 rs7282547 14009151
## 2 21 rs2792379 14027356
## 3 21 rs2792452 14041746
## 4 21 rs1929150 14051249
## 5 21 rs2822068 14056240
## 6 21 rs2260407 14057714
## 7 21 rs2260627 14074393
## 8 21 rs7282370 14074794
## 9 21 rs2260673 14075395
## 10 21 rs2258197 14087640
## # … with 1,990 more rows
Before we apply KnockoffZoom, let’s create temporary directories to store the final results and the intermediate processed information.
tmp.dir <- "tmp"
dir.create(tmp.dir, showWarnings = FALSE)
res.dir <- "results"
dir.create(res.dir, showWarnings = FALSE)
The KnockoffZoom method is organized in 5 modules, as summarized by the flowchart below. In the following, we will apply it step-by-step to the analysis of the above toy dataset.
With the first module KnockoffZoom, we recode the phased haplotypes into the correct format and feed them to fastPHASE
to fit an HMM.
bgen.to.inp <- function(bgen.basename, chr, qc.samples, qc.variants, inp.basename) {
# Convert BGEN to transposed HAPS
cmd <- "../knockoffzoom/utils/bgen_to_hapst.sh"
cmd.args <- sprintf("%s -i %s -c %s -n %s -v %s -b 500 -o %s",
cmd, bgen.basename, chr, qc.samples, qc.variants, inp.basename)
system(cmd.args)
# Convert Transposed HAPS to INP
cmd <- "../knockoffzoom/utils/hapst_to_inp.sh"
cmd.args <- sprintf("%s -i %s -o %s",
cmd, inp.basename, inp.basename)
system(cmd.args)
}
for(chr in chr.list) {
# Basename for the input haplotype files (BGEN format)
bgen.basename <- sprintf("../data/haplotypes/example_chr%d", chr)
# Basename for output haplotype file
inp.basename <- sprintf("%s/example_chr%d", tmp.dir, chr)
# Convert haplotypes into INP file
bgen.to.inp(bgen.basename, chr, qc.samples, qc.variants, inp.basename)
}
fit.hmm <- function(inp.file, hmm.basename) {
cmd <- "fastphase"
fp.k <- 50 # Number of haplotype motifs
fp.it <- 15 # Number of EM iterations
fp.seed <- 1234 # Random seed
cmd.args <- sprintf("%s -Pp -T1 -K%d -g -H-4 -B -C%d -S%d -o%s %s",
cmd, fp.k, fp.it, fp.seed, hmm.basename, inp.file)
system(cmd.args)
}
for(chr in chr.list) {
# Name of input haplotype file
inp.file <- sprintf("%s/example_chr%d.inp", tmp.dir, chr)
# Basename for hmm file
hmm.basename <- sprintf("%s/example_chr%d", tmp.dir, chr)
# Fit HMM
fit.hmm(inp.file, hmm.basename)
}
With the second module KnockoffZoom, we partition the genome into LD blocks to define the hypotheses tested in our multi-resolution analysis. In this example, we use the following 5 levels of resolution.
resolution.list <- c(2, 5, 10, 20, 50)
These numerical values describe each resolution as the number of LD blocks, as a percentage of the total number of available SNPS. Therefore, at the highest resolution each LD block contains on average 2 SNPs.
We begin by applying adjacency-constrained clustering to each chromosome, using the \(r^2\) coefficients computed with PLINK as a similarity measure between the SNPs.
cluster.ld <- function(geno.basename, geno.bim, qc.variants, dendro.file) {
# Basename for LD table computed by PLINK
ld.basename <- sprintf("%s/example_chr%d", tmp.dir, chr)
# LD table computed by PLINK
ld.file <- sprintf("%s.ld", ld.basename)
# Compute LD matrix with PLINK
cmd <- sprintf("plink --bfile %s --keep %s --extract %s --r2 dprime --freq --ld-window 1000 --ld-window-kb 1000 --ld-window-r2 0.01 --out %s",
geno.basename, qc.samples, qc.variants, ld.basename)
system(cmd)
# Compute dendrogram with adjacency-constrained clustering
cmd <- "Rscript --vanilla ../knockoffzoom/utils/cluster.R"
cmd.args <- sprintf("%s %s %s %s %s",
cmd, ld.file, geno.bim, qc.variants, dendro.file)
system(cmd.args)
}
for(chr in chr.list) {
# Basename for the input genotype files (PLINK format)
geno.basename <- sprintf("../data/genotypes/example_chr%d", chr)
# Basename for the input genotype files (PLINK format)
geno.bim <- sprintf("%s.bim", geno.basename)
# Dendrogram file
dendro.file <- sprintf("%s/example_chr%d.RData", tmp.dir, chr)
# Cluster the genome
cluster.ld(geno.basename, geno.bim, qc.variants, dendro.file)
}
As an illustration, the clustering computed above can be visualized in the following dendrogram.
chr <- chr.list[1]
# Load dendrogram
dendro.file <- sprintf("%s/example_chr%d.RData", tmp.dir, chr)
load(dendro.file)
# Convert dendrogram for plotting
library(adjclust)
pdf(file = NULL)
hclust.fit <- plot(Sigma.clust,mode="corrected")
dev.off()
unloadNamespace("adjclust")
# Plot dendrogram
dendro <- dendro_data(hclust.fit)
dendro.data <- segment(dendro) %>% as_tibble
dendro.data %>%
ggplot() +
geom_segment(aes(x=x, y=y, xend=xend, yend=yend), alpha=0.9, size=0.2) +
ylab("") + xlab("") +
theme_bw() +
theme(axis.text.y=element_blank(), axis.ticks.y=element_blank())
The LD blocks at different resolutions are obtained by cutting the dendrogram at different heights.
partition.ld <- function(dendro.file, geno.bim, qc.variants, resolution, blocks.basename) {
cmd <- "Rscript --vanilla ../knockoffzoom/utils/partition.R"
cmd.args <- sprintf("%s %s %s %s %s %s",
cmd, dendro.file, geno.bim, qc.variants, resolution, blocks.basename)
system(cmd.args)
}
for(chr in chr.list) {
# Dendrogram file
dendro.file <- sprintf("%s/example_chr%d.RData", tmp.dir, chr)
# Basename for the input genotype files (PLINK format)
geno.bim <- sprintf("../data/genotypes/example_chr%d.bim", chr)
for(resolution in resolution.list) {
# Basename for list of blocks at this resolution
blocks.basename <- sprintf("%s/example_chr%d", tmp.dir, chr)
# Partition the genome
partition.ld(dendro.file, geno.bim, qc.variants, resolution, blocks.basename)
}
}
The average number of SNPs and width (in Mb) of the LD blocks at each resolution are summarized below.
blocks <- lapply(resolution.list, function(resolution) {
blocks.file <- sprintf("%s/example_chr%d_groups%s.txt", tmp.dir, chr, resolution)
blocks.res <- read_delim(blocks.file, delim=" ", col_types=cols()) %>%
mutate(Resolution=factor(resolution, levels=resolution.list, labels=resolution.list))
})
blocks <- do.call("rbind", blocks)
blocks %>%
group_by(Resolution, Group) %>%
summarise(Size=n(), BP.min=min(BP), BP.max=max(BP)) %>%
mutate(Width=round((BP.max-BP.min)/1e6,3)) %>%
group_by(Resolution) %>%
summarise(`Width (Mb)`=mean(Width), `Size (SNPs)`=mean(Size)) %>%
kable(align='ccc') %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F)
Resolution | Width (Mb) | Size (SNPs) |
---|---|---|
2 | 0.155300 | 50 |
5 | 0.056040 | 20 |
10 | 0.024210 | 10 |
20 | 0.010295 | 5 |
50 | 0.001882 | 2 |
We can also look at the distribution of block sizes at any particular resolution.
blocks.res <- blocks %>%
filter(Resolution=="10") %>%
group_by(Group) %>%
summarise(Size=n())
blocks.res %>% ggplot(aes(x=Size)) +
geom_histogram(bins=20) +
xlim(0,50) +
xlab("Block size (SNPs)") +
theme_bw()
With the third module KnockoffZoom, we generate knockoffs to test the multi-resolution hypotheses defined above.
generate.knockoffs <- function(hmm.basename, inp.basename, blocks.file, knockoffs.basename) {
# Generate knockoffs (PED file)
cmd <- "Rscript --vanilla ../knockoffzoom/utils/knockoffs.R"
cmd.args <- sprintf("%s %s %s %s %s",
cmd, hmm.basename, inp.basename, blocks.file, knockoffs.basename)
system(cmd.args)
# Sample information file
sample.file <- sprintf("%s/example_chr%d.sample", tmp.dir, chr)
# Convert knockoffs to BED file
cmd <- "../knockoffzoom/utils/package_knockoffs.sh"
cmd.args <- sprintf("%s %s %s", cmd, knockoffs.basename, sample.file)
system(cmd.args)
}
for(chr in chr.list) {
# Name of input haplotype file
inp.basename <- sprintf("%s/example_chr%d", tmp.dir, chr)
# Basename for hmm file
hmm.basename <- sprintf("%s/example_chr%d", tmp.dir, chr)
for(resolution in resolution.list) {
# Basename for list of blocks at this resolution
blocks.file <- sprintf("%s/example_chr%d_groups%s.txt", tmp.dir, chr, resolution)
# Basename for knockoffs at this resolution
knockoffs.basename <- sprintf("%s/example_chr%d_res%s", tmp.dir, chr, resolution)
# Generate the knockoffs
generate.knockoffs(hmm.basename, inp.basename, blocks.file, knockoffs.basename)
}
}
Let’s verify whether the knockoffs that we have generated are indeed exchangeable with genotypes. Since the sample size is relatively small in this example, we expect to see some noise in the following goodness-of-fit diagnostics.
source("../knockoffzoom/utils/knockoffs_gof.R")
diagnose.knockoffs <- function(chr, resolution) {
# Basename for knockoffs at this resolution
knockoffs.basename <- sprintf("%s/example_chr%d_res%s", tmp.dir, chr, resolution)
# Basename for knockoffs diagnostics
diagnostics.basename <- sprintf("%s/example_chr%d_res%s", tmp.dir, chr, resolution)
# Knockoff key file
key.file <- sprintf("%s/example_chr%d_res%s.key", tmp.dir, chr, resolution)
# Basename for list of blocks at this resolution
blocks.file <- sprintf("%s/example_chr%d_groups%s.txt", tmp.dir, chr, resolution)
# Compare LD matrix and frequencies with PLINK
cmd <- sprintf("plink --bfile %s --r2 --freq --ld-window 100 --ld-window-kb 1000 --ld-window-r2 0.01 --memory 1000 --out %s",
knockoffs.basename, diagnostics.basename)
system(cmd)
# Basename for plots
plot.basename <- sprintf("%s/example_chr%d_res%s", res.dir, 21, 10)
# Plot knockoff diagnostics
pp <- plot.knockoff.diagnostics(diagnostics.basename, key.file, blocks.file, plot.basename)
}
# Compute diagnostics with PLINK and plot them
diagnose.knockoffs(21, 10)
With the fourth module KnockoffZoom, we compute feature importance statistics for the genotypes and their knockoffs, by fitting a multivariate predictive model of the trait. In this example, we use sparse linear regression with cross-validation. Then, we compute test statistics for each LD block, resolution-by-resolution.
We adopt an efficient implementation of sparse linear regression that is suitable for very large datasets that do not fit in memory. Therefore, we first convert the genotypes and the knockoffs into a memory-mapped file (file-backed matrix, FBM) that we will later be able to process efficiently.
make.fbm <- function(gen.basename, fbm.basename, resolution) {
# Augment genotypes with knockoffs
cmd <- "../knockoffzoom/utils/augment_genotypes.sh"
cmd.args <- sprintf("%s %s %s %s %s %s",
cmd, gen.basename, fbm.basename, resolution, chr.min, chr.max)
system(cmd.args)
# Convert augmented BED to FBM
cmd <- "Rscript --vanilla ../knockoffzoom/utils/make_FBM.R"
cmd.args <- sprintf("%s %s %s", cmd, fbm.basename, fbm.basename)
system(cmd.args)
}
for(resolution in resolution.list) {
# Basename for output FBM
fbm.basename <- sprintf("%s/example_res%s", tmp.dir, resolution)
# Basename for knockoff-augmented genotype files
gen.basename <- sprintf("%s/example", tmp.dir)
# Merge the augmented data from all chromosomes and make FBM
make.fbm(gen.basename, fbm.basename, resolution)
}
Now we fit the sparse regression model, tuning the regularization penalty by cross-validation, and compute the test statistics.
compute.stats <- function(fbm.file, key.basename, pheno.file, pheno.name, stats.basename) {
cmd <- "Rscript --vanilla ../knockoffzoom/utils/lasso.R"
cmd.args <- sprintf("%s %s %s %s %s %s",
cmd, fbm.file, key.basename, pheno.file, pheno.name, stats.basename)
system(cmd.args)
}
for(resolution in resolution.list) {
# Basename for output FBM
fbm.file <- sprintf("%s/example_res%s.rds", tmp.dir, resolution)
# Knockoff key basename (wildcard ? for chromosome number)
key.basename <- sprintf("%s/example_chr?_res%s.key", tmp.dir, resolution)
# Basename for statistics file
stats.basename <- sprintf("%s/example_res%s", tmp.dir, resolution)
# Compute test statistics
compute.stats(fbm.file, key.basename, pheno.file, pheno.name, stats.basename)
}
The significance threshold for the test statistics is calibrated by the knockoff filter. By default, in this example the nominal FDR level is 0.1. The discoveries made at each resolution will be saved in the “results” sub-directory.
filter.stats <- function(stats.basename, blocks.basename, results.basename) {
cmd <- "Rscript --vanilla ../knockoffzoom/utils/filter_stats.R"
cmd.args <- sprintf("%s %s %s %s", cmd, stats.basename, blocks.basename, results.basename)
system(cmd.args)
}
for(resolution in resolution.list) {
# Basename for statistics file
stats.basename <- sprintf("%s/example_res%s", tmp.dir, resolution)
# Basename for list of blocks at this resolution (wildcard ? for chromosome number)
blocks.basename <- sprintf("%s/example_chr?_groups%s.txt", tmp.dir, resolution)
# Basename for results file
results.basename <- sprintf("%s/example_res%s", res.dir, resolution)
# Filter stats and report discoveries
filter.stats(stats.basename, blocks.basename, results.basename)
}
The numbers of distinct discoveries reported at each resolution as listed below.
discoveries <- lapply(resolution.list, function(resolution) {
discoveries.file <- sprintf("%s/example_res%s_discoveries.txt", res.dir, resolution)
discoveries.res <- read_delim(discoveries.file, delim=" ", col_types=cols()) %>%
mutate(Resolution=factor(resolution, levels=resolution.list, labels=resolution.list))
})
discoveries <- do.call("rbind", discoveries)
discoveries %>%
complete(Resolution) %>%
group_by(Resolution) %>%
summarise(Discoveries=sum(!is.na(Group))) %>%
kable(align='cc') %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F)
Resolution | Discoveries |
---|---|
2 | 22 |
5 | 23 |
10 | 20 |
20 | 24 |
50 | 12 |
The discoveries can be visualized interactively with our Shiny app. Two examples are provided here.
source("../visualization/utils_clumping.R")
source("../visualization/utils_plotting.R")
source("../visualization/utils_shiny.R")
# Download the gene and variant annotations if they are not available
cmd <- "cd ../misc; ./download_annotations.sh; cd ../notebooks"
system(cmd)
# Load gene and variant annotations
annotations <- load_annotations("../data")
# Location of LMM p-values
lmm.dir <- "../data/lmm"
# Load results
results <- load_association_results(res.dir, lmm.dir, "example")
window.chr <- 21
window.boundaries <- find_chr_boundaries(results, window.chr)
window.left <- window.boundaries$min.BP
window.right <- window.boundaries$max.BP
plot_combined(window.chr, window.left, window.right,
results$Discoveries,
results$LMM,
results$LMM.clumped,
Annotations.func=annotations$Annotations.func,
Exons.canonical=annotations$Exons.canonical)
window.chr <- 21
window.center <- 14.9
window.width <- 0.2
window.left <- (window.center-window.width/2)*1e6
window.right <- (window.center+window.width/2)*1e6
plot_combined(window.chr, window.left, window.right,
results$Discoveries,
results$LMM,
results$LMM.clumped,
Annotations.func=annotations$Annotations.func,
Exons.canonical=annotations$Exons.canonical)