Overview

This vignette provides a complete, step-by-step guide to performing KIR allele imputation using the impute command in PONG2.

The workflow covers:

  • Preparing input data (PLINK → chr19 extraction)
  • Running basic PONG2 imputation
  • Checking SNP overlap with the 1KGP reference panel
  • Pre-phasing the KIR region with Eagle2
  • Local pre-imputation using minimac4 (--fill-missing)
  • External pre-imputation via Michigan Imputation Server
  • Interpreting results

Prerequisites

Requirement      Version     Notes
PLINK2           ≥ 2.0       Must be in PATH
R                ≥ 4.0       With PONG2 installed
minimac4         ≥ 4.1.6     Only for --fill-missing
Eagle2           ≥ 2.4       Only for pre-phasing
bgzip & tabix    HTSlib      Only for --fill-missing
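
Before starting, it can help to confirm which of these tools are visible in PATH. The loop below is a minimal sketch; the executable names (plink2, Rscript, minimac4, eagle, bgzip, tabix) are assumptions about a typical installation and may differ on your system. It does not check versions.

```shell
# Report which prerequisite tools are visible in PATH.
# Executable names are assumptions; adjust them to your installation.
missing=0
for tool in plink2 Rscript minimac4 eagle bgzip tabix; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "found:   $tool"
  else
    echo "missing: $tool"
    missing=$((missing + 1))
  fi
done
echo "$missing tool(s) missing"
```

minimac4, eagle, bgzip, and tabix are only needed for the pre-phasing and --fill-missing paths, so a "missing" line for them is not a problem for a basic run.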

Step 1: Prepare Input Data

PONG2 works best when input files are restricted to chromosome 19 (covering the KIR locus). Extract chr19 from your full-genome PLINK files:

plink2 \
  --bfile your_full_genome_prefix \
  --chr 19 \
  --make-bed \
  --out chr19_only

This creates chr19_only.bed, chr19_only.bim, and chr19_only.fam.
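
A quick sanity check on the extracted .bim file can confirm that only chromosome 19 remains. The sketch below uses a small toy .bim file so that it runs anywhere; for real data, point the awk commands at chr19_only.bim. It assumes the chromosome code in column 1 is written as either 19 or chr19.

```shell
# Toy stand-in for chr19_only.bim (columns: chr, ID, cM, bp, A1, A2).
bim=$(mktemp)
printf '19\trs1\t0\t55100000\tA\tG\n19\trs2\t0\t55200000\tC\tT\n' > "$bim"

# Count all variants, and those whose chromosome code is not 19/chr19.
total=$(awk 'END { print NR }' "$bim")
non19=$(awk '$1 != "19" && $1 != "chr19" { n++ } END { print n + 0 }' "$bim")

echo "$total variants, $non19 outside chr19"
rm -f "$bim"
```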


Step 2: Run Basic PONG2 Imputation

# --filter can be 0.005 or 0.01
# 0.005 allows more rare KIR alleles in the output
pong2 impute \
  -i chr19_only \
  -o results/basic \
  -l KIR3DL1 \
  -a hg38 \
  -t 16 \
  --filter 0.005

PONG2 will automatically check the SNP overlap between your data and the 1KGP reference panel in the KIR region and report the match rate.


Step 3: Check SNP Overlap

NOTE: KIR Region SNP Overlap between input data and 1KGP

Overlap rate is computed between your input data and the 1000 Genomes Project (1KGP) reference panel in the KIR region:

Assembly    KIR Region Coordinates
hg19        chr19:55,000,000–55,400,000
hg38        chr19:54,000,000–55,000,000

Overlap Rate    Status    Action
≥ 50%           Pass      Proceed with PONG2 directly
< 50%           Fail      Run Eagle2 + pre-imputation first

If your match rate is sufficient (≥ 50%), PONG2 will proceed automatically. If not, use one of the pre-imputation strategies below.
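
To make the threshold concrete, the overlap rate can be approximated by hand. The sketch below is not PONG2's internal calculation: it assumes the rate is the percentage of reference-panel positions inside the KIR window that also appear in the input .bim, matching on base-pair position only. The reference position list (one position per line) and the toy .bim are hypothetical stand-ins.

```shell
start=55000000; end=55400000   # hg19 KIR window from the table above

ref=$(mktemp); bim=$(mktemp)
# Hypothetical reference-panel positions, one bp per line (last one is outside the window).
printf '%s\n' 55010000 55120000 55250000 55390000 55500000 > "$ref"
# Toy input .bim (last variant is outside the window).
printf '19\trs1\t0\t55010000\tA\tG\n19\trs2\t0\t55250000\tC\tT\n' > "$bim"
printf '19\trs3\t0\t55390000\tG\tA\n19\trs4\t0\t54900000\tC\tT\n' >> "$bim"

# First file (.bim): remember in-window positions. Second file (reference): count matches.
rate=$(awk -v s="$start" -v e="$end" '
  NR == FNR { if ($4 >= s && $4 <= e) seen[$4] = 1; next }
  $1 >= s && $1 <= e { total++; if ($1 in seen) hit++ }
  END { if (total) printf "%.1f", 100 * hit / total; else printf "0.0" }
' "$bim" "$ref")

# Apply the 50% threshold from the table above.
status=$(awk -v r="$rate" 'BEGIN { s = (r >= 50) ? "Pass" : "Fail"; print s }')
echo "overlap: $rate% ($status)"
rm -f "$ref" "$bim"
```

Here three of the four in-window reference positions are present in the input, so the sketch reports 75.0% and Pass. The figure PONG2 reports for real data is the one to rely on.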


Step 4: Pre-imputation (when SNP overlap < 50%)

Pre-phasing the KIR region is required before any pre-imputation strategy.

Pre-phase with Eagle2

hg19

eagle \
  --bfile=chr19_only \
  --geneticMapFile=genetic_map_hg19.txt.gz \
  --outPrefix=chr19.phased \
  --chrom=19 \
  --numThreads=20 \
  --bpStart=55000000 \
  --bpEnd=55400000

hg38

eagle \
  --bfile=chr19_only \
  --geneticMapFile=genetic_map_hg38.txt.gz \
  --outPrefix=chr19.phased \
  --chrom=19 \
  --numThreads=20 \
  --bpStart=54000000 \
  --bpEnd=55000000

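The phased VCF used in the following steps is chr19.phased.vcf.gz. Note that, per the Eagle manual, VCF output is written for VCF/BCF input (--vcf); with PLINK input (--bfile), Eagle writes Oxford HAPS/SAMPLE files (chr19.phased.haps.gz and chr19.phased.sample). If no .vcf.gz file appears, export chr19_only to VCF first and phase that file with --vcf.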
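
Before passing the file on, it is worth confirming that the genotypes are actually phased, meaning the GT field uses "|" rather than "/". The sketch below counts unphased genotype calls in a toy, uncompressed VCF; for the real file, decompress it first (for example with bgzip -dc) and pipe the result into the same awk command. It assumes GT is the first FORMAT subfield, as the VCF specification requires when GT is present.

```shell
# Toy phased VCF with two samples and two variants.
vcf=$(mktemp)
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n'
  printf '19\t55010000\trs1\tA\tG\t.\tPASS\t.\tGT\t0|1\t1|1\n'
  printf '19\t55250000\trs2\tC\tT\t.\tPASS\t.\tGT\t0|0\t1|0\n'
} > "$vcf"

# For each sample column, take the GT subfield and count calls that contain "/".
unphased=$(awk -F'\t' '
  !/^#/ { for (i = 10; i <= NF; i++) { split($i, f, ":"); if (f[1] ~ /\//) n++ } }
  END { print n + 0 }
' "$vcf")

echo "unphased genotype calls: $unphased"
rm -f "$vcf"
```

A non-zero count suggests the file was not phased, or only partly phased.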


Option A: Local Pre-imputation with minimac4 (built-in)

Pass the pre-phased VCF directly to PONG2 using --vcf and --fill-missing.

Important: --vcf is the only input required with --fill-missing.
PLINK files cannot hold phased haplotype data — the pipeline derives everything from the VCF internally. Do not supply -i together with --fill-missing.

pong2 impute \
  --vcf chr19.phased.vcf.gz \
  -o results/local_impute \
  -l KIR3DL1 \
  -a hg19 \
  -t 20 \
  --filter 0.005 \
  --fill-missing

Option B: External Pre-imputation via Michigan Imputation Server

Pre-impute your chr19 data using a public server before running PONG2. This is the approach used in the PONG2 manuscript.

Step B1: Export phased VCF

The phased VCF from Eagle2 (chr19.phased.vcf.gz) is ready for upload. If you would rather upload unphased data and let the server phase it, export a VCF from the PLINK files instead:

plink2 \
  --bfile chr19_only \
  --export vcf bgz \
  --out chr19_only
tabix -p vcf chr19_only.vcf.gz

Step B2: Upload to Michigan Imputation Server

  • URL: https://imputationserver.sph.umich.edu/
  • Reference panel: TOPMed r5 (recommended for diverse populations) or 1KGP Phase 3
  • Genome build: match your data (hg19 or hg38)
  • Chromosome: 19 only
  • Phasing: select Eagle v2.4 if uploading unphased data; skip if already phased
  • Submit and wait for email notification (typically hours to days)

Step B3: Download and convert results

# Unzip results (password provided by server via email)
unzip -P <password> chr19.zip

# Convert imputed VCF to PLINK
plink2 \
  --vcf chr19.dose.vcf.gz dosage=DS \
  --import-dosage-certainty 0.3 \
  --make-bed \
  --out imputed_chr19

Step B4: Run PONG2 on imputed data

pong2 impute \
  -i imputed_chr19 \
  -o results/final \
  -l KIR3DL1 \
  -a hg38 \
  -t 16 \
  --filter 0.005

Option C: Proceed without pre-imputation (--force)

To proceed despite a low SNP match rate, pass --force. Use this only when you understand the implications for accuracy:

pong2 impute \
  -i chr19_only \
  -o results/forced \
  -l KIR3DL1 \
  -a hg19 \
  --force

Step 5: Interpreting Output

After pong2 impute completes, results are saved in <output>/KIR/:

File                 Description
KIR/<locus>.csv      Predicted KIR alleles per sample (main results)
KIR/<locus>.RData    Full prediction object including allele probabilities

Output CSV format

sample.id, KIR3DL1.1, KIR3DL1.2, prob.KIR3DL1.1, prob.KIR3DL1.2
HG00096,   KIR3DL1*001, KIR3DL1*002, 0.98, 0.95
HG00097,   KIR3DL1*005, KIR3DL1*015, 0.87, 0.91
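
The probability columns make it easy to list samples that may need a closer look. The sketch below reads the example rows shown above and prints the sample IDs where either allele probability falls below a cutoff. The 0.9 cutoff is an arbitrary illustration, not a PONG2 recommendation, and the parsing assumes unquoted, comma-separated fields in the column order shown.

```shell
# Example rows copied from the format shown above.
csv=$(mktemp)
cat > "$csv" <<'EOF'
sample.id, KIR3DL1.1, KIR3DL1.2, prob.KIR3DL1.1, prob.KIR3DL1.2
HG00096,   KIR3DL1*001, KIR3DL1*002, 0.98, 0.95
HG00097,   KIR3DL1*005, KIR3DL1*015, 0.87, 0.91
EOF

# Split on commas (plus any following spaces), skip the header,
# and print sample IDs where either probability is below the cutoff t.
low=$(awk -F',[ ]*' -v t=0.9 'NR > 1 && ($4 < t || $5 < t) { print $1 }' "$csv")

echo "low-confidence samples: $low"
rm -f "$csv"
```

With these rows, only HG00097 is listed, because its first allele probability (0.87) is below the cutoff.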

Large sample datasets

For datasets with >2,000 samples, PONG2 automatically splits prediction into chunks of 2,000 samples to prevent memory issues. Results are combined and saved as a single output file — no action required from the user.
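
As a worked illustration of the chunking arithmetic only (PONG2's internal implementation may differ), the sketch below shows how a hypothetical cohort of 4,500 samples would divide into chunks of 2,000.

```shell
n_samples=4500      # hypothetical cohort size
chunk_size=2000     # chunk size stated above

# Ceiling division: number of chunks needed to cover all samples.
n_chunks=$(( (n_samples + chunk_size - 1) / chunk_size ))

i=1
while [ "$i" -le "$n_chunks" ]; do
  first=$(( (i - 1) * chunk_size + 1 ))
  last=$(( i * chunk_size ))
  # The final chunk may be shorter than chunk_size.
  if [ "$last" -gt "$n_samples" ]; then
    last=$n_samples
  fi
  echo "chunk $i: samples $first-$last"
  i=$(( i + 1 ))
done
```

This gives three chunks: samples 1-2000, 2001-4000, and 4001-4500.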


Summary: Which Workflow to Choose?

Scenario                               Recommended approach
SNP overlap ≥ 50%                      Run pong2 impute -i directly
SNP overlap < 50%, quick run needed    Eagle2 → pong2 impute --vcf --fill-missing
SNP overlap < 50%, highest accuracy    Eagle2 → Michigan Server → pong2 impute -i
Low overlap, understand risks          pong2 impute -i --force

Next Steps

Happy KIR imputation! 🧬