PONG2 Imputation Workflow

Overview

This vignette provides a complete, step-by-step guide to performing KIR allele imputation using the impute command in PONG2.

The workflow covers:

Preparing input data (PLINK → chr19 extraction)
Running basic PONG2 imputation
Checking SNP overlap with the 1KGP reference panel
Pre-phasing the KIR region with Eagle2
Local pre-imputation using minimac4 (--fill-missing)
External pre-imputation via Michigan Imputation Server
Interpreting results

Prerequisites

Requirement	Version	Notes
PLINK2	≥ 2.0	Must be in PATH
R	≥ 4.0	With PONG2 installed
minimac4	≥ 4.1.6	Only for `--fill-missing`
Eagle2	≥ 2.4	Only for pre-phasing
bgzip & tabix	HTSlib	For `--fill-missing` and for indexing exported VCFs

Step 1: Prepare Input Data

PONG2 works best when input files are restricted to chromosome 19 (covering the KIR locus). Extract chr19 from your full-genome PLINK files:

plink2 \
  --bfile your_full_genome_prefix \
  --chr 19 \
  --make-bed \
  --out chr19_only

This creates chr19_only.bed, chr19_only.bim, and chr19_only.fam.
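A quick sanity check after extraction can be sketched as follows. The helper name `bim_chrom_summary` is hypothetical (it is not part of PLINK2 or PONG2); it relies only on the standard PLINK .bim layout, in which column 1 holds the chromosome code.

```shell
#!/usr/bin/env bash
# Hypothetical helper: tally variants per chromosome code in a PLINK .bim file.
# Column 1 of a .bim file is the chromosome code.
bim_chrom_summary() {
  awk '{ n[$1]++ } END { for (c in n) print c, n[c] }' "$1" | sort
}

# Usage (assumes chr19_only.bim from the step above exists):
#   bim_chrom_summary chr19_only.bim
```

If the extraction worked, the summary should contain a single line for chromosome 19 followed by the variant count.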

Step 2: Run Basic PONG2 Imputation

# --filter can be 0.005 or 0.01
# 0.005 allows more rare KIR alleles in the output
pong2 impute \
  -i chr19_only \
  -o results/basic \
  -l KIR3DL1 \
  -a hg38 \
  -t 16 \
  --filter 0.005

PONG2 will automatically check the SNP overlap between your data and the 1KGP reference panel in the KIR region and report the match rate.

Step 3: Check SNP Overlap

NOTE: KIR Region SNP Overlap between input data and 1KGP

Overlap rate is computed between your input data and the 1000 Genomes Project (1KGP) reference panel in the KIR region:

Assembly	KIR Region Coordinates
hg19	chr19:55,000,000–55,400,000
hg38	chr19:54,000,000–55,000,000

Overlap Rate	Status	Action
≥ 50%	Pass	Proceed with PONG2 directly
< 50%	Fail	Run Eagle2 + pre-imputation first

If your match rate is sufficient (≥ 50%), PONG2 will proceed automatically. If not, use one of the pre-imputation strategies below.
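To estimate the overlap yourself before running PONG2, a minimal sketch is shown here. It makes two assumptions that the vignette does not confirm: variants are matched by base-pair position only, and the rate is the share of reference positions in the KIR window that also appear in your data. The reference positions file (one position per line) and the function name `kir_overlap_rate` are hypothetical, so the number may differ from what PONG2 reports.

```shell
#!/usr/bin/env bash
# Sketch: percentage of reference-panel positions inside the KIR window that
# are also present in the input .bim file. Matching is by bp position only
# (assumption); the built-in PONG2 check may use different matching rules.
#   $1 = input .bim file          (column 4 = bp position)
#   $2 = reference positions file (hypothetical: one bp position per line)
#   $3 = window start, $4 = window end
kir_overlap_rate() {
  awk -v s="$3" -v e="$4" '
    NR == FNR { if ($4 >= s && $4 <= e) seen[$4] = 1; next }   # pass 1: .bim
    $1 >= s && $1 <= e { total++; if ($1 in seen) hit++ }      # pass 2: reference
    END { if (total) printf "%.1f\n", 100 * hit / total; else print "NA" }
  ' "$1" "$2"
}

# Usage with the hg19 window from the table above:
#   kir_overlap_rate chr19_only.bim ref_positions.txt 55000000 55400000
```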

Step 4: Pre-imputation (when SNP overlap < 50%)

Pre-phase the KIR region before pre-imputation. Option A requires a phased VCF, and Option B accepts one (the server can also phase unphased uploads).

Pre-phase with Eagle2

hg19

eagle \
  --bfile=chr19_only \
  --geneticMapFile=genetic_map_hg19.txt.gz \
  --outPrefix=chr19.phased \
  --chrom=19 \
  --numThreads=20 \
  --bpStart=55000000 \
  --bpEnd=55400000

hg38

eagle \
  --bfile=chr19_only \
  --geneticMapFile=genetic_map_hg38.txt.gz \
  --outPrefix=chr19.phased \
  --chrom=19 \
  --numThreads=20 \
  --bpStart=54000000 \
  --bpEnd=55000000

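The steps below expect a phased VCF named chr19.phased.vcf.gz. Note that Eagle2 writes VCF output only when it is given VCF input (--vcf); with --bfile input it writes Oxford HAPS/SAMPLE files (chr19.phased.haps.gz and chr19.phased.sample), which must be converted to VCF before continuing.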
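The two Eagle2 commands differ only in the genetic map and the window boundaries, so both can be derived from the assembly name. The sketch below assumes a hypothetical helper, `kir_window`, that returns the coordinates listed in the Step 3 table; the Eagle2 options in the usage comment are the same ones used above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "start end" of the KIR window for an assembly,
# using the coordinates from the table in Step 3.
kir_window() {
  case "$1" in
    hg19) echo "55000000 55400000" ;;
    hg38) echo "54000000 55000000" ;;
    *)    echo "unknown assembly: $1" >&2; return 1 ;;
  esac
}

# Usage sketch (not run here):
#   assembly=hg38
#   set -- $(kir_window "$assembly")
#   eagle --bfile=chr19_only \
#     --geneticMapFile="genetic_map_${assembly}.txt.gz" \
#     --outPrefix=chr19.phased --chrom=19 --numThreads=20 \
#     --bpStart="$1" --bpEnd="$2"
```

Keeping the coordinates in one place avoids mixing an hg19 window with an hg38 genetic map.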

Option A: Local Pre-imputation with minimac4 (built-in)

Pass the pre-phased VCF directly to PONG2 using --vcf and --fill-missing.

Important: with --fill-missing, --vcf is the only input required.
PLINK binary files cannot store phased haplotypes, so the pipeline derives everything it needs from the VCF. Do not supply -i together with --fill-missing.

pong2 impute \
  --vcf chr19.phased.vcf.gz \
  -o results/local_impute \
  -l KIR3DL1 \
  -a hg19 \
  -t 20 \
  --filter 0.005 \
  --fill-missing
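Because --fill-missing depends on minimac4, bgzip, and tabix (see Prerequisites), it can help to confirm they are on PATH before starting a long run. This is a minimal sketch; `missing_tools` is a hypothetical helper, not a PONG2 command.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report which of the given tools are missing from PATH.
# Prints one missing tool per line; returns non-zero if any is missing.
missing_tools() {
  rc=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      rc=1
    fi
  done
  return "$rc"
}

# Usage: stop early if a --fill-missing dependency is absent.
#   missing_tools minimac4 bgzip tabix || echo "install the tools listed above" >&2
```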

Option B: External Pre-imputation (recommended for highest accuracy)

Pre-impute your chr19 data using a public server before running PONG2. This is the approach used in the PONG2 manuscript.

Step B1: Export phased VCF

The phased VCF from Eagle2 (chr19.phased.vcf.gz) is ready for upload. If you prefer to let the server phase your data, export an unphased VCF from PLINK instead:

plink2 \
  --bfile chr19_only \
  --export vcf bgz \
  --out chr19_only
tabix -p vcf chr19_only.vcf.gz

Step B2: Upload to Michigan Imputation Server

URL: https://imputationserver.sph.umich.edu/
Reference panel: TOPMed r5 (recommended for diverse populations) or 1KGP Phase 3
Genome build: match your data (hg19 or hg38)
Chromosome: 19 only
Phasing: select Eagle v2.4 if uploading unphased data; skip if already phased
Submit and wait for email notification (typically hours to days)

Step B3: Download and convert imputed VCF to PLINK

# Unzip results (the server sends the password by email).
# Replace YOUR_PASSWORD with that password; the quotes protect special characters.
unzip -P 'YOUR_PASSWORD' chr19.zip

# Convert imputed VCF to PLINK
plink2 \
  --vcf chr19.dose.vcf.gz dosage=DS \
  --import-dosage-certainty 0.3 \
  --make-bed \
  --out imputed_chr19

Step B4: Run PONG2 on imputed data

pong2 impute \
  -i imputed_chr19 \
  -o results/final \
  -l KIR3DL1 \
  -a hg38 \
  -t 16 \
  --filter 0.005

Option C: Force imputation (not recommended)

Proceed despite a low SNP match rate. Use this option only when you understand the implications for accuracy:

pong2 impute \
  -i chr19_only \
  -o results/forced \
  -l KIR3DL1 \
  -a hg19 \
  --force

Step 5: Interpreting Output

After pong2 impute completes, results are saved in <output>/KIR/:

File	Description
`KIR/<locus>.csv`	Predicted KIR alleles per sample (main results)
`KIR/<locus>.RData`	Full prediction object including allele probabilities

Output CSV format

sample.id, KIR3DL1.1, KIR3DL1.2, prob.KIR3DL1.1, prob.KIR3DL1.2
HG00096,   KIR3DL1*001, KIR3DL1*002, 0.98, 0.95
HG00097,   KIR3DL1*005, KIR3DL1*015, 0.87, 0.91
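The probability columns make it easy to flag calls that deserve a second look. The sketch below assumes the five-column layout shown above with a header row. The helper name `low_confidence_samples` and the 0.9 cutoff in the usage line are illustrative choices, not PONG2 recommendations.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list samples whose lower allele probability is below a
# threshold. Assumes the layout shown above:
#   sample.id, allele 1, allele 2, prob 1, prob 2 (with a header row).
#   $1 = results CSV, $2 = probability threshold
low_confidence_samples() {
  awk -F',' -v t="$2" '
    NR == 1 { next }                          # skip header
    {
      id = $1
      gsub(/^[ \t]+|[ \t]+$/, "", id)         # trim padding around the sample ID
      p1 = $4 + 0; p2 = $5 + 0                # force numeric values
      min = (p1 < p2) ? p1 : p2
      if (min < t + 0) print id "," min
    }
  ' "$1"
}

# Usage (illustrative cutoff):
#   low_confidence_samples results/basic/KIR/KIR3DL1.csv 0.9
```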

Large sample datasets

For datasets with more than 2,000 samples, PONG2 automatically splits prediction into chunks of 2,000 samples to prevent memory issues. Results are combined and saved as a single output file, so no action is required from the user.
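As a rough guide to how many chunks to expect, the count is the sample size divided by 2,000, rounded up. The sketch below assumes the final chunk simply holds the remainder; `n_chunks` is a hypothetical helper, and the exact chunk boundaries are internal to PONG2.

```shell
#!/usr/bin/env bash
# Hypothetical helper: number of 2,000-sample chunks for a given sample count,
# computed as an integer ceiling division.
n_chunks() {
  echo $(( ($1 + 2000 - 1) / 2000 ))
}

# Usage: a .fam file has one line per sample.
#   n_chunks "$(wc -l < chr19_only.fam)"
```

For example, 4,500 samples give three chunks (2,000 + 2,000 + 500).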

Summary: Which Workflow to Choose?

Scenario	Recommended approach
SNP overlap ≥ 50%	Run `pong2 impute -i` directly
SNP overlap < 50%, quick run needed	Eagle2 → `pong2 impute --vcf --fill-missing`
SNP overlap < 50%, highest accuracy	Eagle2 → Michigan Server → `pong2 impute -i`
Low overlap, understand risks	`pong2 impute -i --force`

Next Steps

See vignette PONG2-training for custom model training
Run the complete end-to-end workflow script: example/full_workflow.sh
Report issues: Open a GitHub issue

Happy KIR imputation! 🧬

Norman Lab