Skip to contents

slideimp is a lightweight R package for fast K-NN and PCA imputation of missing values in high-dimensional numeric matrices.

Installation

The stable version of slideimp can be installed from CRAN using:

install.packages("slideimp")

You can install the development version of slideimp with:

pak::pkg_install("hhp94/slideimp")

You can install the optional slideimp.extra package (which provides lightweight Illumina manifests) with:

pak::pkg_install("hhp94/slideimp.extra")

PCA or K-NN imputation of Illumina DNAm microarrays

  • This is a simulated beta matrix from the MSA microarray. CpGs are in columns.
dim(MSA_beta_matrix)
# [1]     20 281797
MSA_beta_matrix[1:4, 1:4]
#         cg06185909_TC11 cg18975462_BC11 cg20516119_TC11 cg10149399_BC11
# sample1              NA       0.5023435       0.3835431              NA
# sample2       0.4907466       0.5095459       0.9025816       0.4313347
# sample3       0.6885036              NA       0.7646753       0.4498772
# sample4       0.0000000       0.0000000              NA       0.0000000
  • Chromosome-wise imputation of Illumina microarrays can be performed with a single function call (requires the Illumina manifests, conveniently provided by the slideimp.extra package).

  • This example demonstrates PCA imputation. To use K-NN imputation instead, supply the k argument.

library(slideimp.extra)
library(slideimp)

imputed <- group_imp(
  obj = MSA_beta_matrix,
  group = "MSA", # <- this feature requires the `slideimp.extra` package
  ncp = 10, # <- change to `k` for K-NN imputation
  clamp = c(0, 1), # <- PCA imputation can generate values outside of `c(0, 1)`
  .progress = FALSE # <- turn on to monitor progress of longer running jobs
)
# Found cleaned manifest for 'MSA'
# ! 1 group dropped: no features remaining after matching `obj_cn`.
# ℹ Dropped group indices: 24
# Imputing 25 groups using PCA.
# Running mode: sequential

print(imputed, n = 4, p = 4)
# Method: group_imp (PCA imputation)
# Dimensions: 20 x 281797
#
#         cg06185909_TC11 cg18975462_BC11 cg20516119_TC11 cg10149399_BC11
# sample1       0.1517542       0.5023435      0.38354308       0.2067731
# sample2       0.4907466       0.5095459      0.90258164       0.4313347
# sample3       0.6885036       0.7339375      0.76467530       0.4498772
# sample4       0.0000000       0.0000000      0.05230101       0.0000000
# # Showing 4 of 20 rows and 4 of 281797 columns

Troubleshooting tips

  • group_imp() fails with unmapped probes: your matrix likely contains sesame generated control probes (prefixed ctl_) or you picked the wrong manifest ("EPICv2" vs. "EPICv2_deduped"). Check the manifest, remove "ctl" probes with obj <- obj[, !grepl("^ctl", colnames(obj))], or pass allow_unmapped = TRUE to bypass.
  • Impute the MSA or EPICv2 data with the duplicated probes, then dedup with slideimp.extra::dedup_matrix().

Performance tips

  • For PCA performance tuning, including guidance on choosing threshold, selecting solver, configuring lobpcg_control, and using mirai or BLAS threading effectively, see Speeding up PCA imputation.
  • For PCA imputation of bounded data, i.e., DNA methylation beta values, use clamp = c(0, 1) to restrict PCA-imputed values to the valid range.
  • For parallel PCA imputation with a threaded BLAS such as OpenBLAS or MKL, set pin_blas = TRUE so BLAS threads and mirai workers do not compete for cores. This requires RhpcBLASctl.
  • For faster PCA imputation on Windows machines, advanced users can replace R’s default reference BLAS with OpenBLAS. See this community guide.
  • For parallel K-NN imputation, use the cores argument, which uses RcppThread’s parallelFor, instead of mirai. If you only need clock CpGs, pass subset to skip all other probes.

Extended Workflow

  • We simulate DNA methylation (DNAm) microarray data from 2 chromosomes. All slideimp functions expect the input to be a numeric matrix where variables are stored in the columns.
library(slideimp)
set.seed(1234)
sim_obj <- sim_mat(n = 20, p = 100, n_col_groups = 2)

# Extract the $input attribute
obj <- sim_obj$input

# Extract metadata to map the features to groups
group_df <- sim_obj$col_group
  • Hyperparameters are tuned using tune_imp(). We evaluate the following options with grid search:

    • Number of nearest neighbors (k): 5 or 20
    • Inverse distance power (dist_pow) for weighted averaging: 1 or 2 (higher values assign lower weights to more distant neighbors)
  • Tuning is performed on a subset of the data. We use 10 repeats (n_reps = 10) of cross-validation for evaluation. We artificially mask 50 observed values (num_na = 50) to compute the RMSE and MAE. Use larger values for both n_reps and num_na in real analyses for more reliable error estimates.

  • Note: Parallelization via the cores argument is only available for K-NN imputation with RcppThread.

knn_params <- expand.grid(k = c(5, 20), dist_pow = c(1, 2))
group2_columns <- subset(group_df, group == "group2")
group2_only <- obj[, group2_columns$feature]

tune_knn <- tune_imp(
  group2_only,
  parameters = knn_params,
  .f = "knn_imp",
  cores = 8,
  n_reps = 10,
  num_na = 50
)
#> Tuning `knn_imp()`
#> Step 1/2: Resolving NA locations
#> Running mode: threaded (8 cores)
#> Step 2/2: Tuning
metrics <- compute_metrics(tune_knn)

# equivalently: dplyr::summarize(metrics, ..., .by = c(k, dist_pow, .metric))
sum_metrics <- do.call(
  data.frame,
  aggregate(
    .estimate ~ k + dist_pow + .metric,
    data = metrics,
    FUN = function(x) {
      c(
        n = length(x),
        mean_error = mean(x),
        sd_error = sd(x)
      )
    }
  )
)

sum_metrics[order(sum_metrics$.estimate.mean_error), ]
#>    k dist_pow .metric .estimate.n .estimate.mean_error .estimate.sd_error
#> 2 20        1     mae          10            0.1501427         0.01599598
#> 4 20        2     mae          10            0.1519464         0.01638402
#> 1  5        1     mae          10            0.1656087         0.01683512
#> 3  5        2     mae          10            0.1669792         0.01666404
#> 6 20        1    rmse          10            0.1897722         0.01915653
#> 8 20        2    rmse          10            0.1918765         0.01936392
#> 5  5        1    rmse          10            0.2081833         0.01949624
#> 7  5        2    rmse          10            0.2101955         0.01884442
  • For PCA imputation and custom functions, set up parallelization with mirai::daemons(). See Speeding up PCA imputation for additional PCA performance-tuning guidance.
  • Note: For machines with multi-threaded BLAS, turn on pin_blas = TRUE when tuning PCA imputation in parallel to avoid thrashing.
mirai::daemons(2) # 2 Cores

# PCA imputation.
pca_params <- data.frame(ncp = c(1, 5))
# For machines with multi-threaded BLAS, turn on `pin_blas = TRUE`
tune_pca <- tune_imp(obj, parameters = pca_params, .f = "pca_imp", n_reps = 10, num_na = 50)

mirai::daemons(0) # Close daemons
  • Then perform imputation with group_imp() using the best parameters.
knn_group_results <- group_imp(obj, group = group_df, k = 20, dist_pow = 1, cores = 2)
#> Imputing 2 groups using KNN.
#> Running mode: threaded (2 cores)
knn_group_results
#> Method: group_imp (KNN imputation)
#> Dimensions: 20 x 100
#> 
#>          feature1  feature2  feature3  feature4  feature5  feature6
#> sample1 0.3486482 0.7385414 0.4077444 0.1607935 0.3924661 0.2434143
#> sample2 0.5338935 0.4724364 0.9663621 0.4788070 0.5061132 0.3923603
#> sample3 0.7185848 0.7351035 0.6724479 0.3162537 0.7634236 1.0000000
#> sample4 0.1734418 0.0000000 0.0000000 0.0000000 0.0000000 0.2106358
#> sample5 0.5388440 0.5306182 0.5685354 0.5383513 0.4680080 0.8518388
#> sample6 0.3768380 0.5570723 0.8764909 0.5276245 0.6722794 0.5740639
#> # Showing 6 of 20 rows and 6 of 100 columns
  • PCA imputation with group_imp() can be parallelized using the mirai package, similar to tune_imp(). pin_blas = TRUE can help PCA imputation on multi-threaded BLAS machines.
# Similar to `tune_imp`, parallelization is controlled by `mirai::daemons()`
mirai::daemons(2)
pca_group_results <- group_imp(obj, group = group_df, ncp = 10)
mirai::daemons(0)
full_knn_results <- knn_imp(obj = obj, k = 20)
full_pca_results <- pca_imp(obj = obj, ncp = 10)

Sliding Window Imputation

  • slide_imp() performs sliding window imputation.
  • This function is not for Illumina DNAm microarrays. See group_imp().
  • Note:
    • DNAm WGBS/EM-seq data should be grouped by chromosomes (i.e., run slide_imp() separately on each chromosome) before imputation. See the package vignette for more details.
    • See vignette for details about selecting window_size and overlap_size with tune_imp().
# Simulate some data
chr1_beta <- sim_mat(n = 10, p = 2000)$input
dim(chr1_beta)
#> [1]   10 2000
chr1_beta[1:5, 1:5]
#>          feature1  feature2  feature3   feature4  feature5
#> sample1 0.7500638 0.5323295 0.6095626 0.96762386 0.5149855
#> sample2 0.2809107 0.8695599        NA         NA 0.6040981
#> sample3 0.9409348 0.5445597 0.6432675 1.00000000 0.5613868
#> sample4 0.5946795 0.0000000        NA 0.07237333 0.4410413
#> sample5 0.8664253 0.6206139 0.3444691 0.52025046 0.5220036
  • slide_imp() parameters:

    • location: required - a sorted numeric vector of length ncol(obj) specifying the position of each column (e.g., genomic coordinates in bp).
    • window_size: required - width of each sliding window (same unit as location).
    • overlap_size: optional - overlap width between consecutive windows (same units as location). Must be strictly less than window_size.
    • min_window_n: required - minimum number of columns a window must contain to be imputed. Windows with fewer columns than this threshold are dropped. Must be greater than k (for K-NN) or ncp (for PCA).
    • dry_run: optional - return just the calculated windows to examine which windows are included.
    • k: required - (specifying K-NN imputation) number of nearest neighbors to use inside each window.
    • ncp: required - (specifying PCA imputation) number of principal components to retain. Use this instead of k when performing sliding-window PCA imputation.
    • subset: optional - impute just a subset of features (i.e., just clock CpGs).
    • flank: optional - build flanking windows of window_size around features provided in subset.
  • First, let’s perform a dry run to examine the windows that will be imputed by slide_imp.

location <- seq_len(ncol(chr1_beta)) # 1, 2, ..., 2000 for this simulated chromosome

slide_imp(
  obj = chr1_beta,
  location = location,
  window_size = 50, # select with `tune_imp()`
  overlap_size = 5,
  min_window_n = 11, # must be > k = 10
  dry_run = TRUE
)
#> # slideimp table: 45 x 4
#>  start end window_n  subset_local
#>      1  50       50 <double [50]>
#>     46  95       50 <double [50]>
#>     91 140       50 <double [50]>
#>    136 185       50 <double [50]>
#>    181 230       50 <double [50]>
#>    226 275       50 <double [50]>
#>    271 320       50 <double [50]>
#>    316 365       50 <double [50]>
#>    361 410       50 <double [50]>
#>    406 455       50 <double [50]>
#> # ... with 35 more rows
  • Then, we can perform the imputation using the given parameters.
slide_imp(
  obj = chr1_beta,
  location = location,
  window_size = 50, # select with `tune_imp()`
  overlap_size = 5,
  min_window_n = 11, # must be > k = 10
  k = 10, # select with `tune_imp()`
  cores = 2,
  .progress = FALSE
)
#> Method: slide_imp (KNN imputation)
#> Dimensions: 10 x 2000
#> 
#>          feature1  feature2  feature3   feature4  feature5  feature6
#> sample1 0.7500638 0.5323295 0.6095626 0.96762386 0.5149855 1.0000000
#> sample2 0.2809107 0.8695599 0.6324029 0.62469147 0.6040981 0.1583159
#> sample3 0.9409348 0.5445597 0.6432675 1.00000000 0.5613868 0.3054879
#> sample4 0.5946795 0.0000000 0.5837423 0.07237333 0.4410413 0.6101870
#> sample5 0.8664253 0.6206139 0.3444691 0.52025046 0.5220036 0.7464794
#> sample6 0.8157626 0.3053222 0.7227880 0.57711498 0.4576367 0.0000000
#> # Showing 6 of 10 rows and 6 of 2000 columns

Core functions

  • group_imp(): Parallelizable group-wise (e.g., by chromosomes or column clusters) K-NN or PCA imputation with optional auxiliary features and group-wise parameters. The group argument supports:
    • Choices such as "EPICv2" or "MSA" (requires the slideimp.extra package).
    • (Basic): a data.frame containing a feature column (character strings of feature names) and a group column (defining the group).
    • (Advanced): use prep_groups() to create groups based on a mapping data.frame (i.e., Illumina manifests) with custom group-wise parameters.
  • slide_imp(): Sliding window K-NN or PCA imputation for extremely high-dimensional numeric matrices (WGBS/EM-seq) with ordered features (i.e., by genomic position).
  • tune_imp(): Parallelizable hyperparameter tuning with repeated cross-validation that works with built-in or custom imputation functions.
  • knn_imp(): Full K-NN imputation with multi-core parallelization, mlpack Ball tree nearest neighbor implementation (for data with very low missing rates and extremely high dimensions), and optional subset imputation (ideal for epigenetic clock calculations).
  • pca_imp(): Optimized version of missMDA::imputePCA() for high-dimensional numeric matrices.
  • col_vars() and mean_imp_col() provide fast column-wise variance and mean imputation using the high-performance RcppArmadillo backend with full RcppThread parallelization support.