Impute missing values in a numeric matrix using full K-nearest neighbors (K-NN).

Usage

knn_imp(
  obj,
  k,
  colmax = 0.9,
  method = c("euclidean", "manhattan"),
  cores = 1,
  post_imp = TRUE,
  subset = NULL,
  dist_pow = 0,
  na_check = TRUE,
  .progress = FALSE
)

Arguments

obj

A numeric matrix with samples in rows and features in columns.

k

Integer. Number of nearest neighbors to use for K-NN imputation.

colmax

Numeric scalar between 0 and 1. Columns with a missing-data proportion greater than colmax are excluded from the main imputation method. Excluded columns are left unchanged unless post_imp = TRUE, in which case remaining missing values are replaced by column means when possible.
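
The colmax rule can be illustrated with a few lines of base R. This is only a sketch of the screening step as described above, not the package's internal code; the helper name exceeds_colmax is hypothetical.

```r
# Hypothetical helper (not part of the package): flag columns whose
# missing-data proportion is strictly greater than `colmax`.
exceeds_colmax <- function(obj, colmax = 0.9) {
  stopifnot(is.matrix(obj), is.numeric(obj), colmax >= 0, colmax <= 1)
  # is.na() gives a logical matrix; its column means are the NA proportions.
  colMeans(is.na(obj)) > colmax
}

m <- cbind(
  a = c(1, NA, 3, 4),           # 25% missing
  b = c(NA, NA, NA, NA_real_),  # 100% missing
  c = c(1, 2, NA, NA)           # 50% missing
)

# With colmax = 0.5 only column `b` is flagged, because the comparison is
# "greater than", so a column at exactly 50% missing is still eligible.
excluded <- exceeds_colmax(m, colmax = 0.5)
```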

method

Character. K-NN imputation distance method: either "euclidean" or "manhattan".

cores

Integer. Number of cores to use for K-NN imputation. Defaults to 1.

post_imp

Logical. If TRUE, replace missing values remaining after the main imputation method with column means when possible.
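
A minimal base R sketch of a column-mean fallback follows. It assumes that "when possible" means a column has at least one observed value; the helper name mean_fill is hypothetical and the package may implement this step differently.

```r
# Hypothetical helper (not part of the package): replace remaining NAs with
# the mean of the observed values in the same column.
mean_fill <- function(obj) {
  stopifnot(is.matrix(obj), is.numeric(obj))
  col_means <- colMeans(obj, na.rm = TRUE)
  # A column with no observed values has no mean (NaN); keep it as NA.
  col_means[is.nan(col_means)] <- NA_real_
  idx <- which(is.na(obj), arr.ind = TRUE)  # (row, col) of every NA
  obj[idx] <- col_means[idx[, "col"]]
  obj
}

m <- cbind(
  a = c(1, NA, 3),
  b = c(NA, NA, NA_real_)
)
filled <- mean_fill(m)
# filled[2, "a"] is the mean of 1 and 3; column `b` stays NA throughout.
```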

subset

Optional character or integer vector specifying columns to target for imputation. If NULL, all eligible columns are targeted.

dist_pow

Numeric. Power used to penalize more distant neighbors in the weighted average. dist_pow = 0 gives an unweighted average of the nearest neighbors.

na_check

Logical. If TRUE, check whether the returned matrix still contains missing values.

.progress

Logical. If TRUE, show imputation progress.

Value

A numeric matrix of the same dimensions as obj, with missing values imputed. The returned object has class slideimp_results.

Details

knn_imp() performs imputation column-wise, treating rows as observations and columns as features.

Nearest neighbors are found by brute-force search: distances to all candidates are computed exhaustively, with no approximate or tree-based index.
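
To make the idea of a brute-force search concrete, here is a small base R sketch. Several details are assumptions made for illustration and are not confirmed by this page: neighbors are taken to be other columns of the matrix, distances use only the rows observed in both columns, and they are averaged over those rows so that pairs with different overlap stay comparable. The function name brute_force_neighbors is hypothetical.

```r
# Sketch only (not the package implementation): exhaustive neighbor search
# for one target column of a numeric matrix.
brute_force_neighbors <- function(obj, target, k,
                                  method = c("euclidean", "manhattan")) {
  method <- match.arg(method)
  candidates <- setdiff(seq_len(ncol(obj)), target)
  # Compute the distance from the target to every other column.
  dists <- vapply(candidates, function(j) {
    shared <- !is.na(obj[, target]) & !is.na(obj[, j])
    if (!any(shared)) return(NA_real_)  # no overlap: distance undefined
    d <- obj[shared, target] - obj[shared, j]
    if (method == "euclidean") sqrt(mean(d^2)) else mean(abs(d))
  }, numeric(1))
  ord <- order(dists, na.last = NA)     # drops candidates with no overlap
  keep <- head(ord, k)
  data.frame(column = candidates[keep], distance = dists[keep])
}

m <- cbind(
  c(1, 2, NA, 4),
  c(1, 2, 3, 4),
  c(2, 3, 4, 5),
  c(9, 9, 9, 9)
)
# Columns 2 and 3 are the two closest columns to column 1.
nn <- brute_force_neighbors(m, target = 1, k = 2)
```

The cost grows with the number of candidate columns for every target, which is why restricting the targets or the matrix size matters for large inputs.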

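When dist_pow > 0, imputed values are computed as distance-weighted averages: each neighbor's weight is the inverse of its distance raised to the power dist_pow (1 / distance^dist_pow), so larger values of dist_pow discount distant neighbors more strongly. When dist_pow = 0, all k neighbors contribute equally.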
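
The weighting scheme can be sketched in base R as follows. The helper name weighted_knn_value is hypothetical, and the small floor on distances is an assumption added here to avoid division by zero; how the package treats zero distances is not stated on this page.

```r
# Sketch only: combine neighbor values into one imputed value using
# inverse-distance weights raised to `dist_pow`.
weighted_knn_value <- function(values, distances, dist_pow = 0) {
  stopifnot(length(values) == length(distances), all(distances >= 0))
  if (dist_pow == 0) return(mean(values))  # unweighted average
  # Assumption: floor tiny distances so exact matches do not divide by zero.
  w <- 1 / pmax(distances, 1e-10)^dist_pow
  sum(w * values) / sum(w)
}

vals  <- c(10, 20, 40)  # values contributed by three neighbors
dists <- c(1, 2, 4)     # their distances to the target

unweighted <- weighted_knn_value(vals, dists, dist_pow = 0)
weighted   <- weighted_knn_value(vals, dists, dist_pow = 1)
# With dist_pow = 1 the weights are 1, 0.5 and 0.25, so the closest
# neighbor pulls the result below the plain mean.
```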

K-NN performance optimization

  • Use subset when only specific columns need imputation.

  • Use grouped or sliding-window imputation for very large matrices.

References

Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, Botstein D, Altman RB (2001). Missing value estimation methods for DNA microarrays. Bioinformatics, 17(6), 520-525. doi:10.1093/bioinformatics/17.6.520

Examples

set.seed(123)
obj <- sim_mat(20, 20, perc_col_na = 1)$input
sum(is.na(obj))
#> [1] 40

# Select `k` with `tune_imp()`.
result <- knn_imp(obj, k = 10, .progress = FALSE)
result
#> Method: KNN imputation
#> Dimensions: 20 x 20
#> 
#>           feature1  feature2  feature3  feature4  feature5  feature6
#> sample1 0.08885928 0.0946424 0.5010982 0.2257198 0.3310293 0.3919172
#> sample2 0.35087647 0.2569208 0.4441198 0.3953534 0.5894579 0.2589739
#> sample3 0.56864697 0.4021824 0.7354948 0.6422099 0.8454413 0.6652821
#> sample4 0.30420093 0.7886995 0.3732225 0.5291319 0.5289648 0.4384578
#> sample5 0.34030847 0.6095144 0.3741498 0.3364915 0.4772101 0.8290134
#> sample6 0.45667473 0.4614949 0.8676809 0.7274044 0.9167450 0.6643710
#> # Showing 6 of 20 rows and 6 of 20 columns