Skip to contents

Tune method specific hyperparameters by repeatedly masking observed values, imputing them, and comparing the imputed values with the original values.

Usage

tune_imp(
  obj,
  parameters = NULL,
  .f,
  na_loc = NULL,
  num_na = NULL,
  n_reps = 1,
  n_cols = NULL,
  n_rows = 2,
  rowmax = 0.9,
  colmax = 0.9,
  na_col_subset = NULL,
  max_attempts = 100,
  .progress = TRUE,
  cores = 1,
  location = NULL,
  pin_blas = FALSE
)

Arguments

obj

A numeric matrix.

parameters

A data.frame specifying parameter combinations to tune. Each column should be a parameter accepted by .f, excluding obj. List-columns are supported for complex parameters. Duplicate rows are removed. NULL is treated as a single parameter set with no additional arguments, which is useful for functions whose required arguments all have defaults.

.f

One of "knn_imp", "pca_imp", or "slide_imp", or a custom imputation function.

na_loc

Optional predefined missing-value locations. Accepted formats are a two-column integer matrix of row and column indices, a numeric vector of linear positions, or a list whose elements are either of those formats.

num_na

Integer or NULL. Total number of missing values to inject per repetition. If supplied, n_cols is derived from num_na and n_rows, and missing values are distributed as evenly as possible across columns.

n_reps

Integer. Number of independent repetitions.

n_cols

Integer or NULL. Must be supplied when both num_na and na_loc are NULL, unless the automatic default applies.

n_rows

Integer. Target number of missing values to inject per selected column.

rowmax

Numeric scalar between 0 and 1. Maximum allowed missing-data proportion per row after injection.

colmax

Numeric scalar between 0 and 1. Maximum allowed missing-data proportion per column after injection.

na_col_subset

Optional integer or character vector restricting which columns are eligible for missing-value injection.

max_attempts

Integer. Maximum number of resampling attempts per repetition before giving up.

.progress

Logical. If TRUE, show progress during tuning.

cores

Integer. Number of cores to use for K-NN and sliding-window K-NN imputation. For other methods, use mirai::daemons().

location

Numeric vector of column locations. Required when .f = "slide_imp".

pin_blas

Logical. If TRUE, pin BLAS threads to 1 during parallel tuning to reduce thread contention.

Value

A data frame of class slideimp_tune containing:

  • columns originally provided in parameters;

  • param_set, an integer ID for each unique parameter combination;

  • rep_id, an integer repetition index;

  • result, a list-column where each element is a data frame containing truth and estimate columns;

  • error, a character column containing the error message if the iteration failed, otherwise NA.

Details

Built-in methods can be selected by passing .f = "knn_imp", .f = "pca_imp", or .f = "slide_imp". A custom function can also be supplied. Custom functions must accept obj as their first argument and return a numeric matrix with the same dimensions as obj.

When na_loc is supplied, num_na, n_cols, n_rows, and na_col_subset are ignored.

When .f is a character string, columns in parameters are validated against the selected method:

  • "knn_imp" requires k.

  • "pca_imp" requires ncp.

  • "slide_imp" requires window_size, overlap_size, and min_window_n, plus exactly one of k or ncp.

To tune parameters for grouped imputation, tune knn_imp() or pca_imp() on representative groups, then pass the selected parameters to group_imp().

The top-level rowmax and colmax arguments control random missing-value injection performed by sample_na_loc(). To tune or pass an imputation method's own colmax argument, include a colmax column in parameters.

Tuning results can be summarized with compute_metrics() or evaluated with external packages such as yardstick.

Parallelization

  • K-NN: use the cores argument. If mirai daemons are active, cores is automatically set to 1 to avoid nested parallelism.

  • PCA: use mirai::daemons() instead of cores.

When running PCA imputation in parallel with mirai, set pin_blas = TRUE in tune_imp() or group_imp() to prevent BLAS threads from oversubscribing CPU cores. This relies on RhpcBLASctl and works with OpenBLAS and MKL (typical on Linux, and on Windows after an OpenBLAS swap). pin_blas = TRUE may have no effect on macOS.

PCA Performance tips

Speed comes from three levers: solver (through LOBPCG with warm-start), threshold, and scale. Tune these first, then accuracy parameters (ncp, coeff.ridge) on a representative subset.

Exact vs. LOBPCG with warm-start. Whether "lobpcg" beats "exact" depends on size and low-rankness: "lobpcg" is preferred for large, approximately low-rank matrices with small ncp, and "exact" for small matrices (including slide_imp() windows), where it is faster and more robust. Separately, the warm-start makes each successive solve cheap: pca_imp() warm-starts LOBPCG with the previous eigenblock and search direction, so once imputed values stabilize, later solves converge in a few iterations. The payoff therefore grows with the number of EM iterations, independent of low-rankness. solver = "auto" (default) probes both and is a safe start.

Threshold. The default 1e-6 is conservative; 1e-5 is often faster with very similar values.

Scale. For columns on a common scale (e.g., DNAm beta values in [0, 1]), scale = FALSE can be faster and more accurate.

Parallel and BLAS. In parallel via tune_imp() or group_imp() with a multithreaded BLAS, set pin_blas = TRUE to avoid thread oversubscription. On Windows, the stock BLAS can be slow. Advanced users can swap in OpenBLAS.

See Speeding up PCA imputation for the full workflow.

Examples

set.seed(123)

# Simulate some data
obj <- sim_mat(10, 50)$input

# Tune K-NN imputation with random missing-value injection.
# Use larger `num_na` and `n_reps` values for real analyses.
params_knn <- data.frame(k = c(2, 4))
results <- tune_imp(
  obj,
  params_knn,
  .f = "knn_imp",
  n_reps = 1,
  num_na = 10,
  .progress = FALSE
)
#> Tuning `knn_imp()`
#> Step 1/2: Resolving NA locations
#> Running mode: sequential
#> Step 2/2: Tuning
compute_metrics(results)
#>   k .progress param_set rep_id error  n n_miss .metric .estimator .estimate
#> 1 2     FALSE         1      1  <NA> 10      0     mae   standard 0.2075860
#> 2 2     FALSE         1      1  <NA> 10      0    rmse   standard 0.2841912
#> 3 4     FALSE         2      1  <NA> 10      0     mae   standard 0.1829599
#> 4 4     FALSE         2      1  <NA> 10      0    rmse   standard 0.2202581

# Tune with fixed missing-value positions
na_positions <- list(
  matrix(c(1, 2, 3, 1, 1, 1), ncol = 2),
  matrix(c(2, 3, 4, 2, 2, 2), ncol = 2)
)

results_fixed <- tune_imp(
  obj,
  data.frame(k = 2),
  .f = "knn_imp",
  na_loc = na_positions,
  .progress = FALSE
)
#> Tuning `knn_imp()`
#> Step 1/2: Resolving NA locations
#> Running mode: sequential
#> Step 2/2: Tuning

# Custom imputation function
custom_fill <- function(obj, val = 0) {
  obj[is.na(obj)] <- val
  obj
}

tune_imp(
  obj,
  data.frame(val = c(0, 1)),
  .f = custom_fill,
  num_na = 10,
  .progress = FALSE
)
#> Tuning custom function
#> Step 1/2: Resolving NA locations
#> Running mode: sequential
#> Step 2/2: Tuning
#> # slideimp table: 2 x 5
#>  val param_set rep_id        result error
#>    0         1      1 <df [10 x 2]>  <NA>
#>    1         2      1 <df [10 x 2]>  <NA>

if (FALSE) { # interactive() && requireNamespace("mirai", quietly = TRUE)
# Parallel tuning with mirai
mirai::daemons(2)

parameters_custom <- data.frame(mean = c(0, 1), sd = c(1, 1))

custom_imp <- function(obj, mean, sd) {
  na_pos <- is.na(obj)
  obj[na_pos] <- stats::rnorm(sum(na_pos), mean = mean, sd = sd)
  obj
}

results_p <- tune_imp(
  obj,
  parameters_custom,
  .f = custom_imp,
  n_reps = 1,
  num_na = 10,
  .progress = FALSE
)

mirai::daemons(0)
}