Tune method-specific hyperparameters by repeatedly masking observed values, imputing them, and comparing the imputed values with the original values.
Usage
tune_imp(
  obj,
  parameters = NULL,
  .f,
  na_loc = NULL,
  num_na = NULL,
  n_reps = 1,
  n_cols = NULL,
  n_rows = 2,
  rowmax = 0.9,
  colmax = 0.9,
  na_col_subset = NULL,
  max_attempts = 100,
  .progress = TRUE,
  cores = 1,
  location = NULL,
  pin_blas = FALSE
)
Arguments
- obj
A numeric matrix.
- parameters
A data.frame specifying parameter combinations to tune. Each column should be a parameter accepted by .f, excluding obj. List-columns are supported for complex parameters. Duplicate rows are removed. NULL is treated as a single parameter set with no additional arguments, which is useful for functions whose required arguments all have defaults.
- .f
One of "knn_imp", "pca_imp", or "slide_imp", or a custom imputation function.
- na_loc
Optional predefined missing-value locations. Accepted formats are a two-column integer matrix of row and column indices, a numeric vector of linear positions, or a list whose elements are either of those formats.
- num_na
Integer or NULL. Total number of missing values to inject per repetition. If supplied, n_cols is derived from num_na and n_rows, and missing values are distributed as evenly as possible across columns.
- n_reps
Integer. Number of independent repetitions.
- n_cols
Integer or NULL. Must be supplied when both num_na and na_loc are NULL, unless the automatic default applies.
- n_rows
Integer. Target number of missing values to inject per selected column.
- rowmax
Numeric scalar between 0 and 1. Maximum allowed missing-data proportion per row after injection.
- colmax
Numeric scalar between 0 and 1. Maximum allowed missing-data proportion per column after injection.
- na_col_subset
Optional integer or character vector restricting which columns are eligible for missing-value injection.
- max_attempts
Integer. Maximum number of resampling attempts per repetition before giving up.
- .progress
Logical. If TRUE, show progress during tuning.
- cores
Integer. Number of cores to use for K-NN and sliding-window K-NN imputation. For other methods, use mirai::daemons().
- location
Numeric vector of column locations. Required when .f = "slide_imp".
- pin_blas
Logical. If TRUE, pin BLAS threads to 1 during parallel tuning to reduce thread contention.
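The na_loc formats described above can be illustrated with base R indexing alone. The following is a minimal sketch, not output from tune_imp(): the matrix m and the chosen cells are made up for illustration, and it assumes that "linear positions" follow R's usual column-major order.

```r
# A small matrix to index into (values are arbitrary).
m <- matrix(seq_len(20), nrow = 4, ncol = 5)

# Format 1: two-column integer matrix of (row, column) indices.
idx <- cbind(row = c(1L, 3L, 4L), col = c(2L, 2L, 5L))

# Format 2: linear positions for the same cells
# (assumes column-major order, as in ordinary R matrix indexing).
lin <- (idx[, "col"] - 1L) * nrow(m) + idx[, "row"]

# Both forms address the same cells of the matrix.
same_cells <- identical(m[idx], m[lin])

# Format 3: a list bundling several location sets in either format.
idx2 <- cbind(row = c(2L, 2L), col = c(1L, 3L))
na_loc_list <- list(idx, idx2)
```

A two-column matrix is usually easier to read and audit, while linear positions are convenient when the locations come from which(is.na(x)).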
Value
A data frame of class slideimp_tune containing:
- the columns originally provided in parameters;
- param_set, an integer ID for each unique parameter combination;
- rep_id, an integer repetition index;
- result, a list-column where each element is a data frame containing truth and estimate columns;
- error, a character column containing the error message if the iteration failed, otherwise NA.
Details
Built-in methods can be selected by passing .f = "knn_imp",
.f = "pca_imp", or .f = "slide_imp". A custom function can also be
supplied. Custom functions must accept obj as their first argument and
return a numeric matrix with the same dimensions as obj.
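As an illustration of the custom-function contract, here is a hedged sketch of a column-mean imputer. The name col_mean_imp and its fallback argument are hypothetical and not part of the package; the only requirements taken from the text above are that obj comes first and that a numeric matrix of the same dimensions is returned.

```r
# Hypothetical custom imputer: replace each NA with its column mean.
# `fallback` is an illustrative tuning parameter, not a package argument.
col_mean_imp <- function(obj, fallback = 0) {
  for (j in seq_len(ncol(obj))) {
    miss <- is.na(obj[, j])
    if (any(miss)) {
      # Use the constant `fallback` when the whole column is missing.
      fill <- if (all(miss)) fallback else mean(obj[!miss, j])
      obj[miss, j] <- fill
    }
  }
  obj
}

# Quick contract check on a toy matrix.
x <- matrix(c(1, NA, 3, 4, 5, NA), nrow = 3)
out <- col_mean_imp(x)
contract_ok <- is.numeric(out) && identical(dim(out), dim(x)) && !anyNA(out)
```

Under these assumptions, such a function could be passed as .f = col_mean_imp with parameters = data.frame(fallback = 0), mirroring the custom_fill example in the Examples section.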
When na_loc is supplied, num_na, n_cols, n_rows, and na_col_subset
are ignored.
When .f is a character string, columns in parameters are validated
against the selected method:
- "knn_imp" requires k.
- "pca_imp" requires ncp.
- "slide_imp" requires window_size, overlap_size, and min_window_n, plus exactly one of k or ncp.
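A parameter grid that satisfies the "slide_imp" rule can be built with base R's expand.grid(). This is only a sketch: the candidate values below are hypothetical placeholders, and sensible ranges depend on the data.

```r
# Hypothetical candidate values; choose ranges suited to your data.
params_slide <- expand.grid(
  window_size  = c(50, 100),
  overlap_size = c(5, 10),
  min_window_n = 10,
  k            = c(3, 5)  # K-NN variant: include `k` and omit `ncp`
)

# One row per combination: 2 * 2 * 1 * 2 = 8 parameter sets.
n_sets <- nrow(params_slide)

# Exactly one of `k` or `ncp` is present, as the validation rule requires.
has_one <- xor("k" %in% names(params_slide), "ncp" %in% names(params_slide))
```

To tune the PCA variant instead, the k column would be replaced by an ncp column rather than adding both.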
To tune parameters for grouped imputation, tune knn_imp() or pca_imp()
on representative groups, then pass the selected parameters to group_imp().
The top-level rowmax and colmax arguments control random missing-value
injection performed by sample_na_loc(). To tune or pass an imputation
method's own colmax argument, include a colmax column in parameters.
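To make the meaning of the top-level rowmax and colmax limits concrete, the sketch below checks missing-data proportions on a toy matrix after some cells are masked. It is not the sample_na_loc() implementation, and it assumes the limits are inclusive, which the text above does not state.

```r
set.seed(1)
m <- matrix(rnorm(40), nrow = 5, ncol = 8)

# Mask three cells: (1, 1), (1, 2), and (2, 1).
m[cbind(c(1, 1, 2), c(1, 2, 1))] <- NA

# Missing-data proportion per row and per column after masking.
row_prop <- rowMeans(is.na(m))
col_prop <- colMeans(is.na(m))

# Assumed acceptance rule: every row and column stays within its limit.
rowmax <- 0.9
colmax <- 0.9
within_limits <- all(row_prop <= rowmax) && all(col_prop <= colmax)
```

Here row 1 has 2 of 8 values missing and column 1 has 2 of 5, so both limits are met comfortably.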
Tuning results can be summarized with compute_metrics() or evaluated with
external packages such as yardstick.
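Because each element of result holds truth and estimate columns, error metrics can also be computed by hand. The sketch below uses a mock table with invented values in place of real tune_imp() output; only the column names come from the Value section.

```r
# Mock stand-in for tuning output (values invented for illustration).
mock <- data.frame(k = c(2, 4), param_set = 1:2, rep_id = c(1L, 1L))
mock$result <- list(
  data.frame(truth = c(0.2, 0.5, 0.9), estimate = c(0.3, 0.5, 0.7)),
  data.frame(truth = c(0.2, 0.5, 0.9), estimate = c(0.2, 0.6, 0.9))
)

# Per-row error metrics computed from each truth/estimate data frame.
rmse <- function(df) sqrt(mean((df$estimate - df$truth)^2))
mae  <- function(df) mean(abs(df$estimate - df$truth))
mock$rmse <- vapply(mock$result, rmse, numeric(1))
mock$mae  <- vapply(mock$result, mae, numeric(1))

# Parameter set with the lowest RMSE.
best <- mock$param_set[which.min(mock$rmse)]
```

With several repetitions per parameter set, the metrics would typically be averaged over rep_id before choosing a winner.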
Parallelization
- K-NN: use the cores argument. If mirai daemons are active, cores is automatically set to 1 to avoid nested parallelism.
- PCA: use mirai::daemons() instead of cores.
When running PCA imputation in parallel with mirai, set pin_blas = TRUE
in tune_imp() or group_imp() to prevent BLAS threads from
oversubscribing CPU cores. This relies on RhpcBLASctl and works with
OpenBLAS and MKL (typical on Linux, and on Windows after an OpenBLAS swap).
pin_blas = TRUE may have no effect on macOS.
PCA Performance tips
Speed comes from three levers: solver (through LOBPCG with warm-start),
threshold, and scale. Tune these first, then accuracy parameters
(ncp, coeff.ridge) on a representative subset.
Exact vs. LOBPCG with warm-start. Whether "lobpcg" beats "exact"
depends on size and low-rankness: "lobpcg" is preferred for large, approximately
low-rank matrices with small ncp, and "exact" for small matrices
(including slide_imp() windows), where it is faster and more robust.
Separately, the warm-start makes each successive solve cheap: pca_imp()
warm-starts LOBPCG with the previous eigenblock and search direction, so once
imputed values stabilize, later solves converge in a few iterations. The
payoff therefore grows with the number of EM iterations, independent of
low-rankness. solver = "auto" (default) probes both and is a safe start.
Threshold. The default 1e-6 is conservative; 1e-5 is often faster
with very similar values.
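The effect of the threshold can be seen in a generic iterative PCA imputation loop. The function below is a base R sketch written for illustration only: em_pca_sketch, its convergence rule (mean squared change of the imputed cells), and the simulated data are all assumptions, and it does not reproduce pca_imp() or its solvers.

```r
# Sketch of iterative (EM-style) PCA imputation; NOT the pca_imp() algorithm.
em_pca_sketch <- function(x, ncp = 2, threshold = 1e-6, maxiter = 1000) {
  miss <- is.na(x)
  # Initialize missing cells with their column means.
  cm <- colMeans(x, na.rm = TRUE)
  x[miss] <- cm[col(x)[miss]]
  for (iter in seq_len(maxiter)) {
    mu <- colMeans(x)
    s <- svd(sweep(x, 2, mu), nu = ncp, nv = ncp)
    # Rank-ncp reconstruction, then add the column means back.
    fit <- s$u %*% (s$d[seq_len(ncp)] * t(s$v))
    fit <- sweep(fit, 2, mu, "+")
    # Assumed convergence measure: mean squared change of imputed cells.
    change <- mean((x[miss] - fit[miss])^2)
    x[miss] <- fit[miss]
    if (change < threshold) break
  }
  list(imputed = x, iterations = iter)
}

# Simulated low-rank data with 30 masked cells.
set.seed(42)
x <- matrix(rnorm(60 * 2), 60, 2) %*% matrix(rnorm(2 * 10), 2, 10) +
  matrix(rnorm(600, sd = 0.1), 60, 10)
x[sample(length(x), 30)] <- NA

loose <- em_pca_sketch(x, threshold = 1e-4)
tight <- em_pca_sketch(x, threshold = 1e-8)
```

In this sketch a looser threshold can never need more iterations than a tighter one, since both runs follow the same sequence of updates; comparing loose$imputed with tight$imputed shows how much the extra iterations change the values.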
Scale. For columns on a common scale (e.g., DNAm beta values in
[0, 1]), scale = FALSE can be faster and more accurate.
Parallel and BLAS. In parallel via tune_imp() or group_imp() with a
multithreaded BLAS, set pin_blas = TRUE to avoid thread oversubscription.
On Windows, the stock BLAS can be slow. Advanced users can swap in
OpenBLAS.
See Speeding up PCA imputation for the full workflow.
Examples
set.seed(123)
# Simulate some data
obj <- sim_mat(10, 50)$input
# Tune K-NN imputation with random missing-value injection.
# Use larger `num_na` and `n_reps` values for real analyses.
params_knn <- data.frame(k = c(2, 4))
results <- tune_imp(
  obj,
  params_knn,
  .f = "knn_imp",
  n_reps = 1,
  num_na = 10,
  .progress = FALSE
)
#> Tuning `knn_imp()`
#> Step 1/2: Resolving NA locations
#> Running mode: sequential
#> Step 2/2: Tuning
compute_metrics(results)
#> k .progress param_set rep_id error n n_miss .metric .estimator .estimate
#> 1 2 FALSE 1 1 <NA> 10 0 mae standard 0.2075860
#> 2 2 FALSE 1 1 <NA> 10 0 rmse standard 0.2841912
#> 3 4 FALSE 2 1 <NA> 10 0 mae standard 0.1829599
#> 4 4 FALSE 2 1 <NA> 10 0 rmse standard 0.2202581
# Tune with fixed missing-value positions
na_positions <- list(
  matrix(c(1, 2, 3, 1, 1, 1), ncol = 2),
  matrix(c(2, 3, 4, 2, 2, 2), ncol = 2)
)
results_fixed <- tune_imp(
  obj,
  data.frame(k = 2),
  .f = "knn_imp",
  na_loc = na_positions,
  .progress = FALSE
)
#> Tuning `knn_imp()`
#> Step 1/2: Resolving NA locations
#> Running mode: sequential
#> Step 2/2: Tuning
# Custom imputation function
custom_fill <- function(obj, val = 0) {
  obj[is.na(obj)] <- val
  obj
}
tune_imp(
  obj,
  data.frame(val = c(0, 1)),
  .f = custom_fill,
  num_na = 10,
  .progress = FALSE
)
#> Tuning custom function
#> Step 1/2: Resolving NA locations
#> Running mode: sequential
#> Step 2/2: Tuning
#> # slideimp table: 2 x 5
#> val param_set rep_id result error
#> 0 1 1 <df [10 x 2]> <NA>
#> 1 2 1 <df [10 x 2]> <NA>
if (FALSE) { # interactive() && requireNamespace("mirai", quietly = TRUE)
  # Parallel tuning with mirai
  mirai::daemons(2)
  parameters_custom <- data.frame(mean = c(0, 1), sd = c(1, 1))
  custom_imp <- function(obj, mean, sd) {
    na_pos <- is.na(obj)
    obj[na_pos] <- stats::rnorm(sum(na_pos), mean = mean, sd = sd)
    obj
  }
  results_p <- tune_imp(
    obj,
    parameters_custom,
    .f = custom_imp,
    n_reps = 1,
    num_na = 10,
    .progress = FALSE
  )
  mirai::daemons(0)
}