Tune method specific hyperparameters by repeatedly masking observed values, imputing them, and comparing the imputed values with the original values.
Usage
tune_imp(
obj,
parameters = NULL,
.f,
na_loc = NULL,
num_na = NULL,
n_reps = 1,
n_cols = NULL,
n_rows = 2,
rowmax = 0.9,
colmax = 0.9,
na_col_subset = NULL,
max_attempts = 100,
.progress = TRUE,
cores = 1,
location = NULL,
pin_blas = FALSE
)Arguments
- obj
A numeric matrix.
- parameters
A
data.framespecifying parameter combinations to tune. Each column should be a parameter accepted by.f, excludingobj. List-columns are supported for complex parameters. Duplicate rows are removed.NULLis treated as a single parameter set with no additional arguments, which is useful for functions whose required arguments all have defaults.- .f
One of
"knn_imp","pca_imp", or"slide_imp", or a custom imputation function.- na_loc
Optional predefined missing-value locations. Accepted formats are a two-column integer matrix of row and column indices, a numeric vector of linear positions, or a list whose elements are either of those formats.
- num_na
Integer or
NULL. Total number of missing values to inject per repetition. If supplied,n_colsis derived fromnum_naandn_rows, and missing values are distributed as evenly as possible across columns.- n_reps
Integer. Number of independent repetitions.
- n_cols
Integer or
NULL. Must be supplied when bothnum_naandna_locareNULL, unless the automatic default applies.- n_rows
Integer. Target number of missing values to inject per selected column.
- rowmax
Numeric scalar between
0and1. Maximum allowed missing-data proportion per row after injection.- colmax
Numeric scalar between
0and1. Maximum allowed missing-data proportion per column after injection.- na_col_subset
Optional integer or character vector restricting which columns are eligible for missing-value injection.
- max_attempts
Integer. Maximum number of resampling attempts per repetition before giving up.
- .progress
Logical. If
TRUE, show progress during tuning.- cores
Integer. Number of cores to use for K-NN and sliding-window K-NN imputation. For other methods, use
mirai::daemons().- location
Numeric vector of column locations. Required when
.f = "slide_imp".- pin_blas
Logical. If
TRUE, pin BLAS threads to 1 during parallel tuning to reduce thread contention.
Value
A data frame of class slideimp_tune containing:
columns originally provided in
parameters;param_set, an integer ID for each unique parameter combination;rep_id, an integer repetition index;result, a list-column where each element is a data frame containingtruthandestimatecolumns;error, a character column containing the error message if the iteration failed, otherwiseNA.
Details
Built-in methods can be selected by passing .f = "knn_imp",
.f = "pca_imp", or .f = "slide_imp". A custom function can also be
supplied. Custom functions must accept obj as their first argument and
return a numeric matrix with the same dimensions as obj.
When na_loc is supplied, num_na, n_cols, n_rows, and na_col_subset
are ignored.
When .f is a character string, columns in parameters are validated
against the selected method:
"knn_imp"requiresk."pca_imp"requiresncp."slide_imp"requireswindow_size,overlap_size, andmin_window_n, plus exactly one ofkorncp.
To tune parameters for grouped imputation, tune knn_imp() or pca_imp()
on representative groups, then pass the selected parameters to group_imp().
The top-level rowmax and colmax arguments control random missing-value
injection performed by sample_na_loc(). To tune or pass an imputation
method's own colmax argument, include a colmax column in parameters.
Tuning results can be summarized with compute_metrics() or evaluated with
external packages such as yardstick.
Parallelization
K-NN: use the
coresargument. Ifmiraidaemons are active,coresis automatically set to1to avoid nested parallelism.PCA: use
mirai::daemons()instead ofcores.
When running PCA imputation in parallel with mirai, set pin_blas = TRUE
in tune_imp() or group_imp() to prevent BLAS threads from
oversubscribing CPU cores. This relies on RhpcBLASctl and works with
OpenBLAS and MKL (typical on Linux, and on Windows after an OpenBLAS swap).
pin_blas = TRUE may have no effect on macOS.
Performance tips
pca_imp() relies heavily on linear algebra. On Windows, the default BLAS
shipped with R may be slow for large matrices. Advanced users can replace
it with OpenBLAS.
PCA imputation speed depends on the eigensolver selected by solver and the
convergence threshold threshold. The exact solver is selected with
solver = "exact". The iterative LOBPCG solver is selected with
solver = "lobpcg". The default, solver = "auto", performs a short timed
probe and chooses LOBPCG only when it is clearly faster.
For large or approximately low-rank genomic matrices, it can be useful to
benchmark solver = "exact" against solver = "lobpcg" on a representative
subset, such as chromosome 22, before tuning accuracy-related parameters.
For slide_imp(), this may include window_size and overlap_size.
The default threshold = 1e-6 is conservative. In many genomic datasets,
threshold = 1e-5 can be faster while giving very similar imputed values.
Check this on a representative subset before using the relaxed threshold in a
full analysis.
See the pkgdown article Speeding up PCA imputation for a full workflow.
Examples
set.seed(123)
# Simulate some data
obj <- sim_mat(10, 50)$input
# Tune K-NN imputation with random missing-value injection.
# Use larger `num_na` and `n_reps` values for real analyses.
params_knn <- data.frame(k = c(2, 4))
results <- tune_imp(
obj,
params_knn,
.f = "knn_imp",
n_reps = 1,
num_na = 10,
.progress = FALSE
)
#> Tuning `knn_imp()`
#> Step 1/2: Resolving NA locations
#> Running mode: sequential
#> Step 2/2: Tuning
compute_metrics(results)
#> k .progress param_set rep_id error n n_miss .metric .estimator .estimate
#> 1 2 FALSE 1 1 <NA> 10 0 mae standard 0.2075860
#> 2 2 FALSE 1 1 <NA> 10 0 rmse standard 0.2841912
#> 3 4 FALSE 2 1 <NA> 10 0 mae standard 0.1829599
#> 4 4 FALSE 2 1 <NA> 10 0 rmse standard 0.2202581
# Tune with fixed missing-value positions
na_positions <- list(
matrix(c(1, 2, 3, 1, 1, 1), ncol = 2),
matrix(c(2, 3, 4, 2, 2, 2), ncol = 2)
)
results_fixed <- tune_imp(
obj,
data.frame(k = 2),
.f = "knn_imp",
na_loc = na_positions,
.progress = FALSE
)
#> Tuning `knn_imp()`
#> Step 1/2: Resolving NA locations
#> Running mode: sequential
#> Step 2/2: Tuning
# Custom imputation function
custom_fill <- function(obj, val = 0) {
obj[is.na(obj)] <- val
obj
}
tune_imp(
obj,
data.frame(val = c(0, 1)),
.f = custom_fill,
num_na = 10,
.progress = FALSE
)
#> Tuning custom function
#> Step 1/2: Resolving NA locations
#> Running mode: sequential
#> Step 2/2: Tuning
#> # slideimp table: 2 x 5
#> val param_set rep_id result error
#> 0 1 1 <df [10 x 2]> <NA>
#> 1 2 1 <df [10 x 2]> <NA>
if (FALSE) { # interactive() && requireNamespace("mirai", quietly = TRUE)
# Parallel tuning with mirai
mirai::daemons(2)
parameters_custom <- data.frame(mean = c(0, 1), sd = c(1, 1))
custom_imp <- function(obj, mean, sd) {
na_pos <- is.na(obj)
obj[na_pos] <- stats::rnorm(sum(na_pos), mean = mean, sd = sd)
obj
}
results_p <- tune_imp(
obj,
parameters_custom,
.f = custom_imp,
n_reps = 1,
num_na = 10,
.progress = FALSE
)
mirai::daemons(0)
}