Speeding up PCA imputation
Source: vignettes/articles/speeding-up-pca-imputation.Rmd

Tips to speed up PCA imputation
pca_imp() is the PCA imputation workhorse used by
slide_imp(), tune_imp(), and
group_imp() whenever PCA imputation is requested. This
article describes how to make PCA imputation faster by choosing the
convergence threshold, the PCA eigensolver, and a parallel backend with
mirai.
Overview
Most users can keep the defaults:
```r
set.seed(1234)
sim <- sim_mat(
  n = 80,
  p = 1500,
  rho = 0.7,
  perc_total_na = 0.10,
  perc_col_na = 0.8
)
sim_df <- sim$col_group
obj <- sim$input

pca_imp(obj, ncp = 2)
#> Method: PCA imputation
#> Dimensions: 80 x 1500
#>
#>           feature1  feature2   feature3  feature4  feature5  feature6
#> sample1 0.25105224 0.2690159 0.22175022 0.3808861 0.1697869 0.3006772
#> sample2 0.46488252 0.5570005 0.56877205 0.6171557 0.5943715 0.6692621
#> sample3 0.46781185 0.6658607 0.53191864 0.6945785 0.6875948 0.5242096
#> sample4 0.07870941 0.1605554 0.07784548 0.2108805 0.0000000 0.2506738
#> sample5 0.52579469 0.6477465 0.63987522 0.6117717 0.6680913 0.6045150
#> sample6 0.58204683 0.6815553 0.59358698 0.5741023 0.6504755 0.8734424
#> # Showing 6 of 80 rows and 6 of 1500 columns
```

However, for large genomic matrices, runtime can depend strongly on two arguments:
- `threshold`: the convergence threshold for the PCA imputation algorithm.
  - The default is `threshold = 1e-6`, matching the conservative behavior of missMDA.
  - For many genomic matrices, `threshold = 1e-5` can be much faster while producing very similar imputed values.
- `solver`: the eigensolver used inside `pca_imp()`.
  - `solver = "auto"` runs a short timed probe of both solvers and picks the faster one.
  - `solver = "exact"` forces the exact solver.
  - `solver = "lobpcg"` forces the iterative LOBPCG solver.
Exact versus iterative PCA solvers
pca_imp() supports two PCA eigensolvers:
- The exact solver, selected by `solver = "exact"`.
- The iterative LOBPCG solver, selected by `solver = "lobpcg"`.
The exact solver is robust and often fast for small to moderate matrices. The iterative LOBPCG solver can be faster for large or approximately low-rank genomic matrices, especially when only a small number of principal components is needed.
The default solver = "auto" runs a small number of probe
iterations of each solver at the start of the run, measures wall-clock
time, and picks LOBPCG only if it is clearly faster than the exact
solver. This decision is data-dependent and is made per call.
For most users, solver = "auto" is a good default.
However, manual benchmarking is still useful in two situations:
- Before a long `tune_imp()` run, to avoid the cost of repeated probing with `solver = "auto"`.
- When you suspect the auto choice is wrong (for example, very large or very low-rank matrices where LOBPCG should clearly win, or very small windows where the exact solver should clearly win).
Recommended workflow
A practical workflow is:
1. Choose a representative subset of the data, for example chromosome 22.
2. Time the `exact` and `lobpcg` solvers on the subset at a fixed `threshold`.
3. Use `tune_imp()` to check that a relaxed `threshold` does not materially change cross-validation error.
4. Use the chosen `solver` and `threshold` when tuning accuracy-related parameters such as `ncp`, `coeff.ridge`, `window_size`, and `overlap_size`.
We demonstrate this workflow as follows.
Create a representative test matrix
In a real DNA methylation analysis, use a representative chromosome
or subset. Here, we treat group1 of the simulated data as
the representative chromosome.
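The construction of `obj_probe` is sketched below. The group labels live in `sim_df`, but the exact accessor (a `group` column) is an assumption here; adjust it to match your data.

```r
# Build the probe object from the group1 columns of the full matrix.
# Assumes `sim_df$group` holds one group label per column of `obj`
# (hypothetical column name -- adapt to your group data frame).
group1_cols <- which(sim_df$group == "group1")
obj_probe <- obj[, group1_cols, drop = FALSE]
dim(obj_probe)
```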
Time solver and threshold settings
First, choose a small ncp for the speed probe. This does
not need to be the final value of ncp. The goal is to
identify whether the exact or iterative solver is faster on this kind of
data. We use a relaxed threshold = 1e-5 to keep the probe
quick. Both solvers use the same convergence criterion, so their
relative timing should be representative.
```r
ncp_probe <- 5

exact <- system.time(invisible(
  pca_imp(
    obj = obj_probe,
    ncp = ncp_probe,
    solver = "exact",
    threshold = 1e-5,
    na_check = FALSE
  )
))
iterative <- system.time(invisible(
  pca_imp(
    obj = obj_probe,
    ncp = ncp_probe,
    solver = "lobpcg",
    threshold = 1e-5,
    na_check = FALSE
  )
))

exact - iterative
#>    user  system elapsed
#>  -0.006  -0.002  -0.001
```

Second, use `tune_imp()` to compare cross-validation error across `threshold` values. In this example, relaxing the threshold to 1e-5 does not meaningfully change the error.
```r
thresh_df <- data.frame(threshold = c(1e-5, 1e-6), ncp = ncp_probe)
res <- tune_imp(obj = obj_probe, parameters = thresh_df, .f = "pca_imp")
#> Tuning `pca_imp()`
#> Step 1/2: Resolving NA locations
#> ℹ Using default `num_na` = 500 (~0.9% of cells).
#>   Increase for more reliability or decrease if missing is dense.
#> Running mode: sequential
#> Step 2/2: Tuning

compute_metrics(res)
#>   threshold ncp param_set rep_id error   n n_miss .metric .estimator  .estimate
#> 1     1e-05   5         1      1  <NA> 500      0     mae   standard 0.09174255
#> 2     1e-05   5         1      1  <NA> 500      0    rmse   standard 0.11499555
#> 3     1e-06   5         2      1  <NA> 500      0     mae   standard 0.09174508
#> 4     1e-06   5         2      1  <NA> 500      0    rmse   standard 0.11500090
```

If `threshold = 1e-5` changes cross-validation error or imputed values more than expected, use the more conservative `threshold = 1e-6`.
If neither solver is clearly faster on the representative subset,
pick solver = "exact" for robustness. If one solver is
clearly faster, force it explicitly to skip the auto probe overhead.
Use the chosen settings with tune_imp()
After choosing solver and threshold, keep
them fixed while tuning accuracy-related PCA parameters (i.e.,
ncp and coeff.ridge):
```r
chosen_solver <- "exact"
chosen_threshold <- 1e-5

params <- expand.grid(
  ncp = c(2, 4, 6),
  coeff.ridge = c(0.8, 1, 1.2)
)
params$solver <- chosen_solver
params$threshold <- chosen_threshold

tune_pca <- tune_imp(
  obj = obj_probe,
  .f = "pca_imp",
  parameters = params,
  n_reps = 1 # <- increase to a higher number in real analyses
)
#> Tuning `pca_imp()`
#> Step 1/2: Resolving NA locations
#> ℹ Using default `num_na` = 500 (~0.9% of cells).
#>   Increase for more reliability or decrease if missing is dense.
#> Running mode: sequential
#> Step 2/2: Tuning

metrics <- compute_metrics(tune_pca, metrics = "rmse")
aggregate(
  .estimate ~ ncp + coeff.ridge,
  data = metrics,
  FUN = mean
)
#>   ncp coeff.ridge .estimate
#> 1   2         0.8 0.1125669
#> 2   4         0.8 0.1132658
#> 3   6         0.8 0.1148464
#> 4   2         1.0 0.1123355
#> 5   4         1.0 0.1128064
#> 6   6         1.0 0.1138270
#> 7   2         1.2 0.1121501
#> 8   4         1.2 0.1124116
#> 9   6         1.2 0.1130121
```

Use the chosen settings with slide_imp()
The same idea applies when tuning sliding-window PCA imputation.
First choose the speed settings on a representative subset, then include
those fixed settings in the parameters data frame while
tuning window and PCA parameters.
`slide_imp()` typically operates on small windows with few samples, so the `"exact"` solver is usually the better default. With `solver = "auto"`, `slide_imp()` probes only the first window and then uses the chosen solver for the remaining windows.
```r
beta_matrix <- obj_probe # Treat `obj_probe` as a `slide_imp()` input
locations <- seq_len(ncol(beta_matrix))

slide_params <- expand.grid(
  ncp = c(2, 4),
  coeff.ridge = c(0.8, 1, 1.2),
  window_size = c(50, 500) # Adjust based on units of `locations`
)
slide_params$overlap_size <- 5
slide_params$min_window_n <- 20
slide_params$solver <- "exact" # Usually better for `slide_imp()`
slide_params$threshold <- chosen_threshold

tune_slide_pca <- tune_imp(
  obj = beta_matrix,
  .f = "slide_imp",
  parameters = slide_params,
  n_reps = 1, # <- increase to a higher number in real analyses
  location = locations
)
#> Tuning `slide_imp()`
#> Step 1/2: Resolving NA locations
#> ℹ Using default `num_na` = 500 (~0.9% of cells).
#>   Increase for more reliability or decrease if missing is dense.
#> Running mode: sequential
#> Step 2/2: Tuning

slide_metrics <- compute_metrics(tune_slide_pca, metrics = "rmse")
aggregate(
  .estimate ~ ncp + coeff.ridge + window_size,
  data = slide_metrics,
  FUN = mean
)
#>    ncp coeff.ridge window_size .estimate
#> 1    2         0.8          50 0.1218555
#> 2    4         0.8          50 0.1257166
#> 3    2         1.0          50 0.1213016
#> 4    4         1.0          50 0.1242115
#> 5    2         1.2          50 0.1208738
#> 6    4         1.2          50 0.1228126
#> 7    2         0.8         500 0.1170125
#> 8    4         0.8         500 0.1185850
#> 9    2         1.0         500 0.1168271
#> 10   4         1.0         500 0.1179256
#> 11   2         1.2         500 0.1166733
#> 12   4         1.2         500 0.1173506
```

Finally, use the selected speed and accuracy parameters when imputing the full dataset:
```r
slide_results <- slide_imp(
  obj = beta_matrix,
  location = locations,
  window_size = 500,
  overlap_size = 5,
  min_window_n = 20,
  ncp = 2,
  coeff.ridge = 1.2,
  solver = "exact", # Usually better for `slide_imp()`
  threshold = chosen_threshold,
  .progress = FALSE
)
```

Parallel PCA imputation and BLAS threading
PCA imputation can also be sped up by running independent imputation
tasks in parallel with tune_imp() or
group_imp(). For PCA imputation, we use
mirai parallelization rather than the cores
argument.
```r
library(mirai)

# Start 4 mirai workers.
mirai::daemons(4)

# Run PCA tuning in parallel.
tune_pca <- tune_imp(
  obj = obj_probe,
  .f = "pca_imp",
  parameters = params,
  n_reps = 1,
  .progress = FALSE # <- Use `TRUE` to monitor longer-running jobs
)
#> Tuning `pca_imp()`
#> Step 1/2: Resolving NA locations
#> ℹ Using default `num_na` = 500 (~0.9% of cells).
#>   Increase for more reliability or decrease if missing is dense.
#> Running mode: mirai
#> Step 2/2: Tuning
#> Tip: set `pin_blas = TRUE` may improve parallel performance.

# Stop mirai workers when finished.
mirai::daemons(0)
```

When PCA imputation is run in parallel, each worker may call multithreaded linear algebra routines through BLAS. If each worker uses multiple BLAS threads, the total number of active CPU threads can become much larger than the number of physical cores. This is called thread oversubscription, and it can make code slower rather than faster.
For example, 8 mirai workers each using 8 BLAS threads can try to use 64 CPU threads.
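As a sketch of the underlying fix, BLAS threading can also be inspected and capped manually via the RhpcBLASctl package, which is the same package `pin_blas = TRUE` relies on inside each worker:

```r
# Manually inspect and cap BLAS threading with RhpcBLASctl.
if (requireNamespace("RhpcBLASctl", quietly = TRUE)) {
  RhpcBLASctl::blas_get_num_procs()    # how many BLAS threads are available
  RhpcBLASctl::blas_set_num_threads(1) # cap at one BLAS thread per process
}
```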
To avoid this, set pin_blas = TRUE in
tune_imp() or group_imp() when running PCA
imputation with mirai. This requires
RhpcBLASctl:
```r
mirai::daemons(4)

tune_slide_pca <- tune_imp(
  obj = beta_matrix,
  .f = "slide_imp",
  parameters = slide_params,
  n_reps = 1,
  location = locations,
  pin_blas = TRUE # <- Set to `TRUE` if R is linked to multi-threaded BLAS/LAPACK
)
#> Tuning `slide_imp()`
#> Step 1/2: Resolving NA locations
#> ℹ Using default `num_na` = 500 (~0.9% of cells).
#>   Increase for more reliability or decrease if missing is dense.
#> Running mode: mirai
#> Step 2/2: Tuning

mirai::daemons(0)
```

For group-wise PCA imputation:
```r
mirai::daemons(4)

pca_group_results <- group_imp(
  obj = obj,
  group = sim_df,
  ncp = 2,
  solver = chosen_solver,
  threshold = chosen_threshold,
  pin_blas = TRUE # <- Set to `TRUE` if R is linked to multi-threaded BLAS/LAPACK
)
#> Imputing 2 groups using PCA.
#> Running mode: mirai

mirai::daemons(0)
```

`pin_blas = TRUE` uses RhpcBLASctl to limit the number of BLAS threads inside parallel workers. This works with common BLAS libraries such as OpenBLAS and MKL, which are typical on Linux and are also available on Windows after replacing R’s default BLAS. It may have no effect on macOS, depending on the BLAS implementation.
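Whether `pin_blas` can have any effect depends on which BLAS library R is linked against; recent versions of base R can report this directly:

```r
# Report which BLAS/LAPACK libraries this R session is linked against
# (these base R helpers are available in recent R versions).
extSoftVersion()["BLAS"] # path (or name) of the BLAS in use
La_library()             # path of the LAPACK library in use
```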
BLAS/LAPACK swapping for advanced Windows users
On Windows machines, R’s default BLAS/LAPACK is reliable but single-threaded and can be slow. Advanced users can get much faster PCA imputation by replacing R’s default BLAS with an optimized BLAS following this guide: OpenBLAS for R on Windows.
However, OpenBLAS is multithreaded, so for parallel runs using
mirai, set pin_blas = TRUE to avoid thread
oversubscription.
Practical recommendations
- Start with `solver = "auto"` unless PCA imputation is slow or you are about to run many calls, such as with `tune_imp()`, where the per-call probe overhead can add up.
- If PCA imputation is slow, benchmark `solver = "exact"` against `solver = "lobpcg"` on a representative subset and force the faster solver.
- Try `threshold = 1e-5` or a less stringent value, depending on your error tolerance. Keep `threshold = 1e-5` only if it produces similar imputed values or similar cross-validation error.
- LOBPCG is most likely to help when the matrix is large, approximately low-rank, and `ncp` is small.
- For `slide_imp()`, `solver = "exact"` is usually the right choice.
- Choose speed settings first, then tune accuracy-related parameters such as `ncp`, `coeff.ridge`, and `window_size`.
- If running PCA imputation sequentially, BLAS multithreading may help, and `pin_blas` is unnecessary.
- If running many PCA imputations in parallel with mirai, use `pin_blas = TRUE`.
- Avoid using many mirai workers and many BLAS threads at the same time.
- Benchmark on a representative subset, because the best setup depends on the hardware, BLAS library, matrix size, and missing-data pattern.