
Introduction

  • I wrote this package because I want to wrangle data in R, e.g., with data.table, tidymodels, the tidyverse, spline basis function generation with mgcv/gratia, and visualization with ggplot2, to name a few.
    • Importantly, R offers powerful imputation methods such as mice and missForest.
  • Meanwhile, the ML and DL ecosystem in Python is arguably more mature, and many of the latest updates and research papers only have Python implementations.
    • I find that fitting a model in Python can be faster and consume less memory than the equivalent model in R, possibly due to numpy and bleeding-edge implementations.
  • rpwf allows wrangling data in R while having access to the latest and greatest ML packages in Python.
  • In Kaggle competitions, it is usually feature engineering that gives a competitor an edge. rpwf helps by
    • Enabling the use of the workflowsets framework. Workflow sets act as pre-defined experiments that test how engineered features affect a model’s predictive performance.
    • Allowing seamless deployment on HPC clusters or cloud computers, because pathing and uploading of results are handled by (a) SQLite database(s) and the pins package.

Demonstration

  • Let’s demonstrate using the sim_classification() function from modeldata.
set.seed(123)
tmp_dir <- tempdir() # Temp folder
df <- modeldata::sim_classification(num_samples = 500, num_linear = 40)
df[1:6, 1:6]
#> # A tibble: 6 × 6
#>   class   two_factor_1 two_factor_2 non_linear_1 non_linear_2 non_linear_3
#>   <fct>          <dbl>        <dbl>        <dbl>        <dbl>        <dbl>
#> 1 class_2       -0.364       -1.08       -0.681         0.854        0.206
#> 2 class_2        0.292       -0.884      -0.711         0.666        0.943
#> 3 class_1        1.39         2.61       -0.702         0.733        0.379
#> 4 class_2       -0.354        0.535       0.0289        0.315        0.626
#> 5 class_2        1.06        -0.727      -0.0143        0.668        0.184
#> 6 class_1        2.26         2.15        0.233         0.464        0.659
  • The classes are roughly balanced.
table(df$class)
#> 
#> class_1 class_2 
#>     228     272
  • There’s no missing data.
sum(is.na(df))
#> [1] 0
  • Re-code class, our response variable, as a 0/1 integer.
df$class <- as.integer(df$class) - 1L
  • Add an optional id column
df$id <- seq_len(nrow(df))

Initialize a database

board <- board_temp() # a temporary pins board
db_con <- rpwf_connect_db(paste(tmp_dir, "db.SQLite", sep = "/"), board)
  • db_con is an object that holds a DBI connection to the SQLite database that was just created. Access it with db_con$con; standard DBI functions work on it (see the sketch below).
db_con$con
#   Path: <path>/rpwfDb/db.SQLite
#   Extensions: TRUE
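  • Because db_con$con is a regular DBI connection, standard DBI functions work on it; for example:
# List the tables that rpwf created in the SQLite database
DBI::dbListTables(db_con$con)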

Define models with parsnip

  • Identical to {parsnip}: first choose a model, e.g., boost_tree(), then choose the R engine with set_engine() and classification or regression with set_mode().
  • Then, pipe the object into the set_py_engine() function.
  • set_py_engine() has 3 important parameters:
    • py_module and py_base_learner define how to import a base learner in a Python script.
    • tag is an optional parameter that’s helpful for keeping track of models.
    • Arguments for the base learner in Python can be passed to ... of set_py_engine().
  • Check the available models with rpwf_avail_models() and add another model with rpwf_add_py_model() (a hedged sketch follows the table below).
rpwf_avail_models(db_con) |> head()
#>              py_module    py_base_learner r_engine     model_mode
#> 1 sklearn.linear_model LogisticRegression   glmnet classification
#> 2 sklearn.linear_model         ElasticNet   glmnet     regression
#> 3          sklearn.svm                SVC  kernlab classification
#> 4          sklearn.svm                SVR  kernlab     regression
#> 5              xgboost      XGBClassifier  xgboost classification
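  • As a hedged sketch only (the argument names below are guesses based on the columns of rpwf_avail_models(); check ?rpwf_add_py_model for the actual signature), registering an extra base learner might look like:
# Hypothetical call -- argument names mirror the rpwf_avail_models() columns
# and are assumptions, not the documented interface
rpwf_add_py_model(
  db_con,
  py_module = "sklearn.ensemble",
  py_base_learner = "RandomForestClassifier",
  r_engine = "ranger",
  model_mode = "classification"
)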

xgboost

  • I fix n_estimators at 50 and tune the learning rate. The other arguments are described in the xgboost docs.
  • To do this, I pass the argument n_estimators = 50 to set_py_engine().
  • I am going to tune 6 hyper parameters by giving them the tune() placeholder, just like in {parsnip}.
# This model is equivalent to the following Python code:

# from xgboost import XGBClassifier
#
# base_learner = XGBClassifier(
#   eval_metric = "logloss",
#   n_estimators = 50,
#   use_label_encoder = False,
#   verbosity = 0,
#   silent = True
# )

xgb_model <- boost_tree(
  mtry = tune(),
  min_n = tune(),
  tree_depth = tune(),
  learn_rate = tune(),
  loss_reduction = tune(),
  sample_size = tune()
) |>
  set_engine("xgboost") |>
  set_mode("classification") |>
  set_py_engine(
    "xgboost",
    "XGBClassifier",
    rpwf_model_tag = "xgboost",
    eval_metric = "logloss",
    n_estimators = 50,
    use_label_encoder = FALSE,
    verbosity = 0,
    silent = TRUE
  )

svm

  • From the sklearn.svm.SVC docs, the parameter cache_size can speed up model fitting if memory is available. I will increase it from the default value. This is an example of a useful setting that becomes available when fitting models in Python.
    • In this example, cache_size won’t reduce fit time because the data is small.
  • Let’s set up a radial basis kernel svm model.
    • I have to fix the kernel parameter to "rbf".
    • This is because tidymodels defines svm_poly() and svm_rbf() separately for polynomial basis and radial basis svm, while sklearn.svm.SVC covers both via the kernel parameter.
svm_rbf_model <- svm_rbf(
  cost = tune(),
  rbf_sigma = tune()
) |>
  set_engine("kernlab") |>
  set_mode("classification") |>
  set_py_engine(
    "sklearn.svm",
    "SVC",
    rpwf_model_tag = "svm_rbf",
    kernel = "rbf", # fix kernel parameter = "rbf"
    cache_size = 500
  )

glm

  • Let’s also fit an elastic net model.
enet_model <- logistic_reg(
  penalty = tune(),
  mixture = tune()
) |>
  set_engine("glmnet") |>
  set_mode("classification") |>
  set_py_engine(
    "sklearn.linear_model",
    "LogisticRegression",
    rpwf_model_tag = "glmnet",
    solver = "saga",
    penalty = "elasticnet",
    max_iter = 1000
  )

Hyper parameter tuning

  • The dials package provides
    1. sensible hyper parameter ranges
    2. functions that go beyond the random grid and regular grid, such as dials::grid_max_entropy() and dials::grid_latin_hypercube().
  • dials::grid_latin_hypercube() is helpful for models with many hyper parameters, such as xgboost. For svm_rbf_model, tuning just 2 hyper parameters on a 2-D grid with dials::grid_regular() provides sufficient coverage of the hyper parameter space at an acceptable speed (see the sketch after this list).
  • Updating the range of the hyper parameter space works just like in dials. Provide the tuning functions (or create new ones) to the hyper_par_fun parameter.
  • Model-specific tuning grids can be added at this step with set_r_grid().
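  • A quick standalone illustration of the difference (plain dials usage, independent of rpwf):
# A regular grid crosses every level of every parameter, so it grows
# multiplicatively; a latin hypercube samples the space at a fixed size
library(dials)
nrow(grid_regular(penalty(), mixture(), levels = 5)) # 5 * 5 = 25 rows
nrow(grid_latin_hypercube(penalty(), mixture(), size = 25)) # 25 space-filling rows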

xgboost

  • For the xgb_model, let’s use dials::grid_latin_hypercube().
  • Let’s also limit tree_depth (xgboost’s max_depth). To do this, I add a named list to the hyper_par_fun parameter.
xgb_model <- xgb_model |>
  set_r_grid(
    grid_fun = dials::grid_latin_hypercube,
    hyper_par_fun = list(tree_depth = dials::tree_depth(range = c(2L, 5L))),
    size = 100
  )

svm

  • For the svm_rbf_model, let’s use a 2D regular grid.
svm_rbf_model <- svm_rbf_model |>
  set_r_grid(dials::grid_regular, levels = 10)

glm

  • Let’s also use a 2D regular grid for the enet model. However, I changed the range of the penalty to allow for greater regularization strength, because scikit-learn’s logistic regression uses the reciprocal of the penalty (see the arithmetic sketch after the code).
enet_model <- enet_model |>
  set_r_grid(dials::grid_regular,
    list(penalty = dials::penalty(range = c(-8, 0.5))),
    levels = 10
  )
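  • A quick sanity check of that range: dials::penalty() works on the log10 scale, and (as discussed under rpwf_augment() below) scikit-learn uses the reciprocal, C = 1 / penalty.
# With range = c(-8, 0.5), penalty spans 10^-8 .. 10^0.5; raising the upper
# bound above 0 is what lets C drop below 1, i.e., stronger regularization
pen <- 10^c(-8, 0.5)
1 / pen # 1e+08 and ~0.316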

Define transformation pipelines with recipes

  • Recipes are defined as usual, and the standard recipes tooling still applies (see the sketch after the code below).
    • Use the formula or the role interface to specify the response and predictors.
    • The base recipe is used to gauge the baseline performance of each model.
    • The pca recipe is used to de-correlate the variables. step_pca() conveniently provides a threshold parameter to retain an arbitrary proportion of the variance explained. I choose 95%.
  • rpwf reserves one optional special role that can be used with the update_role() function:
    • pd.index marks a column for conversion into a pandas index in Python.
    • Below, the column df$id will become the pandas.DataFrame index.
  • Pipe a recipe into rpwf_tag_recipe() to add a description to the recipe.
common <- recipe(class ~ ., data = df) |>
  step_mutate(class = as.integer(class)) |>
  update_role(id, new_role = "pd.index")

### xgb recipes
xgb_base_rec <- common |>
  rpwf_tag_recipe("xgb_base")

xgb_pca_rec <- xgb_base_rec |>
  step_normalize(all_numeric_predictors()) |>
  step_pca(all_numeric_predictors(), threshold = .95) |>
  rpwf_tag_recipe("xgb_pca")

### glm and svm recipes
scaled_base_rec <- common |>
  step_normalize(all_numeric_predictors()) |>
  rpwf_tag_recipe("scaled_base")

scaled_pca_rec <- scaled_base_rec |>
  step_pca(all_numeric_predictors(), threshold = .95) |>
  rpwf_tag_recipe("scaled_pca")
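  • As a quick sanity check (assuming rpwf_tag_recipe() returns an ordinary recipe object), the usual recipes tooling still applies:
# Prep and bake the pca recipe to preview the columns Python will receive
scaled_pca_rec |>
  recipes::prep() |>
  recipes::bake(new_data = NULL) |>
  head()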

Create workflowsets

rpwf_workflow_set()

  • The function rpwf_workflow_set() mimics workflowsets::workflow_set(). It creates all combinations of the provided recipes and models. The result is a plain data.frame, so one can manipulate it as usual, e.g., filtering out redundant workflows (see the sketch after the code below).
  • One workflow_set for xgboost and one for svm and glm are created, then combined into one final workflow_set with rbind().
  • The cost parameter specifies which measure of predictive performance to optimize. Look up the accepted values in the scikit-learn docs. Custom cost functions are possible but would require coding on the Python side.
### xgboost workflow_set
xgb_wfs <- rpwf_workflow_set(
  preprocs = list(xgb_base_rec, xgb_pca_rec),
  models = list(xgb_model),
  cost = "roc_auc"
)

### svm and glm workflow_set
svm_glm_wfs <- rpwf_workflow_set(
  preprocs = list(scaled_base_rec, scaled_pca_rec),
  models = list(svm_rbf_model, enet_model),
  cost = "roc_auc"
)

### combined workflow_set
all_wfs <- rbind(xgb_wfs, svm_glm_wfs)
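  • As an example of such filtering (a sketch only; the model_tag and recipe_tag columns are assumed to be present at this stage, as they are in the rpwf_augment() output below), a redundant combination could be dropped with:
# Hypothetical filter -- inspect names(all_wfs) first to confirm the columns
all_wfs |>
  dplyr::filter(!(model_tag == "glmnet" & recipe_tag == "scaled_pca"))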

rpwf_augment()

  • rpwf_augment() is a wrapper function for many tasks. Most importantly, it generates the hyper parameter grids in R and transforms them to be compatible with sklearn’s API. For example, the following conversions are performed (a sketch of the idea follows this list):
    • mtry, a positive integer in R, is converted into sklearn’s colsample_bytree, a positive proportion.
    • The penalty in R becomes its reciprocal, C, in sklearn’s logistic regression. This is why, to obtain C < 1 (stronger regularization), the upper bound of the penalty range is raised above 0 on the log10 scale, allowing penalty values > 1, which would usually be too large in R.
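  • A sketch of the mtry conversion idea (illustrative arithmetic only, not rpwf’s internal code):
# mtry counts sampled predictors in R; xgboost's sklearn API wants a proportion
n_predictors <- 45 # assumption: number of predictors in the simulated data
mtry <- 22 # an example value drawn by dials
mtry / n_predictors # -> colsample_bytree of ~0.49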
all_wfs <- rpwf_augment(all_wfs, db_con)
all_wfs |>
  dplyr::select(model_tag, recipe_tag, costs)
#> # A tibble: 6 × 3
#>   model_tag recipe_tag  costs  
#>   <chr>     <chr>       <chr>  
#> 1 xgboost   xgb_base    roc_auc
#> 2 xgboost   xgb_pca     roc_auc
#> 3 svm_rbf   scaled_base roc_auc
#> 4 glmnet    scaled_base roc_auc
#> 5 svm_rbf   scaled_pca  roc_auc
#> 6 glmnet    scaled_pca  roc_auc
  • Checking the generated grids, we can see that the hyper parameters have been renamed to conform to the scikit-learn API.
sapply(unique(all_wfs$grids), head)
#> [[1]]
#> # A tibble: 6 × 6
#>   colsample_bytree min_child_weight max_depth learning_rate        gamma subsa…¹
#>              <dbl>            <int>     <int>         <dbl>        <dbl>   <dbl>
#> 1            0.533                7         5       0.00437      2.51e-2   0.478
#> 2            0.244               31         4       0.191        1.69e-9   0.835
#> 3            0.511                7         4       0.260        2.23e-8   0.996
#> 4            0.733               20         2       0.0257       5.94e-4   0.225
#> 5            0.711               32         4       0.00658      1.46e+0   0.309
#> 6            0.911                3         4       0.00111      9.41e-7   0.642
#> # … with abbreviated variable name ¹​subsample
#> 
#> [[2]]
#> # A tibble: 6 × 2
#>         C   gamma
#>     <dbl>   <dbl>
#> 1 1024    0.00862
#> 2  323.   0.00862
#> 3  102.   0.00862
#> 4   32    0.00862
#> 5   10.1  0.00862
#> 6    3.17 0.00862
#> 
#> [[3]]
#> # A tibble: 6 × 2
#>            C l1_ratio
#>        <dbl>    <dbl>
#> 1 100000000      0.05
#> 2  11364637.     0.05
#> 3   1291550.     0.05
#> 4    146780.     0.05
#> 5     16681.     0.05
#> 6      1896.     0.05
  • Here are the dimensions of the grids. Checking these is always a good idea, to make sure we didn’t accidentally create a grid that’s too big.
sapply(unique(all_wfs$grids), nrow)
#> [1] 100 100 100

Export data as parquet files and add them to the database

  • There are two types of parquet files: 1) hyper parameter grids, and 2) train/test data.
  • rpwf_write_grid() and rpwf_write_df() write the parquet files.
  • Because these functions only write a file if it is not already present, running them in parallel on the same database is not recommended.
    • To get around this, one can either work with a manageable number of workflows at a time, or split the work across multiple databases and run the export functions in parallel over the databases.
rpwf_write_grid(all_wfs)
rpwf_write_df(all_wfs)
  • Export the board information as a YAML file.
  • Then finally, export the metadata into the database with rpwf_export_db().
rpwf_write_board_yaml(board, paste(tmp_dir, "board.yml", sep = "/"))
rpwf_export_db(all_wfs, db_con)
#> [1] 6

Run the workflows in Python

  • The rpwf Python package contains scripts that perform the model fitting. They also serve as templates for experimenting further with the data generated in R.
  • For example, to use the nested_resampling.py script, run the following command in the terminal to get the list of arguments
python -m rpwf.script.nested_resampling -h
  • Or the following in a Jupyter notebook cell
%run -m rpwf.script.nested_resampling -h
  • The following command displays the workflows we exported.
%run -m rpwf.script.nested_resampling $<path to the db> -b $<path to board yaml> -s
  • The positional argument is the path to the database.
  • The -b flag is the path to the exported board YAML file associated with this db.
  • The -s flag shows the workflows present in the database.
  • The other important flags are
    • The -a flag runs all the workflows that haven’t been run.
    • The -w flag accepts a list of ids (e.g., 1 2 3 4) to specify which workflows to run.
    • The -f flag forces a run, overwriting the results of a workflow that has already been run.
    • The -c flag sets the number of CPU cores dedicated to the model fitting task.
    • The -icv and -icr flags set the number of CV folds and repeats for the inner loop, which is used for hyper parameter tuning.
    • The -ocv and -ocr flags set the number of CV folds and repeats for the outer loop, which is used for estimating predictive performance.
  • Let’s run the nested CV with 5 folds * 2 repeats for the inner loop and 5 folds * 2 repeats for the outer loop (some back-of-the-envelope fit counts follow the command).
%run -m rpwf.script.nested_resampling $<path to the db> -b $<path to board yaml> \
  -a -c 7 -icv 5 -icr 2 -ocv 5 -ocr 2
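  • Back-of-the-envelope fit counts for that command, assuming one fit per grid row per inner resample plus one refit per outer resample:
# Rough arithmetic only; the actual scripts may fit slightly more or fewer models
outer <- 5 * 2 # outer folds * repeats
inner <- 5 * 2 # inner folds * repeats
grid <- 100 # grid rows per workflow (checked above)
outer * (inner * grid + 1) # about 10,010 fits per workflow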

Visualize the results

  • Results can be imported back into R by passing the db_con object to rpwf_results().
fit_results <- rpwf_results(db_con)
  • We can now manipulate the results with the usual R tools.
fit_results |>
  tidyr::unnest(fit_results) |>
  ggplot(aes(y = roc_auc, x = recipe_tag, color = recipe_tag)) +
  geom_boxplot() +
  geom_jitter() +
  facet_wrap(~model_tag, scales = "free_x") +
  theme_bw()