
Introduction

After generating the data in R and exporting the necessary information, the following code can be used to load the data in Python.

Generate data in R

If you only want to transform your data in R and export it to Python, the pins or reticulate packages work fine on their own. Here is how to do the same with rpwf, which wraps pins.

library(rpwf)
library(pins)
library(recipes)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#> 
#>     step
tmp_dir <- tempdir() # Temp folder
board <- board_temp()
db_con <- rpwf_connect_db(paste(tmp_dir, "db.SQLite", sep = "/"), board)
r <- recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors()) |>
  rpwf_tag_recipe("r")

r1 <- r |>
  step_YeoJohnson(all_numeric_predictors()) |>
  rpwf_tag_recipe("r1")

d <- rpwf_data_set(r, r1, db_con = db_con)
#> No pandas idx added. Use update_roles() with 'pd.index' for one
#> No pandas idx added. Use update_roles() with 'pd.index' for one
  • Write the transformed data, export the metadata to the database, and write the board YAML file.
rpwf_write_df(d)
#> Creating new version '20221219T051130Z-06963'
#> Writing to pin 'df.b1b6afb83db8b5cd2753f5b454bf7774.parquet'
#> Creating new version '20221219T051130Z-4a858'
#> Writing to pin 'df.a8472f0060dc2de7b5f2701fa91f07e0.parquet'
rpwf_export_db(d, db_con)
#> Exporting workflows to db...
#> [1] 2
rpwf_write_board_yaml(board, paste(tmp_dir, "board.yml", sep = "/"))

Get the data in Python

  • Import the modules
  • Create a database object and a board from the written YAML file
from rpwf import database, rpwf
from pathlib import Path

# Placeholder strings: substitute the values from the R session
db_path = "<replace with tmp_dir>"
board_yml = '<replace with paste(tmp_dir, "board.yml", sep = "/")>'

db_obj = database.Base(db_path)
board_obj = database.Board(board_yml)
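
The two placeholder paths can be derived from one value with pathlib, which is already imported above. This is only a minimal sketch: the helper name export_paths and the example folder are hypothetical, and it follows the placeholders above in treating db_path as the R temp folder itself.

```python
from pathlib import Path
from typing import Tuple


def export_paths(tmp_dir: str) -> Tuple[Path, Path]:
    """Return (db_path, board_yml) for a folder exported from R.

    Hypothetical helper, not part of rpwf. `tmp_dir` is the value of
    tempdir() copied from the R session.
    """
    root = Path(tmp_dir)
    # Mirrors paste(tmp_dir, "board.yml", sep = "/") on the R side
    return root, root / "board.yml"


# Example folder name is made up; use the one printed by R
db_path, board_yml = export_paths("/tmp/RtmpExample")
```

The returned values can then be passed to database.Base and database.Board in place of the placeholders.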
  • List all the exported workflows as follows
db_obj.all_wflow()

# wflow_id    model_tag   recipe_tag    result_pin_name   model_pin_name
# 1           None        r             None              None
# 2           None        r1            None              None
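
To illustrate what picking a workflow from such a listing amounts to, here is a sketch using only the standard-library sqlite3 module on a throwaway in-memory table. The table name wflow and its columns are assumptions copied from the printed output above, not the actual rpwf schema, and wflow_ids_for_recipe is a hypothetical helper.

```python
import sqlite3

# In-memory stand-in for the exported database (assumed layout)
con = sqlite3.connect(":memory:")
con.execute(
    """
    CREATE TABLE wflow (
        wflow_id INTEGER PRIMARY KEY,
        model_tag TEXT,
        recipe_tag TEXT,
        result_pin_name TEXT,
        model_pin_name TEXT
    )
    """
)
# Same two rows as the listing above; unset columns stay NULL (None)
con.executemany(
    "INSERT INTO wflow (wflow_id, recipe_tag) VALUES (?, ?)",
    [(1, "r"), (2, "r1")],
)


def wflow_ids_for_recipe(con: sqlite3.Connection, recipe_tag: str) -> list:
    """Return the wflow_id values whose recipe_tag matches."""
    rows = con.execute(
        "SELECT wflow_id FROM wflow WHERE recipe_tag = ? ORDER BY wflow_id",
        (recipe_tag,),
    ).fetchall()
    return [row[0] for row in rows]


print(wflow_ids_for_recipe(con, "r1"))  # [2]
```

Looking a workflow up by its recipe tag rather than hard-coding the id keeps the script valid if the export order changes.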
  • Pick a wflow_id and create an rpwf.Wflow object associated with it
wflow_id = 2
wflow_obj = rpwf.Wflow(db_obj, board_obj, wflow_id)
  • Finally, create an rpwf.TrainDf object and use its get_df_X and get_df_y methods to get the training pandas.DataFrame and the response pandas.Series
df_obj = rpwf.TrainDf(db_obj, board_obj, wflow_obj)
X, y = df_obj.get_df_X(), df_obj.get_df_y()