R code execution in Code Repositories without RStudio licence

We would like to integrate R code into Palantir Code Repositories so that it can be executed in pipelines. We are aware of the option to select the RStudio workspace as the Code Repositories template, but we do not have an RStudio licence. Palantir AIP and the documentation suggest that this should be possible through manual configuration.

So far we have tried, for example:

  • following the Palantir documentation to write and configure a simple R transform in an empty or Python-based Code Repositories template. However, datasets are not built because the pipelines (YML in the .transforms directory) are not recognized. Sometimes I get an exception like Could not determine the dependencies of task ':transforms-python:condaTest' (missing conda-locks/conda-versions.run.linux-64.lock)
  • using rpy2, but I get Error: rpy2 in API mode cannot be built without R in the PATH or R_HOME defined. Correct this or force ABI mode-only by defining the environment variable RPY2_CFFI_MODE=ABI. Adding os.environ['RPY2_CFFI_MODE'] = 'ABI' did not help.

Even adding foundry-transforms-lib-r, r-base, r-renv, and r-essentials to meta.yaml did not help.

Is it possible to run R code (to build datasets) in Code Repositories without an RStudio licence? If so, how should Code Repositories be configured for this?


Hi @embar,

Since r-base is available on conda-forge, you can add it as a regular dependency to your classic Code Repository. In addition, the rpy2 package is convenient for translating between the Python and R layers.

I have created a hello-world example that should get you started on the idea. Developing in the in-platform VS Code also works.

Please note that this is a very simple example, but we have had similar code running in production for a few years. If you need libraries, you can add them from conda-forge; usually you can take the name from CRAN and prefix it with r-, for example r-dplyr.

meta.yaml file:

package:
  name: "{{ PACKAGE_NAME }}"
  version: "{{ PACKAGE_VERSION }}"

source:
  path: ../src

requirements:
  build:
    - python
    - setuptools

  run:
    - python
    - transforms {{ PYTHON_TRANSFORMS_VERSION }}
    - transforms-preview
    - pyarrow    
    - r-base 4.3.*
    - rpy2
    - pandas

build:
  script: python setup.py install --single-version-externally-managed --record=record.txt

Sample transform file:

from pandas import DataFrame
from transforms.api import transform_pandas, configure
from transforms.api import (
    Input,
    Output,
)


@configure(profile=["KUBERNETES_NO_EXECUTORS", "ARROW_ENABLED"])
@transform_pandas(
    Output("ri.foundry.main.dataset.3525e70f-fb1e-4c25-8b58-cb42b6df1c06"),
    iris=Input("ri.foundry.main.dataset.333a7732-87f7-4d68-8867-fddee60aebf3"),
)
def run_with_r(iris: DataFrame) -> DataFrame:
    # Imports need to be inside the transform, otherwise Checks will fail
    # with "ValueError: openrlib.R_HOME cannot be None."
    import rpy2.robjects as ro
    from rpy2.robjects import pandas2ri
    from rpy2.robjects.conversion import localconverter

    _ = ro.r("""
        modify_data <- function(df) {
            df <- df[1:10,]
            return(df)
        }
        """)

    with localconverter(ro.default_converter + pandas2ri.converter):
        r_function = ro.globalenv["modify_data"]
        return_df = r_function(iris)

    return return_df

Note that rpy2 offers different possibilities for running your R code: either by calling everything from Python through bindings, or by “sourcing” raw R code that you can store in your repository. If you want to store files with a *.R extension, you need to add the following lines to your setup.py:

    package_data={
        '': ['*.R']
    },
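
For orientation, here is a minimal sketch of where that argument fits in a standard setuptools setup.py; the setup.py generated in your repository will contain additional arguments (name, version, and so on) that should be left as they are:

from setuptools import find_packages, setup

setup(
    # ... keep the generated name/version/etc. arguments of your repository ...
    packages=find_packages(exclude=['contrib', 'docs', 'test']),
    # Ship *.R files located next to your Python modules inside the built package
    # so they can be loaded at runtime (e.g. via pkg_resources)
    package_data={
        '': ['*.R']
    },
)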

If you do serious work in R you will probably need to increase the DRIVER_MEMORY_OVERHEAD using Spark Profiles. I have added ARROW_ENABLED to speed up the conversion into the pandas DataFrame.
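
For example, the decorators could then look roughly like this (DRIVER_MEMORY_OVERHEAD_LARGE is an assumed profile name; check which Spark profiles are actually available and imported in your project):

# Assumed profile name - verify against the Spark profiles available in your project
@configure(profile=["KUBERNETES_NO_EXECUTORS", "ARROW_ENABLED", "DRIVER_MEMORY_OVERHEAD_LARGE"])
@transform_pandas(
    Output("ri.foundry.main.dataset.3525e70f-fb1e-4c25-8b58-cb42b6df1c06"),
    iris=Input("ri.foundry.main.dataset.333a7732-87f7-4d68-8867-fddee60aebf3"),
)
def run_with_r(iris: DataFrame) -> DataFrame:
    ...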


Thank you, @nicornk, for such a quick and detailed answer!

However, we prefer to use Polars instead of Pandas due to its significantly better computation speed and lower memory consumption.

I initially tried to provide a solution using lightweight transforms; however, lightweight does not properly pack and unpack the conda environment, so using R from conda-forge fails.

One thing you could do is download all input parquet files to a temporary directory (which is essentially all lightweight does anyway) and use r-arrow's read_parquet to read them as a data frame from within the R context. No need to translate between Python and R.
You could then write your result with write_parquet and upload the parquet files using the filesystem API; however, you would have to apply a schema to the output manually.
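
For illustration, here is a rough, untested sketch of that idea using the transforms FileSystem API and the r-arrow conda package (which you would add to meta.yaml); the dataset RIDs, the glob pattern and the R transformation are placeholders:

import os
import shutil
import tempfile

from transforms.api import transform, Input, Output


@transform(
    out=Output("ri.foundry.main.dataset.output"),
    source=Input("ri.foundry.main.dataset.input"),
)
def run_with_r_arrow(out, source):
    import rpy2.robjects as ro

    in_dir = tempfile.mkdtemp()
    out_dir = tempfile.mkdtemp()

    # Download every input parquet file to a local temporary directory
    for f in source.filesystem().ls(glob="*.parquet"):
        local_path = os.path.join(in_dir, os.path.basename(f.path))
        with source.filesystem().open(f.path, "rb") as src, open(local_path, "wb") as dst:
            shutil.copyfileobj(src, dst)

    # Read, transform and write entirely inside R using the arrow package
    ro.r(f"""
        library(arrow)
        files <- list.files("{in_dir}", pattern = "\\\\.parquet$", full.names = TRUE)
        df <- do.call(rbind, lapply(files, read_parquet))
        df <- df[1:10, ]  # placeholder transformation
        write_parquet(df, file.path("{out_dir}", "result.parquet"))
    """)

    # Upload the result parquet file(s); the output dataset schema still has to
    # be applied manually, as mentioned above
    for name in os.listdir(out_dir):
        with open(os.path.join(out_dir, name), "rb") as src, out.filesystem().open(name, "wb") as dst:
            shutil.copyfileobj(src, dst)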

Another option would be to build your own Docker image (with R) outside of Foundry, push it to Foundry, and use it in lightweight. That should work around the conda-pack limitations.
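
Purely as a sketch of that second option (untested here; the decorator arguments follow the bring-your-own-container documentation, and the image name, tag and script are placeholders):

import subprocess

from transforms.api import Input, Output, lightweight, transform


# Assumes an image with R preinstalled has already been pushed to the
# project's Artifacts repository; image name and tag are placeholders
@lightweight(container_image="my-r-image", container_tag="0.0.1")
@transform(
    out=Output("ri.foundry.main.dataset.output"),
    source=Input("ri.foundry.main.dataset.input"),
)
def run_in_container(out, source):
    # Inside the container, R can be invoked directly, e.g. an Rscript that reads
    # parquet files downloaded via source.filesystem() and writes results that are
    # then uploaded via out.filesystem()
    subprocess.run(["Rscript", "my_custom_script.R"], check=True)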


The solution you provided indeed works! I tested your proposed code; it seems @configure is not critical (and in fact I did not have the rights to re-configure). I also had to modify it to take and return a PySpark DataFrame and not to crash on Null/NaN values in string columns:

from transforms.api import configure, transform_df, Input, Output


@configure(profile=["ARROW_ENABLED"])
@transform_df(
    Output("ri.foundry.main.dataset.output"),
    source_df=Input("ri.foundry.main.dataset.input"),
)
def run_with_r(ctx, source_df):
    # Imports need to be inside the transform, otherwise Checks will fail
    # with "ValueError: openrlib.R_HOME cannot be None."
    import rpy2.robjects as ro
    from rpy2.robjects import pandas2ri
    from rpy2.robjects.conversion import localconverter

    # Define R function
    _ = ro.r("""
        modify_data <- function(df) {
            df <- df[1:10,]
            return(df)
        }
        """)

    # Optionally, replace null values, otherwise you will get errors like:
    # - pyspark.errors.exceptions.base.PySparkTypeError: [CANNOT_MERGE_TYPE] Can not merge type `StructType` and `StringType`
    # - _pickle.PicklingError: Could not serialize object: RRuntimeError: Error in
    #   (function (object, connection, ascii = FALSE, xdr = TRUE, version = NULL,  : unimplemented type 'char' in 'eval'
    source_df = source_df.fillna("NA_character_")

    # Convert the PySpark DataFrame to an interim pandas DataFrame (so it can be converted to an R data frame)
    source_pandas_df = source_df.toPandas()

    # Apply R function
    with localconverter(ro.default_converter + pandas2ri.converter):
        r_function = ro.globalenv["modify_data"]
        result_pandas_df = r_function(source_pandas_df)

    # Convert the result back to a PySpark DataFrame
    result_spark_df = ctx.spark_session.createDataFrame(result_pandas_df)

    # Optionally, convert the original Null values, stored as "NA_character_", back to Null
    result_spark_df = result_spark_df.replace({"NA_character_": None})

    return result_spark_df


Dear @nicornk, you mention that

<…> rpy2 offers different possibilities for running your R code, either <…> or by “sourcing” raw R code that you can store in your repository.

However, such a line of code (with this or any other absolute or relative path to the *.R file)

rpy2.robjects.r.source("./my_custom_script.R")

results in the error: rpy2.rinterface_lib.embedded.RRuntimeError: Error in file(filename, "r", encoding = encoding) : cannot open the connection

Even if I use the simplest demo.R:

modify_data <- function(df) {
    df <- df[1:10,]
    return(df)
}

and try to read it:

from pkg_resources import resource_stream

r_script = resource_stream(
    __name__, "./demo.R"
).read().decode('utf-8')

then later, at the _ = ro.r(r_script) line, I get the error: rpy2.rinterface_lib._rinterface_capi.RParsingError: Parsing status not OK - PARSING_STATUS.PARSE_ERROR

Did you succeed in reading an R file stored in Code Repositories?

Edit:

Reading the R code was solved by:

    # Read the R script packaged alongside this module
    from pkg_resources import resource_stream

    r_script = resource_stream(
        __name__, "./demo.R"
    ).read().decode('utf-8')
    # Normalize Windows line endings, which otherwise cause the parse error
    r_script = r_script.replace('\r\n', '\n')
    _ = ro.r(r_script)
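
An alternative that may also work (untested here) is to resolve the packaged *.R file to an absolute path with pkg_resources and let R source it directly, which avoids the relative-path problem:

    # Untested alternative: resolve demo.R to an absolute path inside the
    # installed package and let R's source() read it directly
    from pkg_resources import resource_filename

    r_path = resource_filename(__name__, "demo.R")
    ro.r.source(r_path)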