Export datasets from Foundry via API

Hi all,

I’m wondering what the most straightforward way is to export datasets from Foundry to an external system (e.g. S3) via API calls.
I understand that the recommended way to do exports is with Data Connection Exports (https://www.palantir.com/docs/foundry/data-connection/export-overview/), but I am looking for a way to do exports through an API, since I have an external system that already interacts heavily with Foundry via APIs. I am looking for the most vanilla way to integrate the “export dataset” flow there.

Thanks!

I found this: https://www.palantir.com/docs/foundry/data-integration/foundry-s3-api#aws-sdk-for-python-boto3 → It seems to offer a way to interact with datasets stored in Foundry from an external codebase. Looks like PutObject would do what I want: https://www.palantir.com/docs/foundry/data-integration/foundry-s3-api#supported-actions
Not sure if this is the recommended way to do exports through an API, though.
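
For reference, here is a rough sketch of what that could look like from the external side, based on the boto3 snippet in those docs. Every value below is a placeholder of mine; the setup guide covers where the real endpoint and credentials come from:

import boto3

# Placeholders throughout; the setup guide explains where the real values come from.
s3 = boto3.client(
    's3',
    endpoint_url="<FOUNDRY_S3_ENDPOINT_URL>",  # Foundry's S3-compatible endpoint
    aws_access_key_id="<see setup guide>",
    aws_secret_access_key="<see setup guide>",
)

# With the Foundry S3 API, the "bucket" name is the dataset RID.
bucket = "ri.foundry.main.dataset.<uuid>"

# PutObject writes a file into the dataset; ListObjectsV2/GetObject read files back out.
s3.put_object(Bucket=bucket, Key="exported/file.csv", Body=b"col_a,col_b\n1,2\n")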

If you need more flexibility than Data Connection Exports offer, then External Transforms are the recommended way to call APIs from Foundry. The source can be configured to allow exporting data: https://www.palantir.com/docs/foundry/data-integration/external-transforms-source-based/#configure-export-controls-on-the-source
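
For the source-based flavor, a rough sketch could look like the following. The source RID and the /upload endpoint are made-up placeholders, and the source itself must have exports enabled as described in the link above:

from transforms.api import transform, Input
from transforms.external.systems import external_systems, Source


@external_systems(
    export_target=Source("ri.magritte..source.<uuid>")  # a source configured to allow exports
)
@transform(
    input_data=Input('/path/to/input/dataset'),
)
def compute(export_target, input_data):
    # The source exposes a pre-configured HTTP client bound to its egress policy.
    connection = export_target.get_https_connection()
    client = connection.get_client()
    for file_status in input_data.filesystem().ls():
        with input_data.filesystem().open(file_status.path, "rb") as f:
            # Hypothetical endpoint; replace with whatever your external system expects.
            client.post(f"{connection.url}/upload", data=f)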


It depends on which way you see the API working:

  • If you want Foundry to push to an API, then you can use External Transforms
  • If you want Foundry to expose an API for something downstream to pull files, then you can use the Foundry S3 API

In the first case, using External Transforms, you can export data as well as ingest it, since those transforms can read from datasets too. You then have full control over the API calls you want to perform.
You might also need the docs on how to access files in a transform.

Untested code - it's unlikely to work as-is, but it should give you a high-level overview of what is possible:

import boto3
from transforms.api import transform, Input
from transforms.external.systems import use_external_systems, ExportControl, EgressPolicy


@use_external_systems(
    export_control=ExportControl(markings=['<marking ID of the resource intended to be exported>']),
    egress=EgressPolicy('<policy RID of the egress policy to the S3 bucket>'),
)
@transform(
    # output_data=Output('/path/to/output/dataset'),  # useful if you want to log which files were sent, etc.
    input_data=Input('/path/to/input/dataset'),
)
def compute(export_control, egress, input_data):
    # Prepare the list of files to process
    fs = input_data.filesystem()
    file_paths = [f.path for f in fs.ls()]

    # Prepare the access to the S3 bucket.
    # Replace the placeholders with your bucket's information; in practice,
    # prefer pulling credentials from a secret rather than hard-coding them.
    s3_bucket_name = "<YOUR_S3_BUCKET_NAME>"
    s3_key_prefix = "<YOUR_S3_KEY_PREFIX>"
    s3 = boto3.client(
        's3',
        aws_access_key_id="<ACCESS_KEY_ID>",
        aws_secret_access_key="<SECRET_ACCESS_KEY>",
        endpoint_url="<URL_OF_S3>",
        region_name="<THE_REGION>",
    )

    # Upload each file as-is. upload_fileobj expects a file-like object, so the
    # raw stream can be passed straight through without parsing it first.
    # Note: you could use an rdd.flatMap() to process files in parallel instead.
    for file_path in file_paths:
        with fs.open(file_path, "rb") as f:
            s3.upload_fileobj(f, s3_bucket_name, f"{s3_key_prefix}/{file_path}")


Thanks for the comprehensive answers. In my case, I believe I have to go with the Foundry S3 API. A further restriction in my context is that, for organizational and process reasons, the code has to live in the external, customer-managed system I mentioned (i.e. not in Foundry). External Transforms rely on Foundry primitives and are therefore not an option here.
The way I understand it, the Foundry S3 API is still a good solution for my problem - I should be able to use code like this (https://www.palantir.com/docs/foundry/data-integration/foundry-s3-api#aws-sdk-for-python-boto3) in my external system after running through the first three steps of the setup guide (https://www.palantir.com/docs/foundry/data-integration/foundry-s3-api#setup-guide).
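
Concretely, the export direction from my external system would then be a pull: list the dataset's files through the S3-compatible endpoint and copy them wherever they need to go. A minimal sketch, assuming the endpoint and credentials from the setup guide (all placeholder values and the destination bucket are mine):

import boto3

# Client against Foundry's S3-compatible endpoint; values come from the setup guide.
foundry = boto3.client(
    's3',
    endpoint_url="<FOUNDRY_S3_ENDPOINT_URL>",
    aws_access_key_id="<see setup guide>",
    aws_secret_access_key="<see setup guide>",
)

# A second, ordinary client for the destination bucket in the external system.
destination = boto3.client('s3', region_name="<THE_REGION>")

dataset_rid = "ri.foundry.main.dataset.<uuid>"  # the "bucket" is the dataset RID

# Stream every file of the dataset straight into the destination bucket
# (pagination omitted for brevity).
for obj in foundry.list_objects_v2(Bucket=dataset_rid).get("Contents", []):
    body = foundry.get_object(Bucket=dataset_rid, Key=obj["Key"])["Body"]
    destination.upload_fileobj(body, "<YOUR_S3_BUCKET_NAME>", obj["Key"])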
