Lightweight Transforms and the S3 API for improved performance

Does anyone have experience using lightweight transforms (DuckDB) with the S3 API? I watched the discussion between Chad and Nicolas with great interest; this combination was emphasized, but without specific examples of how to implement it.

I’m very familiar with these APIs and excited about the possibility that I’m missing something - and keen to learn more. I'd appreciate any feedback and examples here.

Referenced discussion:

https://www.youtube.com/watch?v=PGNWo-UuXLs

1 Like

Glad you found our conversation useful. :victory_hand:

As far as I know, Palantir PD is close to shipping official lightweight DuckDB bindings - maybe @jayad could share more concrete information.

For our use cases, I provided our devs with a common reusable component that hooks into the 'internal' pieces of the lightweight APIs, and I plan to move over to the official bindings once they are available!

Once that is done, and if there is interest, I can open-source the SQLFrame glue code for running existing PySpark code on DuckDB.

1 Like

The official docs arrived :slight_smile:

https://www.palantir.com/docs/foundry/transforms-python/duckdb/

To get a SQLFrame session, you could use this helper function:

def get_sqlframe_duckdb_session(
    ddb_connection,
    enable_external_file_cache: bool = False,  # reserved; not used in this snippet
) -> "DuckDBSession":
    from sqlframe.duckdb.session import DuckDBSession
    from sqlglot.dialects.dialect import NormalizationStrategy
    from sqlglot.dialects.duckdb import DuckDB
    from sqlglot.dialects.spark import Spark

    # Set the normalization strategy to CASE_SENSITIVE to ensure that the column names are consistent
    # with the equivalent Spark code. Without this, column names get folded to lower case.
    Spark.NORMALIZATION_STRATEGY = NormalizationStrategy.CASE_SENSITIVE  # type: ignore
    DuckDB.NORMALIZATION_STRATEGY = NormalizationStrategy.CASE_SENSITIVE  # type: ignore

    ddb_connection.execute("SET preserve_insertion_order = false;")

    session: DuckDBSession = DuckDBSession(conn=ddb_connection)

    return session

1 Like

Thank you for flagging this! We’ve been looking forward to this update and will be migrating our internal libraries to these streamlined APIs. Additionally, thank you for sharing the SQLFrame example - this is an awesome idea for migrating transforms with minimal impact to their logic.

Awesome update!

Following up to acknowledge that documentation on creating unit tests for DuckDB transforms has also been added:

See:
Example data transform
Example test with validation against CSV files

https://www.palantir.com/docs/foundry/transforms-python/unit-tests