Pipeline Optimization: Dataset Projection: how do they work and how they relate to Hive partitioning/bucketing

Arnau · June 6, 2024, 10:12am

Hello,

I would like to understand several aspects of dataset projections in order to know what should I use to optimize my pipelines:

Technology foundation: I assume projections are based on Hive partitioning/bucketing, is this correct?
How it works: We can create many different projections for joining or filtering using different configurations.
- A subsequent transform will pick one or the other depending on the operation it is implemented (filtering/joining) and columns involved?
- A subsequent transform only a single projection will be taken?
Best practices: In the current Foundry documentation I could not find information about Hive partition/bucketing. Does this mean that for such use cases the best practices would be to use projections always?

Many thanks for the clarifications on this topic.

sandpiper · June 7, 2024, 8:01am

Hi Arnau,

Those are great questions.

Join-optimized projections are based on bucketing. Filter-optimized projections use a proprietary Palantir file format that is especially advantageous over Hive partitioning when you need to filter on high-cardinality columns like timestamps. Both join and filter-optimized projections automatically compact when the amount of files reach a certain point, which is another advantage of filter-optimized projections over Hive partitioning for incremental pipelines.

Bucketing and hive partitioning are also supported in Foundry! For how to do it in e.g. Python transforms, see the partition_cols, bucket_cols, bucket_count, and sort_by methods on the write_dataframe method of TransformOutput. However, an important caveat is that bucketing does not work unless you are writing to the output with a SNAPSHOT transaction. To achieve bucketing with an APPEND-based incremental pipeline, your only option is to use a join-optimized projection.

To boil the above points down into some concrete heuristics:

If you are writing to a dataset via APPEND (or additive UPDATE) transactions: join-optimized projections, filter-optimized projections, and Hive partitioning are reasonable options (but beware of the count of output files increasing dramatically over time in the case of Hive partitioning, which can impact performance of consumers).
If you are writing to a dataset via SNAPSHOT transactions: you should generally just use bucketing or hive partitioning, but there may be some cases where a filter-optimized projection is useful here.

Regarding the behavior of subsequent transforms - yes, they will pick one or the other depending on the operation and columns involved. I believe that if you reference the input multiple times in the same transform, it is possible for different projections to be used each time, though I must confess that I’m not 100% sure about that (perhaps someone else can test the behavior there and chime in).