Pipeline Optimization: Dataset projections: how they work and how they relate to Hive partitioning/bucketing

Hi Arnau,

Those are great questions.

Join-optimized projections are based on bucketing. Filter-optimized projections use a proprietary Palantir file format that is especially advantageous over Hive partitioning when you need to filter on high-cardinality columns like timestamps. Both join-optimized and filter-optimized projections automatically compact once the number of files reaches a certain threshold, which is another advantage of filter-optimized projections over Hive partitioning for incremental pipelines.
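To make the bucketing idea concrete, here is a minimal pure-Python sketch of why bucketed data helps joins. It is only an illustration, not how Foundry or Spark implement it: the function names are made up for this example, and `zlib.crc32` stands in for the engine's real hash function. The point is that rows with the same key always land in the same bucket number on both sides, so each bucket can be joined on its own without redistributing the data first.

```python
import zlib
from collections import defaultdict


def bucket_of(key, bucket_count):
    """Map a key to a bucket number with a deterministic hash.

    Assumption: crc32 is a stand-in; real engines use their own hash.
    """
    return zlib.crc32(str(key).encode("utf-8")) % bucket_count


def bucketize(rows, key, bucket_count):
    """Group rows (dicts) by the bucket of their join key."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_of(row[key], bucket_count)].append(row)
    return buckets


def bucketed_join(left, right, key, bucket_count):
    """Inner join that only ever compares rows within the same bucket."""
    left_buckets = bucketize(left, key, bucket_count)
    right_buckets = bucketize(right, key, bucket_count)
    joined = []
    for b in range(bucket_count):
        # Index the right-hand rows of this bucket by key.
        index = defaultdict(list)
        for r in right_buckets.get(b, []):
            index[r[key]].append(r)
        # Matching keys are guaranteed to be in the same bucket.
        for l in left_buckets.get(b, []):
            for r in index.get(l[key], []):
                joined.append({**r, **l})
    return joined


# Hypothetical sample data
orders = [{"customer_id": 1, "total": 10}, {"customer_id": 2, "total": 25}]
customers = [{"customer_id": 2, "name": "Ada"}, {"customer_id": 3, "name": "Bo"}]
matches = bucketed_join(orders, customers, "customer_id", bucket_count=4)
```

If both datasets are already stored with this layout on the join key, the `bucketize` step is effectively done at write time, which is where the join speedup comes from.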

Bucketing and Hive partitioning are also supported in Foundry! For how to do it in e.g. Python transforms, see the partition_cols, bucket_cols, bucket_count, and sort_by parameters of the write_dataframe method of TransformOutput. However, an important caveat is that bucketing does not work unless you are writing to the output with a SNAPSHOT transaction. To achieve bucketing with an APPEND-based incremental pipeline, your only option is to use a join-optimized projection.
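As a rough sketch of what that looks like in a Python transform: the helper below passes the four parameters to write_dataframe. The column names, the bucket count of 200, the dataset paths, and the helper name `write_with_layout` are all placeholders I made up for illustration, and you would normally pick only the parameters that suit your data rather than all four. The decorator wiring is shown in comments because it only runs inside a Foundry code repository.

```python
# Inside a Foundry repository the helper would be wired up roughly like this
# (paths are hypothetical):
#
#   from transforms.api import transform, Input, Output
#
#   @transform(
#       out=Output("/path/to/output_dataset"),
#       src=Input("/path/to/input_dataset"),
#   )
#   def compute(out, src):
#       write_with_layout(out, src.dataframe())


def write_with_layout(out, df):
    """Write df with Hive partitioning and bucketing.

    `out` is expected to be a TransformOutput. Remember the caveat above:
    bucketing only takes effect when the output is written with a SNAPSHOT
    transaction.
    """
    out.write_dataframe(
        df,
        partition_cols=["event_date"],  # low-cardinality column (placeholder)
        bucket_cols=["customer_id"],    # typical join key (placeholder)
        bucket_count=200,               # placeholder; tune to your data size
        sort_by=["customer_id"],        # sort within each bucket
    )
```

Partitioning on a low-cardinality column such as a date and bucketing on the join key is a common combination, but treat the specific values here as a starting point only.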

To boil the above points down into some concrete heuristics:

  1. If you are writing to a dataset via APPEND (or additive UPDATE) transactions: join-optimized projections, filter-optimized projections, and Hive partitioning are all reasonable options (but with Hive partitioning, beware of the number of output files growing dramatically over time, which can hurt the performance of consumers).

  2. If you are writing to a dataset via SNAPSHOT transactions: you should generally just use bucketing or Hive partitioning, though there may be some cases where a filter-optimized projection is useful here.

Regarding the behavior of subsequent transforms - yes, they will pick one or the other depending on the operation and columns involved. I believe that if you reference the input multiple times in the same transform, it is possible for different projections to be used each time, though I must confess that I’m not 100% sure about that (perhaps someone else can test the behavior there and chime in).
