Hello everyone,
There might be some crucial setup step I missed, but as far as I can tell right now, there is a design flaw in how projections are used in relation to schedules.
Let’s assume the following: I have a dataset with a region column. A set of downstream datasets uses this dataset as input, each filtering on a subset of regions. So, I have set up a projection on the source dataset with a filter on “region”. I have put the projection on a schedule so that it builds immediately once the base dataset updates.
Now, all downstream datasets have a trigger on the original dataset, so they start building as soon as the input dataset has a committed transaction. All of these builds would benefit greatly from the projection. However, since the projection and the downstream datasets are triggered in parallel, the projection does not provide any benefit, because it is still out of date when the downstream builds read their input.
Request: Could we tie the schedule “trigger” to completion of the projection build instead of the dataset transaction commit? This way we can ensure that the downstream datasets are triggered only once the projection has been updated. This would make our pipelines much more efficient and make projections far more useful in schedules that are transaction-triggered rather than time-triggered.
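To make the ordering problem concrete, here is a toy model in plain Python. It is only a sketch: none of these names are real platform APIs, and it assumes (as a simplification of what I described above) that a downstream build can use the projection only when the projection reflects the latest transaction of the base dataset.

```python
from __future__ import annotations

from dataclasses import dataclass


# Toy model only: every name here is hypothetical, not a platform API.
@dataclass
class Projection:
    synced_txn: int = 0  # last base-dataset transaction the projection reflects

    def build(self, txn: int) -> None:
        """Bring the projection up to date with the given transaction."""
        self.synced_txn = txn


def downstream_read(base_txn: int, projection: Projection) -> str:
    """Return which source a downstream build reads from in this model."""
    # Assumption: a stale projection cannot be used, so the build falls back
    # to the base dataset.
    return "projection" if projection.synced_txn == base_txn else "base dataset"


def trigger_on_commit(base_txn: int, projection: Projection, n: int) -> list[str]:
    """Current behaviour: downstream builds start on commit, while the
    projection is still being rebuilt."""
    reads = [downstream_read(base_txn, projection) for _ in range(n)]
    projection.build(base_txn)  # completes too late to help these builds
    return reads


def trigger_on_projection_complete(base_txn: int, projection: Projection, n: int) -> list[str]:
    """Requested behaviour: downstream builds start only after the
    projection build has completed."""
    projection.build(base_txn)
    return [downstream_read(base_txn, projection) for _ in range(n)]


current = trigger_on_commit(1, Projection(), 3)
requested = trigger_on_projection_complete(1, Projection(), 3)
print(current)
print(requested)
```

In this model, every downstream build under the current trigger falls back to the base dataset, while under the requested trigger every build reads from the projection.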
Also, a smaller request: In the build of a dataset, could you please show in the UI whether the build is hitting a projection or the dataset itself? That would make debugging and troubleshooting much more efficient. (This could be similar to how “incremental” vs. “non-incremental” build status are shown …