I’m using Foundry to manage a dataset where users upload files via a Workshop app. The dataset is the input for a Pipeline Builder pipeline that outputs an ObjectSet.
What I Did:
- Set the dataset ingestion mode to Snapshot Replace because I want only the latest uploaded file to be present in the dataset after each upload.
- Configured a schedule to run the pipeline when the dataset is updated.
- Uploaded a new file via the file uploader.
Problem:
Even after setting the ingestion mode to Snapshot Replace, when I upload a new file, the old file remains in the dataset alongside the new one. My expectation was that the dataset would only contain the new file, but both the old and new files are present, and the pipeline processes both.
Question:
Is there something I’m missing in the configuration of Snapshot Replace mode? How can I ensure that only the most recently uploaded file is present in the dataset and processed by the pipeline?
Thanks in advance for any guidance!
Are you perhaps referring to the Snapshot Replace “Backing dataset write mode” in Pipeline Builder?
If so, it looks like you might be conflating a few different things here. The backing dataset of the Object Type (which Pipeline Builder automatically creates when you have an Object Type output) is separate from the input dataset. Since you’re updating the input dataset from Workshop using (I assume) the Media Uploader widget, and the Media Uploader widget performs an APPEND transaction, the input dataset will always include all of the files ever uploaded.
It sounds like your ultimate goal here is for the Object Type (= the Object Type’s backing dataset) to only contain data from the latest file. The simplest way to do this is the following.
- Set the input computation mode to Incremental. This will ensure that your pipeline only processes files that were added after the last run (this setting isn’t strictly necessary because of the filter in the next step, but it will give you better performance).
- Because it’s possible that multiple files might have been added since the last pipeline run (if the schedule was paused, etc.), ensure that your pipeline logic filters to only records from the latest file. This is straightforward to achieve with the following logic (assuming that the input dataset is a CSV and you added the `_importedAt` special column in the dataset schema configuration).
- Configure “Snapshot only new rows” with `_filePath` as the primary key for the backing dataset write mode. Note that the “primary key” label here is a bit imprecise, because the column specified here isn’t really a primary key, but rather just the key used in the anti-join. The anti-join itself doesn’t do anything useful in this particular case; there’s just no way to avoid it when using “Snapshot only new rows,” which is the only write mode that achieves the behavior you want here. Technically, you could specify any column that you know will not have any overlap with the existing output data; `_filePath` is just the most convenient / natural one to use.
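
To make the intent of the last two steps concrete, here is a minimal sketch in plain Python. It is only a model of the behavior described above, not Pipeline Builder code or any Foundry API. The function names `keep_latest_file` and `snapshot_only_new_rows`, the file names, and the sample values are all made up for illustration, and it assumes every row from one file carries the same `_importedAt` value.

```python
from datetime import datetime


def keep_latest_file(rows):
    """Keep only the rows whose _importedAt equals the most recent value.

    Assumes all rows from one uploaded file share a single _importedAt.
    """
    if not rows:
        return []
    latest = max(row["_importedAt"] for row in rows)
    return [row for row in rows if row["_importedAt"] == latest]


def snapshot_only_new_rows(candidate_rows, existing_output, key="_filePath"):
    """Model of the write mode as described above: anti-join the candidate
    rows against the existing output on `key`, then let the result replace
    the output entirely."""
    existing_keys = {row[key] for row in existing_output}
    return [row for row in candidate_rows if row[key] not in existing_keys]


# Hypothetical input dataset after two uploads (APPEND keeps both files).
uploads = [
    {"_filePath": "old.csv", "_importedAt": datetime(2024, 1, 1, 9, 0), "value": 1},
    {"_filePath": "old.csv", "_importedAt": datetime(2024, 1, 1, 9, 0), "value": 2},
    {"_filePath": "new.csv", "_importedAt": datetime(2024, 1, 2, 9, 0), "value": 3},
]

# Output written by the previous run, when only old.csv existed.
previous_output = [row for row in uploads if row["_filePath"] == "old.csv"]

latest_rows = keep_latest_file(uploads)
new_output = snapshot_only_new_rows(latest_rows, previous_output)

for row in new_output:
    print(row["_filePath"], row["value"])
```

Because a newly uploaded file has a `_filePath` that does not appear in the previous output, the anti-join removes nothing here, which matches the point that it is effectively a no-op in this setup.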
Thank you for your reply. I will try those approaches and report back on which one is the most convenient.