I’m working on an existing incremental pipeline using Foundry branching. On my branch I’m trying to add a new output to an existing Python transform that has the incremental decorator. I’ve added my new output as a MediaSetOutput, but when I build I receive this error: ValueError: Media set output should be snapshotted, but is not configured to do so. Resolved by setting "should_snapshot=True".
If I add should_snapshot=True to my MediaSetOutput, I instead get the error: Failed to start transactions on output datasets: MediaSet:CannotSnapshotNonTransactionalMediaSet {}.
If I add should_snapshot=False to my MediaSetOutput, I get the same "Media set output should be snapshotted" error as before.
If I visit my mediaset directly, delete the jobspec on my branch, and then try to upload a file manually, the file seems to be uploaded to the master branch rather than my branch.
I have seen this other post where Joel appeared to get his incremental transform working with a transactionless mediaset by first running the transform without the incremental decorator. However, I don’t think this is an option for us, as we cannot modify the existing transform to take a snapshot.
Does anyone know how I can use a transactionless mediaset output on a branched incremental python transform? Is this behaviour actually unsupported?
Hi @BenjaminG, I banged my head against the wall with these exact same error messages. I don’t know why this happens with transactionless media sets, and I switched to transactional media sets for this exact reason. Why are you using transactionless media sets? Would it be possible to use transactional instead?
Edit: This post linked below has my updated code that incrementally ran on 4M PDFs (took a few weeks running continuously with a build schedule that used downstream outputs as triggers until it detected no new rows, but it worked). Also potentially helpful, incremental batch size limits could be used to simplify how I batched 10K rows at a time (transactional media set transaction size limit): Limit batch size of incremental inputs
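For anyone unfamiliar with the batching pattern described above, here is a minimal sketch in plain Python of the idea: each run picks up at most one batch of not-yet-processed rows, and runs repeat until nothing is left. This is not Foundry code and not the code from the linked post. `next_batch`, `BATCH_SIZE`, and the file names are hypothetical, and the 10K limit is simply the figure quoted in this thread.

```python
from itertools import islice
from typing import Iterable, List

# Assumption: per-run limit taken from the 10K figure mentioned in this thread.
BATCH_SIZE = 10_000


def next_batch(all_paths: Iterable[str],
               already_processed: Iterable[str],
               limit: int = BATCH_SIZE) -> List[str]:
    """Return up to `limit` paths that have not been processed yet.

    Hypothetical helper: a real incremental transform would work out
    the unprocessed rows from its inputs and previous output instead.
    """
    if limit < 1:
        raise ValueError("limit must be at least 1")
    done = set(already_processed)
    pending = (path for path in all_paths if path not in done)
    return list(islice(pending, limit))


def run_until_done(all_paths: List[str], limit: int = BATCH_SIZE) -> int:
    """Simulate a schedule that re-triggers until no new rows remain.

    Returns the number of runs that actually processed something.
    """
    processed: List[str] = []
    runs = 0
    while True:
        batch = next_batch(all_paths, processed, limit)
        if not batch:
            return runs
        processed.extend(batch)  # stand-in for writing one transaction
        runs += 1
```

With 25 hypothetical paths and a limit of 10, the loop takes three runs of 10, 10, and 5 items before it stops.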
@Joel Thanks for the response, good to know the experience is shared…
I have been trying to use transactionless mediasets to avoid having to limit the work done to 10k files, since I am similarly working with a filesystem dataset containing millions of documents.
FWIW for anyone at Palantir who can work on these features - at least for our use case, it would be totally fine if the transactionless mediaset functioned essentially like blob storage, without incremental or branching features.
I will explore working with a transactional mediaset, though I’ve already found one problem: if I build on a branch and my job fails (i.e. nothing is written to the mediaset), then that mediaset seems to become unusable on that branch. I get the error transforms._errors.RequiredIncrementalTransform: ('Require incremental set to true, but cannot run incrementally. Reason: %s', 'an output to this build has been altered since the last time the build was run') despite the output mediaset having no files on the branch I’m working on (the history tab on the dataset is empty).
"I don’t know why this happens with transactionless media sets, and I switched to transactional media sets for this exact reason."
I asked AIP about this and it gave me an answer that sort of made sense:
Transactionless mediasets are effectively unusable as outputs of incremental transforms on branches in Foundry. This is because, on a new branch, the first build of an incremental transform requires a snapshot, but transactionless mediasets do not support snapshotting; they only support the “modify” write mode. As a result, any attempt to use a transactionless mediaset as an output in this scenario will fail. For other use cases (such as non-incremental transforms or direct uploads), mediasets may still work on branches, but for incremental transforms, transactionless mediasets are not supported on branches due to this limitation.
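To make that explanation concrete, here is a toy model in plain Python of the conflict it describes. It only encodes the reasoning in the AIP answer, which nobody in this thread has confirmed as the actual behaviour. It does not call any Foundry API, and every name in it (`WriteMode`, `required_write_mode`, `supported_write_modes`, `can_build`) is hypothetical.

```python
from enum import Enum
from typing import Set


class WriteMode(Enum):
    SNAPSHOT = "snapshot"
    MODIFY = "modify"


def required_write_mode(has_prior_build_on_branch: bool) -> WriteMode:
    """Assumption from the AIP answer: the first build of an incremental
    transform on a branch must snapshot; later builds can modify."""
    if has_prior_build_on_branch:
        return WriteMode.MODIFY
    return WriteMode.SNAPSHOT


def supported_write_modes(transactional: bool) -> Set[WriteMode]:
    """Assumption from the AIP answer: transactionless media sets only
    support the modify write mode."""
    if transactional:
        return {WriteMode.SNAPSHOT, WriteMode.MODIFY}
    return {WriteMode.MODIFY}


def can_build(transactional: bool, has_prior_build_on_branch: bool) -> bool:
    """A build is possible only if the required mode is supported."""
    required = required_write_mode(has_prior_build_on_branch)
    return required in supported_write_modes(transactional)
```

Under these assumptions, `can_build(transactional=False, has_prior_build_on_branch=False)` is the only failing combination. That matches the two errors from the original post: the build demands a snapshot, and the media set refuses to take one.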