I am developing a workshop where users can upload PDF documents, which are then processed and transformed in the backend. Currently, when a PDF is uploaded, it is placed into a media set, and the entire media set undergoes transformation. This approach is both time-consuming and resource-intensive.
I am seeking advice on how to modify this workflow so that only the most recently uploaded PDF is transformed, rather than reprocessing the entire media set. Given that media sets do not support incremental pipeline builds, I am looking for suggestions on how to achieve more efficient, incremental processing.
Hey @etuhabonye, the media set team will be working on adding incremental media set support in the coming weeks.
In the meantime, you could potentially upload the new sets of documents separately and union those together. I'll also let the media set team comment on the above and see if there is another workaround they'd prefer.
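As a rough illustration of the union idea: if each batch of PDFs is uploaded separately and extracted into its own tabular dataset, the downstream union could look something like this sketch (all paths and names are placeholders, not anything from this thread):

```python
from transforms.api import transform_df, Input, Output


# Illustrative only: each separately uploaded batch is assumed to have been
# extracted into its own dataset; the union happens downstream of extraction.
@transform_df(
    Output("/Project/extracted/all_documents"),   # placeholder path
    batch_1=Input("/Project/extracted/batch_1"),  # placeholder path
    batch_2=Input("/Project/extracted/batch_2"),  # placeholder path
)
def union_batches(batch_1, batch_2):
    # unionByName aligns columns by name, so the two extracts don't need to
    # share column ordering
    return batch_1.unionByName(batch_2)
```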
Following up on this, is there a way to incrementally build a media set? I just gave it a try and received this error, but I’m not sure if there’s another method I should try:
That worked on a small test input dataset. Now I need to find a solution for my larger dataset. (I'm testing next on 20K PDFs; the full dataset is ~4M PDFs.) The code needs to be updated because the larger input exceeds the 10K-items-per-transaction limit:
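The general shape I'm considering is to split the input into groups of at most 10K items and write each group in its own transaction or run; how exactly each group maps onto a media set transaction is something I still need to confirm. A minimal chunking helper (plain Python, made-up paths):

```python
from itertools import islice


def chunked(items, size=10_000):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch


# Example with made-up paths: 20,345 PDFs split into groups of
# 10,000 / 10,000 / 345, each group intended for its own transaction or build.
paths = [f"docs/file_{i}.pdf" for i in range(20_345)]
for i, batch in enumerate(chunked(paths)):
    print(f"group {i}: {len(batch)} items")
```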
Update: more progress. After initializing a new transactionless media set, the incremental code was consistently failing with the "Media set output should be snapshotted" error. The fix was to run the non-incremental code below once and then run the incremental code. I am new to incremental builds, and I don't know why I had to run it non-incrementally first.
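In simplified form, the one-off snapshot run is just the upload transform without the @incremental decorator. This is a sketch rather than my exact code: the paths are placeholders, and the exact write method on the media set output may differ from what I show here.

```python
from transforms.api import transform, Input
from transforms.mediasets import MediaSetOutput


# One-off snapshot build: same upload logic, but without @incremental, so the
# media set output gets an initial snapshot transaction.
@transform(
    pdf_files=Input("/Project/raw/pdfs"),                    # placeholder path
    output_media_set=MediaSetOutput("/Project/media/pdfs"),  # placeholder path
)
def snapshot_upload(pdf_files, output_media_set):
    fs = pdf_files.filesystem()
    for status in fs.ls(glob="**/*.pdf"):
        with fs.open(status.path, "rb") as src:
            # exact write call on MediaSetOutput may differ; this is the shape
            output_media_set.put_media_item(src, status.path)
```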
Next, I'll try incrementally adding only 100K of the input's 4M PDFs, using an approach similar to this. Please comment if you have any ideas.
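Concretely, something along these lines, which only takes the first 100K PDFs per incremental run (again a sketch with placeholder paths; whether the remaining files get listed again on the next run under incremental semantics is something I still need to verify):

```python
from itertools import islice

from transforms.api import transform, incremental, Input
from transforms.mediasets import MediaSetOutput

RUN_LIMIT = 100_000  # only add this many PDFs per incremental run


@incremental()
@transform(
    pdf_files=Input("/Project/raw/pdfs"),                    # placeholder path
    output_media_set=MediaSetOutput("/Project/media/pdfs"),  # placeholder path
)
def upload_subset(pdf_files, output_media_set):
    fs = pdf_files.filesystem()
    # Take at most RUN_LIMIT of the listed PDFs in this run; leftovers are
    # meant to be handled by later runs (incremental semantics to be confirmed).
    for status in islice(fs.ls(glob="**/*.pdf"), RUN_LIMIT):
        with fs.open(status.path, "rb") as src:
            # exact write call on MediaSetOutput may differ; this is the shape
            output_media_set.put_media_item(src, status.path)
```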