Optimizing PDF Transformation Workflow: How to Process Only the Latest Uploaded Document in a Media Set?

Update: more progress. After initializing a new transactionless media set, the incremental code was consistently failing with the “Media set output should be snapshotted” error. The fix was to run the non-incremental code below once, and then run the incremental code. I am new to incremental builds, and I don’t know why I had to run it non-incrementally first.

Up next, I’ll try to incrementally add only 100K of the input’s 4M PDFs. I’ll try an approach similar to this; a rough sketch is included after the incremental code below. Please comment if you have any ideas.

Non-incremental code:

from transforms.api import transform, Input  # , incremental  # configure, ComputeBackend
from transforms.mediasets import MediaSetOutput


# @configure(profile=["EXECUTOR_MEMORY_MEDIUM", "EXECUTOR_MEMORY_OFFHEAP_FRACTION_HIGH"], backend=ComputeBackend.VELOX)
# @incremental(v2_semantics=True)
@transform(
    pdfs_dataset=Input("{Test dataset with 1 PDF}"),
    pdfs_media_set=MediaSetOutput(
        "ri.mio.main.media-set.543a1b44-5041-4fa1-af79-9bcd26e20110"
        # , should_snapshot=False
    )
)
def upload_to_media_set(pdfs_dataset, pdfs_media_set):
    pdfs_media_set.put_dataset_files(
        pdfs_dataset, ignore_items_not_matching_schema=False
    )

Incremental code:

from transforms.api import transform, Input, incremental  # configure, ComputeBackend
from transforms.mediasets import MediaSetOutput


# @configure(profile=["EXECUTOR_MEMORY_MEDIUM", "EXECUTOR_MEMORY_OFFHEAP_FRACTION_HIGH"], backend=ComputeBackend.VELOX)
@incremental(v2_semantics=True)
@transform(
    pdfs_dataset=Input("{Test dataset with 10 PDFs}"),
    pdfs_media_set=MediaSetOutput(
        "ri.mio.main.media-set.543a1b44-5041-4fa1-af79-9bcd26e20110"
        , should_snapshot=False
    )
)
def upload_to_media_set(pdfs_dataset, pdfs_media_set):
    pdfs_media_set.put_dataset_files(
        pdfs_dataset, ignore_items_not_matching_schema=False
    )