Optimizing PDF Transformation Workflow: How to Process Only the Latest Uploaded Document in a Media Set?

Joel · March 3, 2025, 4:42am

Following up on this: is there a way to build a media set incrementally? I tried adding @incremental() to the transform below and received an error, and I’m not sure whether there’s another method I should use instead:

Full code:

from transforms.api import transform, Input, incremental  # configure, ComputeBackend
from transforms.mediasets import MediaSetOutput


# @configure(profile=["EXECUTOR_MEMORY_MEDIUM", "EXECUTOR_MEMORY_OFFHEAP_FRACTION_HIGH"], backend=ComputeBackend.VELOX)
@incremental()
@transform(
    pdfs_dataset=Input("ri.foundry.main.dataset.acffb422-8df5-43a4-b26e-541ba8965715"),
    pdfs_media_set=MediaSetOutput(
        "ri.mio.main.media-set.6c60b58f-1186-4e54-bbdd-8fe57c6b569f"
    ),
)
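def upload_to_media_set(pdfs_dataset, pdfs_media_set):
    # Copy the files from the input dataset into the media set.
    # ignore_items_not_matching_schema=False: do not silently skip files
    # that fail to match the media set's schema.
    pdfs_media_set.put_dataset_files(
        pdfs_dataset, ignore_items_not_matching_schema=False
    )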
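
To make the goal concrete, the behavior being asked for (upload only files that have not been uploaded before) can be sketched in plain Python. This is an illustration of the intended semantics only, not the Foundry API: `select_new_files`, `incremental_upload`, and the `upload` callback are hypothetical names, and the set of already-uploaded paths stands in for whatever state an incremental build would track.

```python
from typing import Callable, Iterable, List, Set


def select_new_files(current_paths: Iterable[str], already_uploaded: Set[str]) -> List[str]:
    """Return paths that are in the dataset but not yet uploaded, sorted."""
    return sorted(p for p in set(current_paths) if p not in already_uploaded)


def incremental_upload(
    current_paths: Iterable[str],
    already_uploaded: Set[str],
    upload: Callable[[str], None],  # hypothetical stand-in for the real upload call
) -> List[str]:
    """Upload only the new files and record them as uploaded."""
    new_paths = select_new_files(current_paths, already_uploaded)
    for path in new_paths:
        upload(path)
        already_uploaded.add(path)
    return new_paths


if __name__ == "__main__":
    uploaded: Set[str] = set()
    sent: List[str] = []
    # First run: everything is new.
    incremental_upload(["a.pdf", "b.pdf"], uploaded, sent.append)
    # Second run: only the newly added file is processed.
    incremental_upload(["a.pdf", "b.pdf", "c.pdf"], uploaded, sent.append)
    print(sent)
```

On the second run only `c.pdf` is passed to `upload`, which is the "process only the latest uploaded document" behavior from the thread title.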

Thanks,
Joel