Incremental Pipeline w/ Media Sets and Catalog Datasets Limitation

I am appending PDF files from a SharePoint site through my file sync. The data ingested is a list of file paths (no schema), and I get several hundred of these a day. Converting these to media sets incrementally doesn’t seem like it can be done.

There needs to be some piece of metadata, such as an “ingest date”, to allow these PDF file paths to be converted to a media set. In addition, this can only be done in a code repo, so I end up bouncing back and forth between Pipeline Builder and the code repo.

To work around this, I decided to snapshot new records based on these filters (below) in my data sync. I am now missing data, but it is enough to get a general feel for how my use case should work. The sync should also offer a filter relative to the current date, because with this workaround I will need to change the “Last Modified After” value at least once a month.
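To illustrate the kind of filter being asked for, here is a minimal sketch of a “Last Modified After” cutoff computed relative to the current date instead of being hard-coded. This is plain Python, not a Foundry sync setting; the names `rolling_cutoff`, `modified_after`, the `last_modified` field, and the 30-day window are all assumptions made for the example.

```python
from datetime import date, timedelta


def rolling_cutoff(today: date, window_days: int = 30) -> date:
    """Return the date window_days before today (hypothetical helper)."""
    return today - timedelta(days=window_days)


def modified_after(records: list[dict], cutoff: date) -> list[dict]:
    """Keep only records whose last_modified date is strictly after cutoff."""
    return [r for r in records if r["last_modified"] > cutoff]


# Hypothetical file listing; a real sync would supply these values.
records = [
    {"path": "reports/a.pdf", "last_modified": date(2025, 2, 1)},
    {"path": "reports/b.pdf", "last_modified": date(2025, 3, 1)},
]

cutoff = rolling_cutoff(date(2025, 3, 15), window_days=30)
recent = modified_after(records, cutoff)
```

Because the cutoff moves with the current date, nothing would need to be edited by hand each month.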

Hi @mfannin, are you syncing your PDF files into a schema-less dataset because media set syncs are currently not supported for SharePoint sources? We are actively working on supporting media set syncs on more sources (including SharePoint), which should address the feature gap you’re experiencing.

I think that would work to get me past the first issue. That would give me a media set input and output in a repo.

But then, to extract text from the PDFs (using OCR), I was running into issues in Pipeline Builder trying to switch that to incremental. Can that only be done in a code repo?

Yes, we don’t support incremental transforms in pipeline builder yet (it’s in the works but not GA’ed). Currently you can only run incremental transforms in Code Repositories.

An incremental, schemaless dataset of PDFs can be incrementally transformed into a media set in a code repo. I pasted my code here (https://community.palantir.com/t/need-help-with-parallelization-when-using-filesystem/3238/8?u=joel). I wrote it when an incremental dataset that already had 4M PDFs and counting needed to be converted incrementally to a media set, with the text then extracted in Pipeline Builder. The Pipeline Builder step doesn’t technically support incremental media sets, but setting the text extraction transform to skip recomputing rows worked, using raw text extraction on 10K PDFs per build.
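The idea of handling only new PDFs in capped batches can be sketched as follows. This is not the code from the linked post and it does not use the Foundry transforms API. It is a plain-Python illustration in which `select_new_batch`, `run_build`, and the in-memory `processed` set are hypothetical stand-ins for whatever state the real incremental transform tracks. The 10K cap mirrors the per-build figure mentioned above.

```python
def select_new_batch(all_paths, processed, batch_size=10_000):
    """Return paths not yet processed, in order, capped at batch_size."""
    batch = []
    for path in all_paths:
        if len(batch) >= batch_size:
            break  # cap reached; the rest waits for the next build
        if path not in processed:
            batch.append(path)
    return batch


def run_build(all_paths, processed, batch_size=10_000):
    """Simulate one build: pick the next batch and record it as processed."""
    batch = select_new_batch(all_paths, processed, batch_size)
    processed.update(batch)  # later builds skip these paths
    return batch


# Hypothetical listing of an append-only dataset of PDF paths.
paths = ["a.pdf", "b.pdf", "c.pdf", "d.pdf", "e.pdf"]
processed = {"a.pdf"}  # already converted in an earlier build

first = run_build(paths, processed, batch_size=2)
second = run_build(paths, processed, batch_size=2)
third = run_build(paths, processed, batch_size=2)
```

Each simulated build touches only files it has not seen before, which is the same effect as skipping recomputed rows in the text extraction step.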


I will take a look. Thank you very much!