An incremental, schemaless dataset of PDFs can be converted incrementally into a media set in a code repository. I posted my code here (https://community.palantir.com/t/need-help-with-parallelization-when-using-filesystem/3238/8?u=joel). I wrote it for an incremental dataset that already held 4M PDFs (and counting), which had to be converted incrementally to a media set and then have its text extracted in Pipeline Builder. Pipeline Builder doesn't technically support incremental media sets, but configuring the text extraction transform to skip recomputing rows worked, using raw text extraction on 10K PDFs per build.
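As a rough illustration of the batching pattern only, here is a minimal plain-Python sketch. It does not use the Foundry transforms or media set APIs, and it is not the code from the linked post. `select_batch`, `run_build`, `already_converted`, and `batch_size` are hypothetical names I made up for the example. The idea is that each build picks up only PDFs that an earlier build has not handled, capped at a fixed batch size (10K in my case).

```python
from typing import Iterable, List, Set


def select_batch(
    all_paths: Iterable[str],
    already_converted: Set[str],
    batch_size: int = 10_000,
) -> List[str]:
    """Return up to batch_size PDF paths that have not been converted yet."""
    batch: List[str] = []
    for path in all_paths:
        if len(batch) >= batch_size:
            break  # cap the amount of work done in a single build
        if path in already_converted:
            continue  # skip files handled by an earlier build
        if not path.lower().endswith(".pdf"):
            continue  # ignore anything that is not a PDF
        batch.append(path)
    return batch


def run_build(
    all_paths: Iterable[str],
    already_converted: Set[str],
    batch_size: int = 10_000,
) -> List[str]:
    """Simulate one build: pick a batch, then record it as converted."""
    batch = select_batch(all_paths, already_converted, batch_size)
    # In a real pipeline this is where each file would be written to the media set.
    already_converted.update(batch)
    return batch
```

Calling `run_build` repeatedly with the same `already_converted` set works through the backlog one batch at a time and returns an empty list once nothing new is left.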