I need to process a set of PDFs for LLM workflows. The PDFs are organized in a hierarchical folder structure. Can I sync all of these to the same media set? I believe there are limits on how many media items can exist in a media set. If so, what is the best way to ingest these PDFs?
Yes, you can sync all of these to the same media set. The only requirement for items uploaded to a media set is that they all conform to the same schema, which in this case means that they are all PDFs.
The best way to get items into a media set is via data connection. There is some documentation here that should walk you through that.
There is no limit to the number of items that you can upload to a media set, but there is a limit of 10,000 items that can be uploaded within a single transaction. If you go the data connection route, this will be handled for you. If you end up implementing your own solution to upload items, then you will have to split your uploads so that no single transaction exceeds that limit.
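If you do write your own uploader, the batching could look something like the minimal sketch below. It only shows how to walk a nested folder and group the PDFs so that no batch exceeds 10,000 items. The `upload_batch` callback is hypothetical: it stands in for whatever code opens a transaction, uploads the files, and commits, and it is not a real Foundry API. The helper names `find_pdfs`, `batched`, and `upload_all` are also made up for illustration.

```python
from itertools import islice
from pathlib import Path
from typing import Callable, Iterable, Iterator, List

# Per-transaction limit quoted in the answer above.
MAX_ITEMS_PER_TRANSACTION = 10_000


def find_pdfs(root: Path) -> Iterator[Path]:
    """Yield every PDF under root, however deeply nested, in sorted order."""
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() == ".pdf":
            yield path


def batched(items: Iterable[Path], size: int) -> Iterator[List[Path]]:
    """Yield consecutive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def upload_all(
    root: Path,
    upload_batch: Callable[[List[Path]], None],
    batch_size: int = MAX_ITEMS_PER_TRANSACTION,
) -> int:
    """Call upload_batch once per batch and return the number of batches.

    upload_batch is a hypothetical callback (an assumption, not a Foundry
    API) that should upload one batch inside a single transaction.
    """
    batch_count = 0
    for batch in batched(find_pdfs(root), batch_size):
        upload_batch(batch)
        batch_count += 1
    return batch_count
```

With 25,000 PDFs and the default batch size, `upload_all` would call the callback three times, with 10,000, 10,000, and 5,000 files.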
This document doesn’t say how to set up the “Use a regular expression to match file paths” option.
I have a bucket called “databucket” where the PDFs are in /Data/PDF. How do I create a sync job?
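As a starting point for the regular expression itself, here is a minimal sketch of a pattern that matches PDFs anywhere under Data/PDF, checked locally with Python's `re` module. Several things here are assumptions and not confirmed by the documentation: that the sync matches the pattern against the whole path from the bucket root (if a subfolder is configured on the sync, the path may be relative to that subfolder instead), that the path may or may not start with a slash, and that the regex dialect used by the sync behaves like Python's for this simple pattern. The sample paths are invented for testing.

```python
import re

# Assumed pattern: optional leading slash, then Data/PDF/, then anything
# (including nested folders), ending in .pdf. Case-insensitive so that
# files named .PDF also match.
PDF_PATH_PATTERN = re.compile(r"/?Data/PDF/.*\.pdf", re.IGNORECASE)


def matches_pdf_path(path: str) -> bool:
    """Return True if the whole path matches the assumed pattern."""
    return PDF_PATH_PATTERN.fullmatch(path) is not None


if __name__ == "__main__":
    # Invented sample paths, used only to sanity-check the pattern.
    samples = [
        "Data/PDF/report.pdf",
        "/Data/PDF/2023/q1/summary.PDF",
        "Data/Other/report.pdf",
        "Data/PDF/notes.txt",
    ]
    for sample in samples:
        print(sample, matches_pdf_path(sample))
```

Testing the pattern against a handful of real paths from the bucket before saving the sync is a cheap way to catch a wrong assumption about the leading slash or the subfolder.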