Background
- We are uploading a large number of files to a media set: about 20K pages across 150 PDF documents, with many more to come.
- Our pipeline converts the files to text (with OCR where necessary), paginates, chunks, creates embeddings, and then builds a dataset.
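For reference, the chunking stage can be sketched as a simple token splitter. This is only an illustration: the `paginate_and_chunk` helper and naive whitespace tokenization are our assumptions, not the platform's actual implementation, whose tokenizer and overlap behavior we can't see.

```python
def paginate_and_chunk(pages: list[str], chunk_size: int) -> list[str]:
    """Naively split page text into fixed-size token chunks.

    chunk_size is measured in whitespace tokens here, loosely
    mirroring the chunk-size settings we tested.
    """
    tokens = [tok for page in pages for tok in page.split()]
    return [" ".join(tokens[i:i + chunk_size])
            for i in range(0, len(tokens), chunk_size)]
```

Each chunk then gets its own embedding, so a smaller chunk size means more chunks and more downstream work per document.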
Situation
- We uploaded the files to the media set and started the pipeline.
- Our original chunk size was 512. After about 20 minutes, we got a "Module (i.e. driver) ran out of memory" error. Chunk sizes of 725 and 861 failed the same way.
- When we increased the chunk size to 1025, the pipeline completed in about 51 minutes.
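Our working theory for why the smaller chunk sizes fail: the number of chunks (and hence the number of tasks the driver must track) scales inversely with chunk size. A back-of-the-envelope estimate, assuming (hypothetically, not measured) about 500 tokens per OCR'd page:

```python
PAGES = 20_000          # corpus size from our upload
TOKENS_PER_PAGE = 500   # assumption, not measured

def chunk_count(chunk_size: int) -> int:
    """Ceiling of total corpus tokens divided by chunk size."""
    total_tokens = PAGES * TOKENS_PER_PAGE
    return -(-total_tokens // chunk_size)  # ceiling division

for size in (512, 725, 861, 1025):
    print(size, chunk_count(size))
# a chunk size of 512 yields roughly twice as many chunks as 1025
```

If driver memory grows with task count, that would explain why 512, 725, and 861 run out of memory while 1025 completes.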
What we’ve also tested
- A large document - The largest document is about 400 pages; it ran in about 2 minutes when it was the only file in the dataset.
- 10 large documents - Ten 400-page documents in the same dataset also ran in about 2 minutes, which probably indicates some parallel processing.
We are hoping to learn the platform's capabilities and how to avoid these errors:
- The error message says "having a large number of tasks" could cause this problem. Is Palantir trying to process too many documents at once, and is there a way to control this?
- The error message also suggests increasing driver memory by modifying the Spark profile. I'm not sure we have access to that, as we are on Envision (the Palantir-managed military platform). Even so, we'd like to understand why this is happening, since we'll be uploading many more files.
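For reference, if Spark profile configuration is available to us, we understand it would look roughly like the following in a Foundry Python transform. This is a sketch only: the `DRIVER_MEMORY_LARGE` profile name and the dataset paths are assumptions, and the set of importable profiles is controlled by platform administrators and may differ on Envision.

```python
from transforms.api import configure, transform_df, Input, Output

# DRIVER_MEMORY_LARGE is an assumed profile name; available
# profiles are admin-controlled and may differ on Envision.
@configure(profile=["DRIVER_MEMORY_LARGE"])
@transform_df(
    Output("/Project/datasets/chunked_output"),  # hypothetical paths
    source=Input("/Project/media/pdf_media_set"),
)
def compute(source):
    return source
```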
- How fast can we expect processing to be? (We assume Palantir caps the total available resources in some way.)
- If we exceed the limits, will processing just be throttled? Can we expect it to eventually finish, or will it time out first?
- Is there a way to see progress details? There are some reports, but they are not as granular as we'd like about which step is taking time and the progress within each step. We also can't see which documents are being processed. (We're thinking of the UIs on document upload/download apps that show which files, and how many bytes of each, have been processed.)