Background
- We are uploading a large number of files to a media set: about 20K pages across 150 PDF documents, with many more to come.
- Our pipeline converts the files to text (with OCR where necessary), paginates, chunks, creates embeddings, and then builds a dataset.
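For reference, the chunking stage can be sketched as a simple token splitter. This is only an illustration: the `paginate_and_chunk` helper and naive whitespace tokenization are our assumptions, not the platform's actual implementation, whose tokenizer and overlap behavior we can't see.

```python
def paginate_and_chunk(pages: list[str], chunk_size: int) -> list[str]:
    """Naively split page text into fixed-size token chunks.

    chunk_size is measured in whitespace tokens here, loosely
    mirroring the chunk-size settings we tested.
    """
    tokens = [tok for page in pages for tok in page.split()]
    return [" ".join(tokens[i:i + chunk_size])
            for i in range(0, len(tokens), chunk_size)]
```

Each chunk then gets its own embedding, so a smaller chunk size means more chunks and more downstream work per document.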
Situation
- We uploaded the files to the media set and started the pipeline.
- Our original chunk size was 512. After about 20 minutes, we got a "Module (i.e. driver) ran out of memory" error. Chunk sizes of 725 and 861 failed the same way.
- When we increased the chunk size to 1025, the pipeline completed in about 51 minutes.
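Our working theory for why the smaller chunk sizes fail: the number of chunks (and hence the number of tasks the driver must track) scales inversely with chunk size. A back-of-the-envelope estimate, assuming (hypothetically, not measured) about 500 tokens per OCR'd page:

```python
PAGES = 20_000          # corpus size from our upload
TOKENS_PER_PAGE = 500   # assumption, not measured

def chunk_count(chunk_size: int) -> int:
    """Ceiling of total corpus tokens divided by chunk size."""
    total_tokens = PAGES * TOKENS_PER_PAGE
    return -(-total_tokens // chunk_size)  # ceiling division

for size in (512, 725, 861, 1025):
    print(size, chunk_count(size))
# a chunk size of 512 yields roughly twice as many chunks as 1025
```

If driver memory grows with task count, that would explain why 512, 725, and 861 run out of memory while 1025 completes.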
What we’ve also tested
- A large document - The largest document is about 400 pages; it ran in about 2 minutes when it was the only file in the dataset.
- 10 large documents - Ten 400-page documents in the same dataset also ran in about 2 minutes, which probably indicates some parallel processing.
We are hoping to learn the platform's capabilities and how to avoid these errors:
- The error message says "having a large number of tasks" could cause this problem. Is Palantir trying to process too many documents at once, and is there a way to control this?
- The error message also suggests increasing driver memory by modifying the Spark profile. I'm not sure we have access to that, as we are on Envision (the Palantir-managed military platform). Even so, we'd like to understand why this is happening, since we'll be uploading many more files.
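For reference, if Spark profile configuration is available to us, we understand it would look roughly like the following in a Foundry Python transform. This is a sketch only: the `DRIVER_MEMORY_LARGE` profile name and the dataset paths are assumptions, and the set of importable profiles is controlled by platform administrators and may differ on Envision.

```python
from transforms.api import configure, transform_df, Input, Output

# DRIVER_MEMORY_LARGE is an assumed profile name; available
# profiles are admin-controlled and may differ on Envision.
@configure(profile=["DRIVER_MEMORY_LARGE"])
@transform_df(
    Output("/Project/datasets/chunked_output"),  # hypothetical paths
    source=Input("/Project/media/pdf_media_set"),
)
def compute(source):
    return source
```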
- How fast can we expect processing to be? (We assume Palantir caps the total available resources in some way.)
- If we exceed the limits, will processing just be throttled? Can we expect it to eventually finish, or will it time out first?
- Is there a way to see progress details? There are some reports, but they are not as granular as we'd like about which step is taking time and the progress within each step. We also can't see which documents are being processed. (We're thinking of the UIs on document upload/download apps that show which files, and how many bytes of each, have been processed.)