I’m on a developer instance, trying to extract text from a PDF using Layout Aware mode. Most attempts fail with:
Query failed to complete successfully: {jobId=cebfbe1b-12b9-4862-9a27-104fcc55c23f, errorInstanceId=aa9bb062-7705-4f95-b8b7-1934a0f39e57, errorCode=500, errorName=Default:Internal, causeMessage=org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 390.0 failed 4 times, most recent failure: Lost task 0.3 in stage 390.0 (TID 577) (10.0.120.44 executor 1): com.google.common.util.concurrent.UncheckedExecutionException: com.palantir.conjure.java.api.errors.RemoteException: RemoteException: INTERNAL (Transformation:TransformationInternalError) with instance ID aa9bb062-7705-4f95-b8b7-1934a0f39e57
at com.google.common.util.concurrent.Futures.getUnchecked(Futures.java:1383)
at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
at com.palantir.common.streams.BufferingSpliterator.tryAdvance(BufferingSpliterator.java:57)
at java.base/java.util.stream.StreamSpliterators$WrappingSpliterator.lambda$initPartialTraversalState$0(StreamSpliterators.java:292)
at java.base/java.util.stream.StreamSpliterators$AbstractWrappingSpliterator.fillBuffer(StreamSpliterators.java:206)
at java.base/java.util.stream.StreamSpliterators$AbstractWrappingSpliterator.doAdvance(StreamSpliterators.java:161)
at java.base/java.util.stream.StreamSpliterators$WrappingSpliterator.tryAdvance(StreamSpliterators.java:298)
at java.base/java.util.Spliterators$1Adapter.hasNext(Spliterators.java:681)
at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:893)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:893)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
at org.apache.spark.scheduler.Task.run(Task.scala:141)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$2(Executor.scala:633)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:97)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:636)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1583)
I’ve had ONE success: extracting a single page from the PDF returned text, but I can’t get multiple pages to work. Is this a resource constraint issue? I’m only trying to prove out a workflow, and the errors are too generic to tell me how to fix the problem.
Plain “OCR” mode works, but I need layout-aware extraction because the content has lots of callouts, sidebars, and other visuals that don’t follow a simple reading order.
Hi @kevcam4891, thank you for reporting the error. To help us investigate, could you retrigger the error to get a new errorInstanceId, and share the exact timestamp at which it occurred?
Query failed to complete successfully: {jobId=1cb9ef2e-e22d-43c0-8493-dcf8f13f1fae, errorInstanceId=3df168c6-5489-403d-a10a-e270378729a4, errorCode=500, errorName=Default:Internal, causeMessage=org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 17.0 failed 4 times, most recent failure: Lost task 0.3 in stage 17.0 (TID 23) (10.0.164.54 executor 2): com.google.common.util.concurrent.UncheckedExecutionException: com.palantir.conjure.java.api.errors.RemoteException: RemoteException: INTERNAL (Transformation:TransformationInternalError) with instance ID 3df168c6-5489-403d-a10a-e270378729a4
at com.google.common.util.concurrent.Futures.getUnchecked(Futures.java:1383)
at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
at com.palantir.common.streams.BufferingSpliterator.tryAdvance(BufferingSpliterator.java:57)
at java.base/java.util.stream.StreamSpliterators$WrappingSpliterator.lambda$initPartialTraversalState$0(StreamSpliterators.java:292)
at java.base/java.util.stream.StreamSpliterators$AbstractWrappingSpliterator.fillBuffer(StreamSpliterators.java:206)
at java.base/java.util.stream.StreamSpliterators$AbstractWrappingSpliterator.doAdvance(StreamSpliterators.java:161)
at java.base/java.util.stream.StreamSpliterators$WrappingSpliterator.tryAdvance(StreamSpliterators.java:298)
at java.base/java.util.Spliterators$1Adapter.hasNext(Spliterators.java:681)
at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:893)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:893)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
at org.apache.spark.scheduler.Task.run(Task.scala:141)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$2(Executor.scala:633)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:97)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:636)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1583)
Caused by: com.palantir.conjure.java.api.errors.RemoteException: RemoteException: INTERNAL (Transformation:TransformationInternalError) with instance ID 3df168c6-5489-403d-a10a-e270378729a4
at com.palantir.conjure.java.dialogue.serde.ExceptionDeserializingErrorDecoder.createRemoteException(ExceptionDeserializingErrorDecoder.java:194)
at com.palantir.conjure.java.dialogue.serde.ExceptionDeserializingErrorDecoder.decodeInternal(ExceptionDeserializingErrorDecoder.java:168)
at com.palantir.conjure.java.dialogue.serde.ExceptionDeserializingErrorDecoder.decode(ExceptionDeserializingErrorDecoder.java:103)
at com.palantir.conjure.java.dialogue.serde.ErrorDecoder.decode(ErrorDecoder.java:40)
at com.palantir.dialogue.annotations.ConjureErrorDecoder.decode(ConjureErrorDecoder.java:30)
at com.palantir.dialogue.annotations.ErrorHandlingDeserializerFactory$1.deserialize(ErrorHandlingDeserializerFactory.java:47)
at com.palantir.dialogue.futures.DialogueDirectTransformationFuture.onSuccess(DialogueDirectTransformationFuture.java:109)
at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1132)
at com.palantir.dialogue.futures.SafeDirectExecutor.execute(SafeDirectExecutor.java:32)
at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1004)
at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:767)
at com.google.common.util.concurrent.AbstractFuture.set(AbstractFuture.java:491)
at com.google.common.util.concurrent.SettableFuture.set(SettableFuture.java:48)
at com.palantir.dialogue.futures.DialogueDirectTransformationFuture.onSuccess(DialogueDirectTransformationFuture.java:110)
at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1132)
at com.palantir.dialogue.futures.SafeDirectExecutor.execute(SafeDirectExecutor.java:32)
at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1004)
at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:767)
at com.google.common.util.concurrent.AbstractFuture.set(AbstractFuture.java:491)
at com.google.common.util.concurrent.AbstractCatchingFuture.run(AbstractCatchingFuture.java:122)
at com.palantir.dialogue.futures.SafeDirectExecutor.execute(SafeDirectExecutor.java:32)
at com.google.common.util.concurrent.MoreExecutors.lambda$rejectionPropagatingExecutor$0(MoreExecutors.java:1063)
at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1004)
at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:767)
at com.google.common.util.concurrent.AbstractFuture.setFuture(AbstractFuture.java:560)
at com.google.common.util.concurrent.AbstractTransformFuture$AsyncTransformFuture.setResult(AbstractTransformFuture.java:237)
at com.google.common.util.concurrent.AbstractTransformFuture$AsyncTransformFuture.setResult(AbstractTransformFuture.java:213)
at com.google.common.util.concurrent.AbstractTransformFuture.run(AbstractTransformFuture.java:172)
at com.palantir.dialogue.futures.SafeDirectExecutor.execute(SafeDirectExecutor.java:32)
at com.google.common.util.concurrent.MoreExecutors.lambda$rejectionPropagatingExecutor$0(MoreExecutors.java:1063)
at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1004)
at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:767)
at com.google.common.util.concurrent.AbstractFuture.set(AbstractFuture.java:491)
at com.google.common.util.concurrent.SettableFuture.set(SettableFuture.java:48)
at com.palantir.dialogue.blocking.BlockingChannelAdapter$BlockingChannelAdapterChannel$BlockingChannelAdapterTask.run(BlockingChannelAdapter.java:141)
at com.palantir.tracing.Tracers$TracingAwareRunnable.run(Tracers.java:617)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at com.palantir.tritium.metrics.TaggedMetricsThreadFactory$InstrumentedTask.run(TaggedMetricsThreadFactory.java:94)
... 1 more
Suppressed: com.palantir.conjure.java.dialogue.serde.ExceptionDeserializingErrorDecoder$ResponseDiagnostic: Response Diagnostic Information: {status=500, Server=envoy, Content-Type=application/json, Content-Length=154, Date=Sun, 23 Nov 2025 18:26:00 GMT, Response-Flags=-, Response-Code-Details=via_upstream}
According to the logs, this error occurred because the document extraction model has a 16 MB limit on the size of input media. You can work around it by reducing the size of your media items or by splitting each PDF into multiple media items.
We will look into improving the error reporting to show a more helpful error message.
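To make the splitting workaround concrete, here is a minimal sketch, not Foundry-specific, that greedily groups consecutive pages into chunks staying under the limit. The 16 MB figure comes from this thread; the function name and chunking strategy are my own assumptions, not part of any Foundry API.

```python
# Greedy chunking sketch: given per-page sizes in bytes, group consecutive
# pages into chunks whose combined size stays under the 16 MB input-media
# limit mentioned above. Pages that exceed the limit on their own are
# flagged so they can be compressed (e.g. by downsampling images) first.
LIMIT_BYTES = 16 * 1024 * 1024  # 16 MB limit reported in this thread

def chunk_pages(page_sizes, limit=LIMIT_BYTES):
    """Return (chunks, oversized): chunks is a list of page-index lists,
    oversized lists pages that cannot fit under the limit even alone."""
    chunks, oversized = [], []
    current, current_size = [], 0
    for i, size in enumerate(page_sizes):
        if size > limit:                   # e.g. a ~20 MB cover page
            oversized.append(i)
            continue
        if current and current_size + size > limit:
            chunks.append(current)         # close the current chunk
            current, current_size = [], 0
        current.append(i)
        current_size += size
    if current:
        chunks.append(current)
    return chunks, oversized

MB = 1024 * 1024
sizes = [20 * MB, 3 * MB, 2 * MB, 3 * MB, 12 * MB, 2 * MB]
chunks, oversized = chunk_pages(sizes)
print(chunks)     # → [[1, 2, 3], [4, 5]]
print(oversized)  # → [0]
```

In practice you would then split the actual PDF with a library such as pypdf and upload each chunk as its own media item; the function above only decides which pages go together.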
Thanks @yixunx! Bummer about the file size limit. Unfortunately I have cover pages that are around 20 MB apiece; most pages are in the 2–3 MB range, but some are doozies. I was hoping to use Foundry for extraction to keep everything in the same platform. I might be able to use a separate service like LlamaIndex to parse, then pull the results into the platform.