PDF Processing (Document Split, Rotate Document Pages per Row) in Pipeline Builder

Jacob_SE · November 10, 2025, 6:33am

# Palantir Foundry Pipeline Builder - PDF Processing Questions

Question 1: PDF Document Splitting

I have a 300-page PDF that needs to be split into 6 separate documents (50 pages each).

Current Situation

I can do this in Code Repository using Python
I cannot find how to do this in Pipeline Builder

Example

Input: contract_document.pdf (300 pages)

Desired Output:

contract_part1.pdf (pages 1-50)
contract_part2.pdf (pages 51-100)
contract_part3.pdf (pages 101-150)
contract_part4.pdf (pages 151-200)
contract_part5.pdf (pages 201-250)
contract_part6.pdf (pages 251-300)

Question

Is there a built-in transform in Pipeline Builder to split PDFs by page ranges?

Question 2: Rotating PDF Pages Based on Row Data

I have a dataset where each row contains a PDF path and rotation degree.

Example Dataset

| document_id | pdf_path | page_number | rotation_degree |
| --- | --- | --- | --- |
| DOC001 | /files/scan1.pdf | 1 | 90 |
| DOC001 | /files/scan1.pdf | 2 | 0 |
| DOC001 | /files/scan1.pdf | 3 | 270 |
| DOC002 | /files/scan2.pdf | 1 | 180 |

Use Case

Scanned documents are often incorrectly rotated. I need to:

Read rotation degree from each row
Rotate the corresponding PDF page
Output corrected documents

Question

Can I rotate PDF pages dynamically based on row-level rotation values in Pipeline Builder?

Summary

I’m looking for:

PDF splitting functionality in Pipeline Builder
Dynamic page rotation based on metadata

I want to show our team that complex document processing can be done in Pipeline Builder without always using Code Repository.

Are these features available? Which transforms should I use?

VincentF · November 10, 2025, 7:44am

Problem 1.
There is a partial solution, though it may still push you to Code Repository for the time being.
If the ranges are static (which seems to be your case), you can use the “slice PDF” transform. However, you will hit a problem with unions, because you need multiple outputs (one for each range).

Note that if you want to split page by page (single pages rather than ranges), there is a dedicated transform.

In Code Repository, the equivalent code would be:

from transforms.api import transform
from transforms.mediasets import MediaSetInput, MediaSetOutput


@transform(
    media_input=MediaSetInput("/path/to/multi page"),
    media_out=MediaSetOutput("/path/to/single page pdf"),
)
def compute(ctx, media_input: MediaSetInput, media_out: MediaSetOutput):
    def split_pdf(media_item_rid):
        # Read the page count from the media item's metadata.
        metadata = media_input.get_media_item_metadata(media_item_rid)
        pages = metadata.document.pages
        if pages is None:
            return

        # Write each page out as its own single-page PDF.
        for page in range(pages):
            response = media_input.transform_media_item(
                media_item_rid,
                str(page),
                {
                    "type": "documentToDocument",
                    "documentToDocument": {
                        "encoding": {"type": "pdf", "pdf": {}},
                        "operation": {
                            "type": "slicePdfRange",
                            "slicePdfRange": {
                                "startPageInclusive": page,  # zero-indexed
                                "endPageExclusive": page + 1,
                            },
                        },
                    },
                },
            )
            media_out.put_media_item(response, str(page))

    # Replace with the RID of the media item to split.
    split_pdf("ri.mio.main.media-item.1234567...TOCHANGE")

This code can be adapted for your case of page ranges.
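
As a minimal sketch of that adaptation, the page ranges themselves can be computed in plain Python and then passed as `startPageInclusive` / `endPageExclusive` in place of `page` and `page + 1`. The helper name `page_ranges` and the `contract_partN.pdf` output names are illustrative assumptions, not part of any Foundry API.

```python
def page_ranges(total_pages, pages_per_part):
    """Return zero-indexed (start_inclusive, end_exclusive) page ranges.

    Hypothetical helper: the last range is shortened when total_pages
    is not a multiple of pages_per_part.
    """
    if total_pages < 0 or pages_per_part <= 0:
        raise ValueError("total_pages must be >= 0 and pages_per_part > 0")
    return [
        (start, min(start + pages_per_part, total_pages))
        for start in range(0, total_pages, pages_per_part)
    ]


# 300 pages in parts of 50 pages, as in the question.
ranges = page_ranges(300, 50)

# Illustrative output names, numbered from 1 like the desired output.
part_names = [f"contract_part{i}.pdf" for i in range(1, len(ranges) + 1)]

for name, (start, end) in zip(part_names, ranges):
    # Pages are zero-indexed here, so (0, 50) covers pages 1-50 of the PDF.
    print(name, start, end)
```

Each `(start, end)` pair would correspond to one `slicePdfRange` call and one output item, for example with the part name as the output path.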

Problem 2.

I don’t think there is an option to rotate PDFs as of today. However, you can rotate images, so you can convert the PDF pages to images and then rotate them.

More broadly about rotation: depending on your use case, this might not even be necessary. For example, running a layout model on a document is already executed post-rotation.
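
Whichever rotation step is used, the row-level values from the question first need to be grouped per document. The following is a rough sketch in plain Python, assuming rows shaped like the example dataset. The function name `build_rotation_plan` is hypothetical, and the sketch does not perform any rotation or decide the rotation direction.

```python
from collections import defaultdict


def build_rotation_plan(rows):
    """Group row-level rotations into {pdf_path: {page_number: degrees}}.

    Hypothetical helper: pages with a rotation of 0 are skipped, since
    they need no correction.
    """
    plan = defaultdict(dict)
    for row in rows:
        degrees = int(row["rotation_degree"]) % 360
        if degrees % 90 != 0:
            raise ValueError(f"Unsupported rotation: {row['rotation_degree']}")
        if degrees == 0:
            continue
        plan[row["pdf_path"]][int(row["page_number"])] = degrees
    return dict(plan)


# Rows copied from the example dataset in the question.
rows = [
    {"document_id": "DOC001", "pdf_path": "/files/scan1.pdf", "page_number": 1, "rotation_degree": 90},
    {"document_id": "DOC001", "pdf_path": "/files/scan1.pdf", "page_number": 2, "rotation_degree": 0},
    {"document_id": "DOC001", "pdf_path": "/files/scan1.pdf", "page_number": 3, "rotation_degree": 270},
    {"document_id": "DOC002", "pdf_path": "/files/scan2.pdf", "page_number": 1, "rotation_degree": 180},
]

plan = build_rotation_plan(rows)
```

The resulting mapping lists only the pages that need rotating, which keeps the number of image conversions down when most scans are already upright.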

Jacob_SE · November 10, 2025, 8:21am

Thank you for your reply.

I hope this feature becomes available soon.
