PDF Processing (Document Split, Rotate Document per Row) in Pipeline Builder

# Palantir Foundry Pipeline Builder - PDF Processing Questions

Question 1: PDF Document Splitting

I have a 300-page PDF that needs to be split into 6 separate documents (50 pages each).

Current Situation

  • I can do this in Code Repository using Python
  • I cannot find how to do this in Pipeline Builder

Example

Input: contract_document.pdf (300 pages)

Desired Output:

  • contract_part1.pdf (pages 1-50)
  • contract_part2.pdf (pages 51-100)
  • contract_part3.pdf (pages 101-150)
  • contract_part4.pdf (pages 151-200)
  • contract_part5.pdf (pages 201-250)
  • contract_part6.pdf (pages 251-300)

Question

Is there a built-in transform in Pipeline Builder to split PDFs by page ranges?


Question 2: Rotating PDF Pages Based on Row Data

I have a dataset where each row contains a PDF path and rotation degree.

Example Dataset

document_id  pdf_path          page_number  rotation_degree
DOC001       /files/scan1.pdf  1            90
DOC001       /files/scan1.pdf  2            0
DOC001       /files/scan1.pdf  3            270
DOC002       /files/scan2.pdf  1            180

Use Case

Scanned documents are often incorrectly rotated. I need to:

  1. Read rotation degree from each row
  2. Rotate the corresponding PDF page
  3. Output corrected documents

Question

Can I rotate PDF pages dynamically based on row-level rotation values in Pipeline Builder?


Summary

I’m looking for:

  1. PDF splitting functionality in Pipeline Builder
  2. Dynamic page rotation based on metadata

I want to show our team that complex document processing can be done in Pipeline Builder without always using Code Repository.

Are these features available? Which transforms should I use?

Problem 1.
There is a partial solution, though it might still push you to Code Repository for the time being.
If the ranges are static (which seems to be your case), you can use the “slice PDF” transform. You will, however, hit a problem with unions, since you will need multiple outputs (one for each range).

Note that if you want to split page by page (a single page per output rather than a range), there is a dedicated transform.

In Code Repository, the equivalent code would be:

from transforms.api import transform
from transforms.mediasets import MediaSetInput, MediaSetOutput


@transform(
    media_input=MediaSetInput("/path/to/multi page"),
    media_out=MediaSetOutput("/path/to/single page pdf"),
)
def compute(ctx, media_input: MediaSetInput, media_out: MediaSetOutput):
    def split_pdf(media_item_rid):
        # Read the page count from the media item's metadata
        metadata = media_input.get_media_item_metadata(media_item_rid)
        pages = metadata.document.pages
        if pages is None:
            return

        # Emit one single-page PDF per page of the input document
        for page in range(pages):
            response = media_input.transform_media_item(
                media_item_rid,
                str(page),
                {
                    "type": "documentToDocument",
                    "documentToDocument": {
                        "encoding": {"type": "pdf", "pdf": {}},
                        "operation": {
                            "type": "slicePdfRange",
                            "slicePdfRange": {
                                "startPageInclusive": page,  # zero indexed
                                "endPageExclusive": page + 1,
                            },
                        },
                    },
                },
            )
            media_out.put_media_item(response, str(page))

    # Placeholder RID: replace with the RID of your media item
    split_pdf("ri.mio.main.media-item.1234567...TOCHANGE")

This code can be adapted for your case of page ranges.
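
As a rough sketch of that adaptation, the helper below computes the zero-indexed page ranges for fixed-size parts. The name `page_ranges` and the output file names are only illustrative assumptions. Each pair could then be passed as `startPageInclusive` / `endPageExclusive` in the `slicePdfRange` operation above, in place of `page` and `page + 1`.

```python
def page_ranges(total_pages, pages_per_part):
    """Return (start_inclusive, end_exclusive) pairs, zero-indexed.

    Hypothetical helper, not part of any Foundry API.
    """
    if total_pages < 0 or pages_per_part <= 0:
        raise ValueError("total_pages must be >= 0 and pages_per_part > 0")
    return [
        (start, min(start + pages_per_part, total_pages))
        for start in range(0, total_pages, pages_per_part)
    ]


# The 300-page example from the question, 50 pages per part
ranges = page_ranges(300, 50)

# Illustrative output names following the naming used in the question
names = [f"contract_part{i}.pdf" for i, _ in enumerate(ranges, start=1)]

for name, (start, end) in zip(names, ranges):
    # Convert back to 1-based page numbers for display only
    print(f"{name}: pages {start + 1}-{end}")
```

If the page count is not a multiple of the part size, the last range is simply shorter.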

Problem 2.

I don’t think there is an option to rotate PDFs as of today. However, you can rotate pictures, so you can convert the PDF to pictures and then rotate those.
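
Whichever rotation step you end up using, it may help to first work out which pages actually need rotating. The sketch below only does that bookkeeping in plain Python and does not rotate anything. It assumes the column names from the example dataset, that `page_number` is 1-based, and that rotations are multiples of 90 degrees. `build_rotation_plan` is a hypothetical helper name.

```python
from collections import defaultdict


def build_rotation_plan(rows):
    """Group rows into {pdf_path: {zero_indexed_page: degrees}}.

    Pages with a rotation of 0 are skipped since they need no change.
    """
    plan = defaultdict(dict)
    for row in rows:
        degrees = int(row["rotation_degree"]) % 360
        if degrees % 90 != 0:
            raise ValueError(f"Unsupported rotation: {row['rotation_degree']}")
        if degrees == 0:
            continue
        # Assumption: page_number is 1-based in the dataset
        page_index = int(row["page_number"]) - 1
        plan[row["pdf_path"]][page_index] = degrees
    return dict(plan)


# Rows copied from the example dataset in the question
rows = [
    {"document_id": "DOC001", "pdf_path": "/files/scan1.pdf", "page_number": 1, "rotation_degree": 90},
    {"document_id": "DOC001", "pdf_path": "/files/scan1.pdf", "page_number": 2, "rotation_degree": 0},
    {"document_id": "DOC001", "pdf_path": "/files/scan1.pdf", "page_number": 3, "rotation_degree": 270},
    {"document_id": "DOC002", "pdf_path": "/files/scan2.pdf", "page_number": 1, "rotation_degree": 180},
]

plan = build_rotation_plan(rows)
```

Only the pages listed in the plan would then need to go through the convert-to-picture and rotate steps.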

More broadly about rotation: depending on your use case, this might not even be necessary. For example, running a layout model on a document is already executed post-rotation.


Thank you for your reply.

I hope this feature becomes available soon.