Is it possible to access files embedded into a PDF or Word document?

I have PDFs and Word docs that may contain other PDFs, Word docs, and Excel files embedded in them. Is there a way to access any of these embedded files?


Hi @Joel,

For the PDFs, you should be able to do this programmatically with PyMuPDF: https://pymupdf.readthedocs.io/en/latest/app2.html

import fitz  # PyMuPDF

my_pdf = fitz.open("my.pdf")
print(my_pdf.embfile_names())  # names of the files embedded in the PDF

embfile_names() returns the names of the embedded files as a list (e.g. ['embed_1.pdf', 'embed_2.pdf', ...]). You can then fetch each one with my_pdf.embfile_get('embed_1.pdf'), which returns that file's content as bytes.
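Putting those two calls together, here is a minimal sketch that saves every embedded file to a local folder. The function name save_embedded_files and the out_dir parameter are my own, not part of PyMuPDF; the only library calls it relies on are embfile_names() and embfile_get() on an already opened document.

```python
from pathlib import Path


def save_embedded_files(doc, out_dir):
    """Write every file embedded in an open PyMuPDF document into out_dir.

    `doc` is assumed to be a document returned by fitz.open(...).
    Returns the list of paths that were written.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)

    written = []
    for name in doc.embfile_names():
        data = doc.embfile_get(name)  # content of the embedded file as bytes
        # Keep only the last path component so an unusual entry name
        # cannot point outside out_dir.
        target = out_path / Path(name).name
        target.write_bytes(data)
        written.append(target)
    return written
```

You would call it as save_embedded_files(fitz.open("my.pdf"), "embedded_out"). Since an embedded PDF can itself contain embedded files, you may need to open each extracted PDF and run the same function again.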

To write those bytes out to a file in a dataset, refer to the Foundry code example on working with raw files: https://www.palantir.com/docs/foundry/code-examples/raw-file-parsing-transforms#copy-raw-files-between-datasets
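For the Word half of the question, PyMuPDF is not needed. A .docx file is a ZIP package, and as far as I know embedded objects are stored under the word/embeddings/ folder inside it. Below is a rough sketch using only the standard library; the function name extract_docx_embeddings is my own, and it assumes the newer .docx format, not legacy .doc files.

```python
import zipfile
from pathlib import Path


def extract_docx_embeddings(docx_path, out_dir):
    """Copy everything under word/embeddings/ in a .docx package to out_dir.

    Assumes docx_path is an Office Open XML (.docx) file, i.e. a ZIP archive.
    Returns the list of paths that were written.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)

    written = []
    with zipfile.ZipFile(docx_path) as package:
        for member in package.namelist():
            # Skip everything outside the embeddings folder, and folder entries.
            if not member.startswith("word/embeddings/") or member.endswith("/"):
                continue
            target = out_path / Path(member).name
            target.write_bytes(package.read(member))
            written.append(target)
    return written
```

Embedded Word and Excel files should come out as regular .docx and .xlsx files. Other embedded types, PDFs included, usually appear as oleObject*.bin OLE containers, so those would need a further unpacking step to get the original file back.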

Let me know if this works in your use case!
