I have PDFs and Word docs that may contain other PDFs, Word docs, and Excel files embedded in them. Is there a way to access any of these embedded files?
Hi @Joel,
You should be able to do this programmatically for PDFs with PyMuPDF: https://pymupdf.readthedocs.io/en/latest/app2.html
import fitz  # PyMuPDF
my_pdf = fitz.open("my.pdf")
# List the names of the files embedded in the PDF
print(my_pdf.embfile_names())
This should output the names of the embedded files as a list (['embed_1.pdf', 'embed_2.pdf', ...]). You can then access each one with my_pdf.embfile_get('embed_1.pdf'), which returns the file's content as bytes.
To write those bytes out to a file in Foundry, refer to the code example on working with raw files: https://www.palantir.com/docs/foundry/code-examples/raw-file-parsing-transforms#copy-raw-files-between-datasets
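For reference, here is a rough sketch of the extraction step on its own, writing to a local folder rather than a Foundry dataset. The helper names (save_embedded_files, save_embedded_files_from_pdf) and the output folder are made up for illustration; only embfile_names() and embfile_get() come from PyMuPDF. Inside a transform you would swap the local write for the dataset filesystem shown in the linked example.

```python
from pathlib import Path


def save_embedded_files(doc, out_dir):
    """Write every file embedded in `doc` to `out_dir` and return the paths.

    `doc` is assumed to be an open PyMuPDF document, i.e. anything that
    provides embfile_names() and embfile_get(name).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name in doc.embfile_names():
        # Keep only the last path component so an unusual embedded name
        # cannot end up outside out_dir.
        target = out_dir / Path(name).name
        target.write_bytes(doc.embfile_get(name))
        written.append(target)
    return written


def save_embedded_files_from_pdf(pdf_path, out_dir):
    """Hypothetical convenience wrapper: open a PDF and extract its attachments."""
    import fitz  # PyMuPDF; imported here so the helper above stays dependency-free

    doc = fitz.open(pdf_path)
    try:
        return save_embedded_files(doc, out_dir)
    finally:
        doc.close()
```

Calling save_embedded_files_from_pdf("my.pdf", "extracted") would write each attachment into an extracted/ folder. If an extracted file is itself a PDF, you could run the same function on it again to pick up nested attachments.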
Let me know if this works in your use case!