Text Extraction From Word Documents in a Foundry Dataset

scottm · March 7, 2025, 6:21pm

I have an extremely large collection of Word documents that I need to organize. I have little experience with the Foundry API, and as a result the first step, text extraction in Python, has left me at a loss. I have tried a regular dataset, regular files, and a media set. Unfortunately, my code is in a secure environment with restricted internet access. I will try to provide some examples, but I feel like text extraction should be simple.

Below is my overall process:

Extract the text from each Word document using a library like docx or python-docx.
Create a pandas DataFrame with two columns: filename and text.
Apply preprocessing techniques using NLTK to the text.
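
A minimal sketch of the first two steps, written against the Python standard library only, since the environment has restricted internet access and python-docx may not be installable there. It relies on the fact that a .docx file is a ZIP archive whose body text lives in word/document.xml. The helper names extract_docx_text and build_rows are hypothetical, and the sketch assumes the raw file bytes are already in hand; how they are read out of Foundry is not shown. It reads only the main document body, so headers, footers, and footnotes are ignored.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml inside a .docx archive.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_docx_text(data: bytes) -> str:
    """Return the body text of a .docx file, one line per paragraph (hypothetical helper)."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across runs; join every <w:t> node.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        paragraphs.append(text)
    return "\n".join(paragraphs)


def build_rows(files: dict) -> list:
    """Map {filename: raw bytes} to rows with 'filename' and 'text' keys (hypothetical helper)."""
    rows = []
    for name, data in sorted(files.items()):
        try:
            text = extract_docx_text(data)
        except (zipfile.BadZipFile, KeyError, ET.ParseError):
            # Not a readable .docx; keep the row so the file can be reviewed later.
            text = None
        rows.append({"filename": name, "text": text})
    return rows
```

The rows returned by build_rows can be passed straight to pandas.DataFrame(rows) to get the filename and text columns, and the text column can then go through NLTK. If python-docx is available, swapping it in for extract_docx_text would be the more robust choice for complex documents.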

Isy · March 10, 2025, 11:50am

Hi,

We’d recommend using media sets, as they have many features out of the box that make workflows like this easy.

You can create a media set by going to your project/folder, clicking ‘New’ and then selecting Media set. When creating the media set, you will want to set it to be a Document media set, and allow DOCX to be an Additional media input format. This means that when you upload the Word Document, it will get converted to a PDF, but you can still access the original file.

You should then be able to write a Python transform that extracts the text by following the docs here. Alternatively, you could use Pipeline Builder to do this instead.

You can also access metadata (e.g., filename) in both Python transforms and Pipeline Builder, so you can create your two columns.

Hopefully that helps!

scottm · March 10, 2025, 2:03pm

Hi Isy,

Thank you for the reply. Unfortunately, when I follow these steps I get an error message saying the upload failed. I wrote a lot of code for UDFs up through this weekend, but right now I’m limited to manually importing my data into a Code Workspace and then writing it to a DataFrame.

My steps:

Create New Media Set
Select Document
Select DOCX as additional media input format
Select Transactionless as write mode (I also tried Transactional)
Drag files into drop files box
Receive error message: Upload failed

Isy · March 10, 2025, 2:04pm

Do you get any other error messages in addition to the ‘Upload failed’ message?

scottm · March 10, 2025, 7:30pm

Isy,

Not that I can see. I’d like to resolve the issue. I did check that my files were in fact docx. Can I reach out to an engineer or developer directly and then post the solution here?

Scott

jenny · March 11, 2025, 1:44am

Hi Isy,

I’m facing a similar issue and have a couple of questions to add to the conversation.

When uploading .docx files to Foundry, is it possible to use a Foundry dataset instead of a media set? Additionally, if I want to preserve the original file extension and avoid converting it to a PDF, can this be achieved within a media set?
For complex Word documents, how does the built-in Palantir solution compare to the docx library in terms of functionality and performance? I currently have code that utilizes the docx library to extract text from Word documents, so I’m interested in understanding the pros and cons of each approach.

Thank you to Scott for bringing up this topic and allowing me to piggyback onto your question with some additional inquiries.

Best regards,
Jenny

system · May 10, 2025, 1:44am

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.