I have an extremely large collection of Word documents that I need to organize. I have little experience with the Foundry API, and as a result the first step, text extraction in Python, has left me at a loss. I have tried a regular dataset, regular files, and a media set. Unfortunately, my code is in a secure environment with restricted internet access. I will try to provide some examples, but I feel like text extraction should be simple.
Below is my overall process:
Extract the text from each Word document using a library such as python-docx (imported as docx).
Create a pandas DataFrame with two columns: filename and text.
Apply preprocessing techniques to the text using NLTK.
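The three steps above can be sketched as follows. This is a minimal sketch, not a Foundry-specific solution: because the environment has restricted internet access, it assumes only the Python standard library and reads the .docx archive directly instead of using python-docx. The helper names (extract_docx_text, simple_tokens, build_rows) are hypothetical, and the regex tokenizer is only a stand-in for NLTK preprocessing.

```python
import re
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path

# WordprocessingML namespace used by word/document.xml inside a .docx archive.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_docx_text(source):
    """Return the paragraph text of a .docx file (path or binary file-like object)."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across <w:t> runs; join them back together.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text.strip():
            paragraphs.append(text)
    return "\n".join(paragraphs)


def simple_tokens(text):
    """Lowercase word tokens; a simple stand-in for NLTK tokenization (assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_rows(folder):
    """Build one {'filename', 'text', 'tokens'} record per .docx file in a folder."""
    rows = []
    for path in sorted(Path(folder).glob("*.docx")):
        text = extract_docx_text(path)
        rows.append(
            {"filename": path.name, "text": text, "tokens": simple_tokens(text)}
        )
    return rows
```

If pandas is available, pandas.DataFrame(build_rows(folder)) gives the filename/text table described above. With python-docx installed, joining the text of docx.Document(path).paragraphs should give similar body text. Note that this sketch ignores headers, footers, tabs, and line breaks, and may handle text boxes imperfectly.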
We’d recommend using media sets, as they have many built-in features that make workflows like this easy.
You can create a media set by going to your project/folder, clicking ‘New’, and then selecting Media set. When creating the media set, set it to be a Document media set and add DOCX as an Additional media input format. This means that when you upload a Word document, it will be converted to a PDF, but you can still access the original file.
Thank you for the reply. Unfortunately, when I follow these steps I get an error message saying the upload failed. I wrote a lot of code for UDFs up to this weekend, but right now I’m limited to manually importing my data into a code workspace and then writing it to a DataFrame.
My steps:
Create New Media Set
Select Document
Select DOCX as additional media input format
Select Transactionless as write mode (I also tried transactional)
Drag files into drop files box
Receive error message: upload failed
Not that I can see. I’d like to resolve the issue. I did check that my files were in fact docx. Can I reach out to an engineer or developer directly and then post the solution here?
I’m facing a similar issue and have a couple of questions to add to the conversation.
When uploading .docx files to Foundry, is it possible to use a Foundry dataset instead of a media set? Additionally, if I want to preserve the original file extension and avoid converting it to a PDF, can this be achieved within a media set?
For complex Word documents, how does the built-in Palantir solution compare to the docx library in terms of functionality and performance? I currently have code that utilizes the docx library to extract text from Word documents, so I’m interested in understanding the pros and cons of each approach.
Thank you to Scott for bringing up this topic and allowing me to piggyback onto your question with some additional inquiries.