Hi,
I am planning to use the “Unstructured IO” library to parse PDFs. However, I encountered dependency issues when trying to import the library normally, as described in the quick start guide.
As a workaround, I decided to use the Docker option and uploaded the Docker image to our artifact repository. The documentation mentions using a sidecar decorator, but I am unable to find instructions on how to use the Unstructured IO library from that image in our code repository.
Could you please help me with this?
You could use the bring-your-own-container (BYOC) workflow available in lightweight transforms:
https://palantir.com/docs/foundry/transforms-python/lightweight-examples/
I do think you could make the unstructured library work in a regular code repository as well, but you will need to add its system dependencies through conda-forge. It's worth giving it a shot.
Thank you for your response.
I will try the lightweight option.
However, I have already attempted to add dependencies through conda-forge, including the recommended libraries listed below, but I encountered issues. The system continuously prompted me to install additional libraries that were not listed. After installing those libraries, it still requested the same ones.
- libmagic-dev: Essential for filetype detection.
- poppler-utils: Needed for images and PDFs.
- tesseract-ocr: Essential for images and PDFs.
- libreoffice: For MS Office documents.
- pandoc: For EPUBs, RTFs, and Open Office documents. Please note that to handle RTF files, you need version 2.14.2 or newer. Running this script will install the correct version for you.
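Since the install keeps asking for the same libraries, it may help to check which of these system dependencies are actually visible at runtime before importing unstructured. Below is a minimal sketch using only the Python standard library. The executable names (pdftoppm for poppler-utils, tesseract, soffice for libreoffice, pandoc) are my assumption about what each package puts on PATH, and the helper names parse_version and check_system_dependencies are made up for illustration. Note that the conda-forge package names may differ from the apt names listed above.

```python
import ctypes.util
import re
import shutil
import subprocess

# Executable each package is assumed to place on PATH (illustrative mapping).
REQUIRED_TOOLS = {
    "poppler-utils": "pdftoppm",
    "tesseract-ocr": "tesseract",
    "libreoffice": "soffice",
    "pandoc": "pandoc",
}
# Minimum pandoc version needed for RTF files, per the list above.
MIN_PANDOC = (2, 14, 2)


def parse_version(text):
    """Return the first dotted version number in text as a tuple of ints, or None."""
    match = re.search(r"(\d+(?:\.\d+)+)", text)
    return tuple(int(part) for part in match.group(1).split(".")) if match else None


def check_system_dependencies():
    """Return a {name: bool} map saying whether each dependency was found."""
    status = {pkg: shutil.which(exe) is not None for pkg, exe in REQUIRED_TOOLS.items()}
    # libmagic is a shared library rather than an executable.
    status["libmagic"] = ctypes.util.find_library("magic") is not None
    if status["pandoc"]:
        out = subprocess.run(
            ["pandoc", "--version"], capture_output=True, text=True
        ).stdout
        version = parse_version(out)
        status["pandoc>=2.14.2"] = version is not None and version >= MIN_PANDOC
    return status


if __name__ == "__main__":
    for name, found in check_system_dependencies().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

Running this in the same environment where the import fails (for example inside the transform, or inside the Docker image) should show whether the packages added through conda-forge are really being picked up, or whether the environment that executes the code differs from the one where they were installed.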