Using artifact image in code repository

Jacob_SE · October 22, 2024, 10:53am

Hi,

I am planning to use the “Unstructured IO” library to parse PDFs. However, I encountered dependency issues when trying to import the library normally, as described in the quick start guide.

As a workaround, I decided to use the Docker option and uploaded the Docker image to our artifact repository. The documentation mentions using a sidecar decorator, but I cannot find instructions on how to use the Unstructured IO library from that image in our code repository.

Could you please help me with this?

nicornk · October 22, 2024, 5:52pm

You could use the bring-your-own-container workflow for lightweight transforms:

https://palantir.com/docs/foundry/transforms-python/lightweight-examples/

I do think you could make the unstructured library work in a regular code repository as well. You will need to add the system dependencies through conda-forge. Worth giving it a shot.

Jacob_SE · October 24, 2024, 6:48am

Thank you for your response.

I will try the lightweight option.

However, I have already tried adding the dependencies through conda-forge, including the recommended libraries listed below, but I ran into issues. The system kept prompting me to install additional libraries that were not on the list, and after I installed those, it still requested the same ones.

libmagic-dev: Essential for filetype detection.
poppler-utils: Needed for images and PDFs.
tesseract-ocr: Essential for images and PDFs.
libreoffice: For MS Office documents.
pandoc: For EPUBs, RTFs, and Open Office documents. Please note that to handle RTF files, you need version 2.14.2 or newer. Running this script will install the correct version for you.
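
To narrow down which of the dependencies above is actually missing from an environment, a small check like the following might help. It is only a sketch: the executable names in REQUIRED_BINARIES (pdftoppm, tesseract, soffice, pandoc) are my assumptions about what each package normally puts on PATH, the helper names are hypothetical, and nothing here is part of the Unstructured library or of Foundry.

```python
import ctypes.util
import shutil

# Assumed mapping: dependency name -> an executable it usually provides.
# Adjust these names if your environment installs them differently.
REQUIRED_BINARIES = {
    "poppler-utils": "pdftoppm",
    "tesseract-ocr": "tesseract",
    "libreoffice": "soffice",
    "pandoc": "pandoc",
}


def find_missing(binaries=REQUIRED_BINARIES, which=shutil.which):
    """Return the dependency names whose executable is not found on PATH."""
    return [name for name, exe in binaries.items() if which(exe) is None]


def libmagic_available(find_library=ctypes.util.find_library):
    """libmagic is a shared library, so look it up through the linker instead of PATH."""
    return find_library("magic") is not None


def parse_version(text):
    """Turn a dotted version string such as '2.14.2' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def pandoc_is_new_enough(version_text, minimum=(2, 14, 2)):
    """RTF handling needs pandoc 2.14.2 or newer, per the list above."""
    return parse_version(version_text) >= minimum


if __name__ == "__main__":
    print("Missing executables:", find_missing())
    print("libmagic found:", libmagic_available())
```

Running this inside the environment where the import fails should show whether the conda-forge packages actually put the expected tools in place. The pandoc version string can be taken from the first line of `pandoc --version` and passed to pandoc_is_new_enough.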