Extracting tables from PDFs in a structured format

Does anyone have code examples for using open source libraries to extract tables from PDFs and Word docs into a structured dataset?

Alternatively, is there a good low-code way of doing this in PB (e.g. using an LLM block to extract into a struct)?

Have you tried the OCR support on Media Sets?
I'm not sure how well this does with tables, but Pipeline Builder supports this and it should be easy to test.

If your PDF is digitally born (i.e. not a scan), there are ways to consistently extract its data using PyPDF2.

Link to PyPDF2
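
For what it's worth, here is a rough sketch of the PyPDF2 route, not a definitive implementation. It assumes a recent PyPDF2 (3.x) is installed and that the PDF has a real text layer. The helper names (extract_page_text, rows_from_text) are made up for illustration, and the parsing step assumes cells come out separated by a tab or by two or more spaces, which will not hold for every PDF.

```python
import re


def extract_page_text(pdf_path):
    """Return one text string per page (assumes PyPDF2 3.x is installed)."""
    # Imported here so the parsing helper below can be used without PyPDF2.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    # Fall back to an empty string for pages that yield no text.
    return [page.extract_text() or "" for page in reader.pages]


def rows_from_text(text, columns):
    """Turn extracted text into a list of dicts keyed by the table's column names.

    Assumption: cells on a line are separated by a tab or by two or more spaces.
    Lines that do not split into exactly len(columns) cells are skipped.
    """
    rows = []
    for line in text.splitlines():
        cells = [c.strip() for c in re.split(r"\s{2,}|\t", line.strip())]
        if len(cells) != len(columns):
            continue  # not a table row under the assumption above
        if cells == list(columns):
            continue  # skip the header row itself
        rows.append(dict(zip(columns, cells)))
    return rows
```

With a hypothetical file and column names, usage would look like rows_from_text(extract_page_text("report.pdf")[0], ["Name", "Qty", "Price"]). All values come back as strings, so any casting to numbers or dates would still need to be done afterwards.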

Yep, OCR should work well for extracting the text, but I'm specifically interested in keeping the same schema as the table in the PDF.