Extracting tables from PDFs in a structured format

dmirza · June 3, 2024, 4:56am

Does anyone have code examples for using open source libraries to extract tables from PDFs and Word documents into a structured dataset?

Alternatively, is there a good low-code way of doing this in Pipeline Builder (PB), e.g. using an LLM block to extract into a struct?

mtelling · June 3, 2024, 11:45am

Have you tried the OCR support on Media Sets?
I’m not sure how well it handles tables, but Pipeline Builder supports it, so it should be easy to test.

bwolz · June 3, 2024, 2:01pm

If your PDF is born-digital (generated by software rather than scanned), there are ways to extract its data consistently using PyPDF2.

Link to PyPDF2
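
For a born-digital PDF, a minimal sketch might look like the following. It assumes PyPDF2 2.0 or later (where `PdfReader` and `page.extract_text()` are available) and, more importantly, that the extracted text keeps at least two spaces between columns. That is not guaranteed and depends on how the PDF was produced. The helper names `extract_page_text`, `rows_from_text`, and `table_to_records` are made up for illustration.

```python
import re


def extract_page_text(pdf_path, page_index=0):
    """Return the text layer of one page (only useful for born-digital PDFs)."""
    # Imported here so the parsing helpers below can be used without PyPDF2.
    from PyPDF2 import PdfReader  # assumes PyPDF2 >= 2.0

    reader = PdfReader(pdf_path)
    return reader.pages[page_index].extract_text() or ""


def rows_from_text(text, min_gap=2):
    """Split each non-empty line into cells on runs of `min_gap` or more spaces.

    Assumption: columns are separated by wider gaps than the words inside a cell.
    """
    splitter = re.compile(r"\s{%d,}" % min_gap)
    return [
        splitter.split(line.strip())
        for line in text.splitlines()
        if line.strip()
    ]


def table_to_records(rows):
    """Treat the first row as the header and return one dict per data row."""
    if not rows:
        return []
    header, *body = rows
    # Rows whose width differs from the header are skipped rather than guessed at.
    return [dict(zip(header, cells)) for cells in body if len(cells) == len(header)]
```

Usage would be something like `table_to_records(rows_from_text(extract_page_text("report.pdf")))`, where `report.pdf` is a placeholder path. All values come back as strings, and tables with merged cells or wrapped text will likely need a different approach.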

dmirza · June 4, 2024, 4:02am

Yep, OCR should work well for extracting the text, but I'm specifically interested in keeping the same schema as the table in the PDF.
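
To make "keeping the same schema" concrete, here is a rough sketch of what I mean, using only the Python standard library. It assumes some extractor has already produced a header row and a list of string rows. The `SCHEMA` columns and the `conform` helper are hypothetical and would need to match the actual table.

```python
# Hypothetical target schema: column name -> converter, in output order.
SCHEMA = {"item": str, "quantity": int, "unit_price": float}


def normalize(name):
    """Turn a header cell such as 'Unit Price' into 'unit_price'."""
    return name.strip().lower().replace(" ", "_")


def conform(header, body, schema=SCHEMA):
    """Map raw string cells onto the schema's columns and types, by header name."""
    names = [normalize(h) for h in header]
    missing = [col for col in schema if col not in names]
    if missing:
        raise ValueError(f"table is missing columns: {missing}")

    records = []
    for cells in body:
        if len(cells) != len(names):
            continue  # skip rows that do not line up with the header
        raw = dict(zip(names, cells))
        # Only schema columns are kept, in schema order, cast to the schema type.
        records.append({col: cast(raw[col].strip()) for col, cast in schema.items()})
    return records
```

Matching on header names rather than column positions means the output schema stays stable even if the extractor returns the columns in a different order.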

215252359425d568f8c1 · January 28, 2025, 11:12am

Hi dmirza,
Have you found a solution to your question? I’m looking for the same thing. Thanks!