Does anyone have code examples for using open source libraries to extract tables from PDFs and Word documents into a structured dataset?
Alternatively, is there a good low-code way of doing this in Pipeline Builder (e.g. using an LLM block to extract into a struct)?
Have you tried the OCR support on Media Sets?
I’m not sure how well it handles tables, but Pipeline Builder supports this and it should be easy to test.
If your PDF was born digital (i.e. not scanned), you can extract its contents consistently with PyPDF2.
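For reference, here is a minimal sketch of what the PyPDF2 route might look like. `PdfReader` and `extract_text()` are PyPDF2 calls; the helper names `extract_page_texts` and `split_rows` are made up for illustration, and the row splitting is only a heuristic that assumes the extracted text keeps gaps of two or more spaces between columns, which will not hold for every PDF.

```python
import re


def extract_page_texts(pdf_path):
    """Return the extracted text of each page of a born-digital PDF."""
    # Imported lazily so split_rows() below can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    # extract_text() can come back empty for scanned (image-only) pages.
    return [page.extract_text() or "" for page in reader.pages]


def split_rows(page_text, min_gap=2):
    """Split text into rows of cells (hypothetical heuristic).

    Runs of at least `min_gap` whitespace characters are treated as
    column breaks; blank lines are skipped.
    """
    gap = re.compile(r"\s{%d,}" % min_gap)
    rows = []
    for line in page_text.splitlines():
        line = line.strip()
        if line:
            rows.append(gap.split(line))
    return rows
```

Usage would be something like `rows = split_rows(extract_page_texts("report.pdf")[0])`, where `report.pdf` is a placeholder path. If the columns come out merged, the text layer probably does not preserve spacing and a different approach is needed.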
Yep, OCR should work well for extracting the text, but I’m specifically interested in keeping the same schema as the table in the PDF.
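To make "keeping the same schema" concrete, here is a rough sketch of one way to do it once the table is available as a list of rows (lists of cell strings) from whichever extractor is used. It assumes the first row is the header; `rows_to_records` is a hypothetical helper, not part of any library.

```python
def rows_to_records(rows):
    """Use the first row as the header and return one dict per data row."""
    if not rows:
        return []
    header = [name.strip() for name in rows[0]]
    records = []
    for row in rows[1:]:
        # Pad short rows with None and drop extra cells so that every
        # record has exactly the header's column names as keys.
        cells = list(row[:len(header)]) + [None] * (len(header) - len(row))
        records.append(dict(zip(header, cells)))
    return records
```

Each record then carries the PDF table's column names, so it can be written out as a dataset with the same columns. Values stay as strings; type casting would be a separate step.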
Hi dmirza,
Have you found a solution to your question? I’m looking for the same thing, thanks.