Does anyone have code examples for using open source libraries to extract tables from PDFs and Word documents into a structured dataset?
Alternatively, is there a good low-code way of doing this in Pipeline Builder (e.g. using an LLM block to extract into a struct)?
Have you tried the OCR support on Media Sets?
I'm not sure how well this does with tables, but Pipeline Builder supports this and it should be easy to test.
If your PDF is digitally born (it has a real text layer rather than scanned page images), you can extract its text consistently with PyPDF2.
Link to PyPDF2
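For reference, here is a minimal sketch of the PyPDF2 route, assuming PyPDF2 2.x or later is installed. `PdfReader` and `extract_text()` are PyPDF2 calls; the function names `clean_lines` and `extract_page_lines` are just placeholders I made up for illustration. It only gives you text lines per page, so it will return little or nothing for scanned PDFs.

```python
from typing import List


def clean_lines(page_text: str) -> List[str]:
    """Split extracted page text into non-empty lines, trimming trailing whitespace."""
    return [line.rstrip() for line in page_text.splitlines() if line.strip()]


def extract_page_lines(pdf_path: str) -> List[List[str]]:
    """Return the text lines of each page of a digitally born PDF."""
    # Imported here so clean_lines stays usable without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    # extract_text() may yield an empty string for image-only pages,
    # so fall back to "" before splitting into lines.
    return [clean_lines(page.extract_text() or "") for page in reader.pages]
```

Usage would be something like `pages = extract_page_lines("report.pdf")`, where the path is hypothetical. Note that this gives you raw text lines, not table cells, so any table structure still has to be rebuilt afterwards.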
Yep, OCR should work well for extracting the text, but I'm specifically interested in keeping the same schema as the table in the PDF.
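One possible way to keep the schema, sketched under two assumptions that may not hold for your files: the extracted text keeps columns separated by two or more spaces, and the first line of the table is the header row. Neither OCR nor plain text extraction guarantees that layout, so treat this as a starting point. The name `lines_to_records` is hypothetical.

```python
import re
from typing import Dict, List

# Assumption: columns are separated by runs of two or more whitespace characters.
COLUMN_GAP = re.compile(r"\s{2,}")


def lines_to_records(lines: List[str]) -> List[Dict[str, str]]:
    """Turn table text lines into dicts keyed by the header row's column names."""
    if not lines:
        return []
    header = COLUMN_GAP.split(lines[0].strip())
    records = []
    for line in lines[1:]:
        cells = COLUMN_GAP.split(line.strip())
        # Skip rows whose cell count does not match the header
        # (e.g. totals, captions, or wrapped text).
        if len(cells) != len(header):
            continue
        records.append(dict(zip(header, cells)))
    return records
```

Each record then carries the PDF table's column names as keys, so it maps fairly directly onto a struct or dataset schema. All values stay as strings here; type casting would be a separate step.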