Multiple Column PDF Document Parsing

michelle · October 14, 2024, 4:22pm

I have a media set containing many documents: a mix of scientific papers, theses, etc. Each document may have one or more columns of text. They also contain images and charts (I can skip processing those for now). Does anyone have suggestions for extracting the document text while keeping related text together?

jvelayvitow · October 16, 2024, 12:06am

My first thought: if the text extraction on a two-column page preserves the whitespace between the columns, you could split the output on that gap and then group the even entries (left column) and the odd entries (right column) to put the text back in reading order. This depends on whether the OCR actually emits the gap as spaces, though.
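
Here's a rough sketch of that idea in Python, in case it helps. It works line by line instead of splitting the whole output at once, and it assumes the extractor keeps the column gap as a run of three or more spaces (that threshold is a guess, so check it against your own output). The function name `reflow_two_columns` is just something I made up for illustration.

```python
import re

# Assumption: the gap between columns shows up as 3+ consecutive spaces.
GUTTER = re.compile(r" {3,}")

def reflow_two_columns(page_text: str) -> str:
    """Return the left column's lines followed by the right column's lines."""
    left, right = [], []
    for line in page_text.splitlines():
        line = line.rstrip()
        if not line:
            continue  # skip blank lines
        # Split at the first wide gap only: piece 0 is left, piece 1 is right.
        parts = GUTTER.split(line, maxsplit=1)
        if parts[0]:
            left.append(parts[0])
        if len(parts) == 2:
            right.append(parts[1])
    return "\n".join(left + right)

page = (
    "Alpha one     Beta one\n"
    "Alpha two     Beta two\n"
    "              Beta three"
)
print(reflow_two_columns(page))
```

Caveats: a full-width line (title, abstract) ends up in the left column, a left-column line indented by three or more spaces gets mistaken for right-column text, and pages with more than two columns would need a different split. If the OCR collapses the gap to a single space, this won't work at all.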