I am using pipeline builder to extract text from PDFs using the layout-aware method (text and tables). During extraction I am having an issue with product codes, which are alphanumeric. For example, “ABGH00WA” is extracted as “ABGHOOWA”.
It does not happen every time, but in some cases the digit 0 is confused with the letter O, and in other cases “IQ” is confused with “1Q”.
FYI: my end goal is to create a chat assistant.
How can I overcome these issues? Any help will be appreciated.
This is a common challenge when extracting text from PDFs, especially with layout-aware methods. Optical Character Recognition (OCR) engines can sometimes confuse similar-looking characters, such as “0” (zero) and “O” (capital O), or “I” (uppercase I), “1” (one), and “l” (lowercase L).
Here are a few strategies to help mitigate these issues:
Post-Processing with Validation Rules
If your product codes follow a specific pattern (e.g., always 8 characters, with digits allowed only in certain positions), you can use regular expressions or custom validation logic to detect common mistakes and, where the pattern makes the intended character unambiguous, correct them (e.g., replacing “O” with “0” in a position that must hold a digit).
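As a rough illustration of this kind of rule, here is a minimal Python sketch. It assumes a hypothetical code format of 4 letters, 2 digits, 2 letters (inferred only from the single example “ABGH00WA”, so adjust `CODE_SHAPE` and `CODE_RE` to your real format). The names `fix_code`, `TO_DIGIT`, and `TO_LETTER` are made up for this example.

```python
import re

# ASSUMPTION: codes look like 4 letters + 2 digits + 2 letters ("ABGH00WA").
# "L" marks a letter position, "D" marks a digit position.
CODE_SHAPE = "LLLLDDLL"
CODE_RE = re.compile(r"[A-Z]{4}[0-9]{2}[A-Z]{2}")

# Look-alike characters, applied only where the position makes them invalid.
TO_DIGIT = {"O": "0", "o": "0", "I": "1", "l": "1"}
TO_LETTER = {"0": "O", "1": "I"}


def fix_code(raw):
    """Return the corrected code, or None if it cannot be made valid."""
    if len(raw) != len(CODE_SHAPE):
        return None
    fixed = []
    for ch, kind in zip(raw, CODE_SHAPE):
        if kind == "D":
            fixed.append(TO_DIGIT.get(ch, ch))   # digit slot: O -> 0, I -> 1
        else:
            fixed.append(TO_LETTER.get(ch, ch))  # letter slot: 0 -> O, 1 -> I
    candidate = "".join(fixed)
    return candidate if CODE_RE.fullmatch(candidate) else None


print(fix_code("ABGHOOWA"))
```

This only works when each position accepts either letters or digits but not both. If a position can hold both, a rule like this cannot decide between “O” and “0”, and a reference list is the safer option.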
Dictionary or Reference List
If you have a list of valid product codes, you can match extracted codes against this list and use fuzzy matching (e.g., Levenshtein distance) to auto-correct minor extraction errors.
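A minimal sketch of the reference-list approach, in plain Python with no external libraries. The list `VALID_CODES` is invented for illustration, and `closest_code` and its `max_distance` threshold are hypothetical names and values that you would tune for your own data.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


# ASSUMPTION: made-up reference list; replace with your real product codes.
VALID_CODES = ["ABGH00WA", "ABGK17WB", "CDEF34XY"]


def closest_code(extracted, valid_codes, max_distance=2):
    """Return the unique nearest valid code, or None if none is close enough."""
    if extracted in valid_codes:
        return extracted
    scored = sorted((levenshtein(extracted, code), code) for code in valid_codes)
    if not scored or scored[0][0] > max_distance:
        return None
    # Two candidates at the same distance: too ambiguous to auto-correct.
    if len(scored) > 1 and scored[1][0] == scored[0][0]:
        return None
    return scored[0][1]


print(closest_code("ABGHOOWA", VALID_CODES))
```

Returning `None` on ties is a deliberate choice here: if two valid codes differ only in the characters that OCR tends to confuse, plain edit distance cannot pick between them, and those cases are better sent to review.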
Manual Review for Edge Cases
For critical applications, consider flagging low-confidence extractions for manual review.
Multiple Extraction Methods
Sometimes running more than one extraction method (e.g., both layout-aware and plain text extraction) and comparing results can help catch inconsistencies.
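One possible way to compare two extractions is sketched below in Python. It assumes you already have the text from each method as a plain string, and that codes are 8-character alphanumeric tokens; both are assumptions, as are the names `find_disagreements`, `fold`, and `TOKEN_RE`. The idea is to "fold" look-alike characters to a single form and report token pairs that match after folding but differ before it.

```python
import re

# Map look-alike characters to one canonical form for comparison only.
FOLD = str.maketrans({"O": "0", "o": "0", "I": "1", "l": "1"})

# ASSUMPTION: a product code is a standalone 8-character alphanumeric token.
TOKEN_RE = re.compile(r"\b[A-Za-z0-9]{8}\b")


def fold(token):
    return token.translate(FOLD)


def find_disagreements(text_a, text_b):
    """Pairs (token_from_a, token_from_b) that differ only by look-alikes."""
    tokens_a = set(TOKEN_RE.findall(text_a))
    tokens_b = set(TOKEN_RE.findall(text_b))
    by_fold_b = {}
    for token in tokens_b:
        by_fold_b.setdefault(fold(token), set()).add(token)
    conflicts = []
    for token in sorted(tokens_a):
        for other in sorted(by_fold_b.get(fold(token), ())):
            if other != token:
                conflicts.append((token, other))
    return conflicts


layout_text = "Item ABGHOOWA qty 3"  # e.g. output of the layout-aware method
plain_text = "Item ABGH00WA qty 3"   # e.g. output of plain text extraction
print(find_disagreements(layout_text, plain_text))
```

A disagreement does not tell you which version is right, only that the code is worth checking against a reference list or flagging for review.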
While it’s difficult to guarantee 100% accuracy with OCR and LLMs, combining these strategies can help reduce errors, especially for structured data like product codes.