Text extraction confuses 0 with O

Hi,

I am using Pipeline Builder to extract text from PDFs using the layout-aware method (text and tables). During extraction I am having an issue with product codes, which are alphanumeric. For example, “ABGH00WA” is extracted as “ABGHOOWA”.

This does not happen every time, but in some cases the digit 0 is confused with the letter O, and in some cases “IQ” is confused with “1Q”.

FYI: my end goal is to create a chat assistant.

How can I overcome these issues? Any help would be appreciated.

This is a common challenge when extracting text from PDFs, especially with layout-aware methods. Optical Character Recognition (OCR) engines can sometimes confuse similar-looking characters, such as “0” (zero) and “O” (capital o), or “I” (uppercase i), “1” (one), and “l” (lowercase L).

Here are a few strategies to help mitigate these issues:

  1. Post-Processing with Validation Rules

    If your product codes follow a specific pattern (e.g., always 8 characters, with letters and digits in fixed positions), you can use regular expressions or custom validation logic to detect and possibly correct common mistakes (e.g., replacing “O” with “0” in a position that must be a digit).

  2. Dictionary or Reference List

    If you have a list of valid product codes, you can match extracted codes against this list and use fuzzy matching (e.g., Levenshtein distance) to auto-correct minor extraction errors.

  3. Manual Review for Edge Cases

    For critical applications, consider flagging low-confidence extractions for manual review.

  4. Multiple Extraction Methods

    Sometimes running more than one extraction method (e.g., both layout-aware and plain text extraction) and comparing results can help catch inconsistencies.
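As a rough sketch of strategy 1, the snippet below corrects look-alike characters position by position. It assumes a code format of 4 letters, 2 digits, 2 letters, which is only a guess based on the single example “ABGH00WA”; `CODE_TEMPLATE`, `TO_DIGIT`, `TO_LETTER`, and `normalize_code` are hypothetical names and not part of Pipeline Builder. Adjust the template to your real code format before relying on it.

```python
from typing import Optional

# ASSUMPTION: 4 letters, 2 digits, 2 letters, inferred only from "ABGH00WA".
# L marks a position that must be a letter, D a position that must be a digit.
CODE_TEMPLATE = "LLLLDDLL"

# Look-alike characters, mapped in the direction each position type requires.
TO_DIGIT = {"O": "0", "I": "1", "l": "1"}
TO_LETTER = {"0": "O", "1": "I"}


def normalize_code(candidate: str) -> Optional[str]:
    """Return the corrected code, or None if it cannot fit the template."""
    if len(candidate) != len(CODE_TEMPLATE):
        return None
    fixed = []
    for ch, kind in zip(candidate, CODE_TEMPLATE):
        if kind == "D":
            ch = TO_DIGIT.get(ch, ch)
            if ch not in "0123456789":
                return None
        else:
            ch = TO_LETTER.get(ch, ch)
            if not ("A" <= ch <= "Z"):
                return None
        fixed.append(ch)
    return "".join(fixed)


print(normalize_code("ABGHOOWA"))  # letter O in the digit positions is replaced by 0
```

Codes for which `normalize_code` returns `None` do not fit the assumed format at all and are good candidates for the manual review described in strategy 3.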

While it’s difficult to guarantee 100% accuracy with OCR and LLMs, combining these strategies can help reduce errors, especially for structured data like product codes.
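As a rough sketch of strategy 2 (matching against a reference list), the snippet below computes a plain Levenshtein distance and accepts a correction only when exactly one valid code is closest and within a small threshold. The code list, the `max_distance` default of 2, and the function names are illustrative assumptions, not anything provided by Pipeline Builder.

```python
from typing import Iterable, Optional


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def match_code(extracted: str, valid_codes: Iterable[str],
               max_distance: int = 2) -> Optional[str]:
    """Return the unique closest valid code, or None to flag for review."""
    scored = sorted((levenshtein(extracted, code), code)
                    for code in set(valid_codes))
    if not scored or scored[0][0] > max_distance:
        return None  # nothing close enough
    if len(scored) > 1 and scored[1][0] == scored[0][0]:
        return None  # tie between candidates: too ambiguous to auto-correct
    return scored[0][1]


# Hypothetical reference list, for illustration only.
VALID_CODES = ["ABGH00WA", "ABGH01WB", "XYZT42QQ"]
print(match_code("ABGHOOWA", VALID_CODES))
```

Returning `None` on ties and on distant matches keeps the auto-correction conservative: a wrong product code in a chat assistant is usually worse than a code that is flagged for a human to check.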