When I configure AIP Document Intelligence, I can see that it’s picking up all aspects of my document very well, and the render on the right side shows that it identified the tables properly and renders them well. All good there!
I then build the associated code repo project without making any changes, running it on the same media set I started with. The resulting markdown looks very different from the preview, with over half of my tables improperly extracted: rows split in half, data from the last row placed outside of the table, and text within the table all jumbled up (in particular when a cell has text spanning 2 rows).
All of these got picked up perfectly fine in the preview, but something different must be happening between the preview and the code repo build because the output is very different.
Is there something I’m missing in the configuration, or should I expect that the preview is not meant to be representative of the data extracted by the code repo?
The Code Repository should behave the same as the preview in Document Intelligence. Apologies for the frustration here. We’re actively looking into this and trying to reproduce the discrepancy. Could you tell us roughly what your extraction configuration is (OCR vs. Vision LLM, which model, whether you have edited the prompt, etc.)? That will help us reproduce the issue on our end and resolve the inconsistency.
I’m using Generative AI with the Preprocess document option (ENG as language) and Layout-aware OCR, with the default system and user prompts and the default Claude 4 Sonnet model. Once the repo is generated, I do not change anything other than pointing my input and output datasets at the right places.
Thank you for taking the time to describe the mismatch you saw between the AIP Document Intelligence UI and the markdown produced by the generated code repo. We have reproduced the behavior and traced it to a regression introduced last November.
A fix has already been prepared and is being deployed to our production pipeline this week. Barring any unforeseen delays, it should reach the stack your project is using early next week. Once the rollout is complete, we recommend redeploying your Document Intelligence configuration to a new Python transform code repo, which should yield markdown that matches what you see in the preview. If you’d prefer to keep using your current code repo, you will need to upgrade the aip-workflows library in that repo to the latest version.
We’re sorry for the inconvenience and appreciate you flagging the issue, which helped us catch the regression quickly. Please reach out if anything looks off after the update or if you have any other questions.
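If it helps when checking the rebuilt output, here is a minimal sketch of a sanity check for the symptoms described above (split rows, cells falling outside the table). It is plain Python and not part of aip-workflows; the function name `find_ragged_table_rows` is hypothetical. It assumes the extracted markdown uses pipe-delimited tables where every row starts with `|`, and it does not handle escaped pipes inside cells.

```python
from __future__ import annotations


def find_ragged_table_rows(markdown: str) -> list[tuple[int, int, int]]:
    """Return (line_number, expected_cells, actual_cells) for every table row
    whose cell count differs from the first row of the same table."""
    problems = []
    expected = None  # cell count of the current table's first row
    for lineno, line in enumerate(markdown.splitlines(), start=1):
        stripped = line.strip()
        if not stripped.startswith("|"):
            expected = None  # a non-table line ends the current table
            continue
        # Drop exactly one leading pipe and, if present, one trailing pipe.
        inner = stripped[1:]
        if inner.endswith("|"):
            inner = inner[:-1]
        cells = inner.split("|")
        if expected is None:
            expected = len(cells)
        elif len(cells) != expected:
            problems.append((lineno, expected, len(cells)))
    return problems


if __name__ == "__main__":
    sample = "| a | b |\n|---|---|\n| 1 | 2 |\n| 3 |\n"
    for lineno, want, got in find_ragged_table_rows(sample):
        print(f"line {lineno}: expected {want} cells, found {got}")
```

An empty result does not prove the extraction is correct; it only rules out rows with a mismatched number of cells, so a visual comparison against the preview is still worthwhile.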
Thanks again for using AIP Document Intelligence!
Best regards,
Product Engineering Team | AIP Document Intelligence