What we built
We built a custom rich text editor widget inside a Foundry Action Form that lets users write a description with inline images. When the action is submitted, the full HTML is stored in a description field. The widget works as expected on the surface.
The issue
The HTML stored in the description field embeds images as base64 strings directly inside the markup. This creates a hard blocker for any downstream usage:
- AIP Logic can’t process it — the raw HTML with large base64 blobs is too heavy and unstructured for a language model to meaningfully analyse or act on the description content
- External applications can’t consume the images — downstream systems expect image URLs or binary references, not raw base64 strings embedded inside an HTML field
- Image-based analysis is blocked — images can’t be piped to vision models, OCR, or classification pipelines because they aren’t discrete objects; they’re buried inside the HTML string
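To make the problem concrete, here is a minimal sketch of what the stored field contains. The HTML snippet and the `data:` URI pattern are assumptions about how the editor embeds screenshots (the truncated base64 payload here is just an illustrative fragment, not a real image); the point is that each image is an opaque blob inside a string, not an addressable object:

```python
import base64
import re

# Hypothetical example of the HTML an Action Form might store,
# with a screenshot inlined as a base64 data URI.
stored_html = (
    '<p>Change the color of this button:</p>'
    '<img src="data:image/png;base64,iVBORw0KGgo=">'
)

# Every image has to be dug out of the markup with string processing
# before any downstream system can even see it as an image.
DATA_URI = re.compile(r'src="data:(image/[^;]+);base64,([^"]+)"')

for mime, b64 in DATA_URI.findall(stored_html):
    raw = base64.b64decode(b64)
    print(mime, len(raw), "bytes")
```

Anything consuming the field (AIP Logic, an external app, a vision pipeline) would have to repeat this extraction itself, which is exactly the blocker described above.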
Has anyone run into this and found a clean workaround or a platform-level solution? Is there a recommended way in Foundry to handle images inserted via a rich text editor in Action Forms so they’re stored as proper references rather than inline base64? Any guidance or feature pointers would be appreciated.
Could one solution be that, in your action form, when the user saves the HTML you also save the raw text in a separate column? That separate column could then be passed on easily to other systems, AIP Logic, etc.
You’ve probably thought about this already; is there anything that speaks against it?
Thanks for the suggestion! We did consider storing a plain text version in a separate column, but the core problem with that approach is that it loses the relationship between the text and the images entirely.
Our use case is very similar to how Jira handles task descriptions — imagine someone writes “change the color of this button” and attaches a screenshot of the button they’re referring to. The image isn’t supplementary, it’s part of the meaning. Stripping it out into plain text leaves the AI or downstream system with half the context and no way to understand what the instruction is actually pointing to.
We need the text and images to be processed together as a unit, so AIP Logic (or any other system) can read the description and see the referenced image in the same pass. A separate raw text column doesn’t solve that — it just moves the problem.
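One workaround in that spirit would be to keep the text and images as a unit but stop inlining the bytes: rewrite each base64 `<img>` into a reference marker at the same position, and store the decoded images alongside. The sketch below is a hypothetical helper (the `data-ref` attribute and hash-based ids are my own invention, not a Foundry feature), assuming the editor embeds images as `data:` URIs:

```python
import base64
import hashlib
import re

DATA_URI = re.compile(r'<img[^>]*src="data:(image/[^;]+);base64,([^"]+)"[^>]*>')

def externalize_images(html: str):
    """Split stored HTML into (clean_html, attachments).

    Each inline base64 image becomes an <img data-ref="..."> marker,
    so the text still points at the exact image it refers to, while
    the bytes live in a separate attachments mapping.
    """
    attachments = {}

    def replace(match):
        mime, b64 = match.group(1), match.group(2)
        raw = base64.b64decode(b64)
        # A content hash gives a deterministic, stable id per image.
        ref = "img-" + hashlib.sha256(raw).hexdigest()[:12]
        attachments[ref] = (mime, raw)
        return f'<img data-ref="{ref}">'

    return DATA_URI.sub(replace, html), attachments

# Illustrative input with a truncated base64 payload, not a real image.
html = (
    '<p>Change the color of this button:</p>'
    '<img src="data:image/png;base64,iVBORw0KGgo=">'
)
clean, images = externalize_images(html)
print(clean)
```

Because the marker stays inline, a consumer can still read the description and the referenced image "in the same pass" by resolving `data-ref` ids against the attachments, which is what a plain text column loses.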
Ah, I understand.
I’ve never tried it myself, so I can’t speak about it with certainty, but I’ve heard of people using services like this: https://unstract.com/llmwhisperer/
It helps the LLM understand where content sits on the page, so instead of saving raw content you would save position-aware content. Hope this helps.