Custom Text Extraction for Tracked-Deletion Text in Word Documents

sm110101 · April 9, 2026, 8:33pm

Hey everyone! I’m working on a Foundry pipeline where users upload redlined Word documents (.docx), and an LLM in AIP Logic compares proposed deletions against gold-standard language stored in an ontology object. The AI reasoning side is mostly sorted — my blocker is the preprocessing step.

The Problem:
Foundry’s native text extraction and AIP’s built-in document handling don’t appear to have a way to detect and tag tracked-deletion text (strikethrough / <w:del> nodes in the underlying Word XML). I need a custom function that:

Reads the raw .docx file
Parses the underlying XML (or converts to HTML) to identify tracked deletions
Returns plain text with deleted segments specially tagged (e.g. wrapped in <strikethrough> tags) before the text ever reaches AIP Logic

What I’ve Tried:

Python Functions API — media references don’t appear to be supported yet, which is a blocker since uploaded docs come in as media references via a media set
TypeScript Functions (v1) — seems like a potential path but I’m not deeply familiar with it and haven’t validated whether media references are accessible there either

Questions:

Has anyone successfully consumed a media reference inside a Foundry Function (Python or TypeScript) to do custom binary file processing?
Is a pipeline/transform approach (async enrichment writing back to the ontology object) the more practical route here?
Any other approaches worth exploring?

Would love to hear how others have tackled this. Thanks in advance!

Joel · April 10, 2026, 6:26pm

I've never tried this myself, but I agree with the AI's suggestion pasted below that lxml should be able to extract the tracked-deletion text:

My recommendation is: use lxml plus Python’s built-in ZIP handling as the most reliable baseline, and consider docx-revisions if you want a higher-level abstraction over tracked changes. A DOCX is just a ZIP of XML files, and tracked deletions live in WordprocessingML nodes like <w:del>, so parsing word/document.xml directly is the cleanest path for preserving deletion semantics before text reaches AIP Logic. python-docx is useful for general DOCX handling, but it has no dedicated API for reading tracked changes, so it is not the right primary library for your blocker.
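To make the "ZIP of XML files" point concrete, here is a minimal sketch that lists tracked deletions with their author and date. It uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra dependencies; lxml's etree offers a largely compatible interface. The function name list_tracked_deletions is made up for illustration, and the sketch assumes the deleted text sits in <w:delText> elements under each <w:del> in word/document.xml (headers, footers, and footnotes live in other parts of the archive and are not read here).

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace in ElementTree's "{uri}" notation.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def list_tracked_deletions(docx_bytes: bytes) -> list:
    """Return one record per <w:del> element that wraps deleted text.

    Hypothetical helper: reads only word/document.xml.
    """
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))

    records = []
    for node in root.iter(W + "del"):
        # Deleted run text is stored in <w:delText>, not <w:t>.
        text = "".join(t.text or "" for t in node.iter(W + "delText"))
        if not text:
            continue  # e.g. a deleted paragraph mark, which carries no text
        records.append({
            "text": text,
            "author": node.get(W + "author"),  # None if the attribute is absent
            "date": node.get(W + "date"),
        })
    return records
```

Each record could become one row of a revision table; whether author and date are populated depends on how the document was produced.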

My point of view on architecture: a pipeline/transform approach is more practical than a Function for this use case. Foundry Python transforms are explicitly suited for non-tabular/binary file processing and let you read files from the transform filesystem in binary mode, which is exactly what you need for uploaded DOCX preprocessing. By contrast, the evidence for directly consuming media references inside Functions is much weaker here; TypeScript v2 clearly supports ontology operations and media upload patterns, but that is not the same as clean binary DOCX parsing from a media reference at request time, and TypeScript v1 has filesystem limitations that make it a poor fit.

What I would implement:

Store uploads in a Media Set and process them asynchronously in a Python transform.

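In the transform, open the DOCX as binary, unzip it, parse word/document.xml, and detect <w:del> and optionally <w:ins> nodes. Note that text inside a <w:del> is stored in <w:delText> elements rather than the usual <w:t>, so a parser that only collects <w:t> will silently drop the deleted text.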

Emit an enriched text field such as "This is <strikethrough>deleted text</strikethrough> kept text", plus a structured revision table with deleted text, author, and timestamp if present.

Write the enriched output back to a dataset or ontology-backed object that AIP Logic reads, instead of asking AIP Logic to infer deletions from raw document text.
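The parsing and tagging steps above could be sketched roughly as follows. This is not a Foundry-specific implementation: it takes the raw DOCX bytes (however the transform obtains them) and returns tagged plain text, one line per paragraph. It uses xml.etree.ElementTree from the standard library rather than lxml to stay dependency-free. The names tag_deletions and _collect are hypothetical, insertions (<w:ins>) are treated as ordinary kept text, and text boxes nested inside paragraphs, headers, footers, and footnotes are not handled specially.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from itertools import groupby

# WordprocessingML namespace in ElementTree's "{uri}" notation.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def _collect(node, in_del, segments):
    """Depth-first walk that records (is_deleted, text) pairs in document order."""
    for child in node:
        if child.tag == W + "delText":
            segments.append((True, child.text or ""))
        elif child.tag == W + "t":
            segments.append((in_del, child.text or ""))
        else:
            # Everything below a <w:del> counts as deleted.
            _collect(child, in_del or child.tag == W + "del", segments)


def tag_deletions(docx_bytes, tag="strikethrough"):
    """Return plain text with tracked deletions wrapped in <tag>...</tag>."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))

    lines = []
    for para in root.iter(W + "p"):
        segments = []
        _collect(para, False, segments)
        parts = []
        # Merge adjacent segments with the same status so each deletion gets one tag pair.
        for deleted, group in groupby(segments, key=lambda s: s[0]):
            text = "".join(t for _, t in group)
            if text:
                parts.append(f"<{tag}>{text}</{tag}>" if deleted else text)
        lines.append("".join(parts))
    return "\n".join(lines)
```

In a transform, the bytes read from the uploaded file in binary mode would be passed to tag_deletions and the result written to the enriched text column. The output is not escaped, so a document whose visible text already contains a literal <strikethrough> would be ambiguous; a less collision-prone tag may be worth choosing.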

Library choice, bluntly:

Best open-source baseline: lxml

Best convenience option to test: docx-revisions

Avoid as primary parser for this need: python-docx

Best enterprise fallback if accuracy/completeness matters and licensing is acceptable: Aspose or Spire.Doc

One important trade-off: converting DOCX to HTML may be tempting, but it can blur Word revision semantics depending on the converter. If your real requirement is “preserve exact tracked deletions,” parse the raw Word XML first and only render to tagged text afterward. That reduces ambiguity in the input and should make downstream LLM behavior more predictable.

My recommendation: build this as an async preprocessing transform, not a synchronous Function. It is the lower-risk Foundry-native design, it matches the binary-file access model better, and it keeps AIP Logic focused on reasoning instead of document forensics.