Dennis G Perry, PhD, MBA
2 May 2026

To stop LLMs from corrupting documents, the safest approach is to separate content generation from file manipulation. Let the LLM draft, revise, summarize, or restructure text, but do not let it freely rewrite the underlying .docx, .pptx, .xlsx, or PDF package unless there is a controlled process around it.
The core problem
Most “corruption” occurs because complex document files are not single plain-text files. A Word document, for example, is a structured ZIP package containing XML, relationships, media, styles, numbering, headers, footers, metadata, comments, and embedded objects. If an LLM or automation process rewrites that structure too broadly, it can break links, numbering, styles, tables, images, cross-references, or the document package itself.
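To see that package structure for yourself, a few lines of Python are enough. This is a minimal sketch using only the standard library; the function name list_package_parts is made up for illustration.

```python
import zipfile


def list_package_parts(path):
    """Return the names of the parts stored inside an OOXML package.

    Works for .docx, .pptx and .xlsx files, which are all ZIP archives.
    `path` may be a filename or an open binary file object.
    """
    with zipfile.ZipFile(path) as package:
        return package.namelist()
```

Run against a typical Word file, the result includes entries such as [Content_Types].xml, word/document.xml, and word/styles.xml. Each of those is a part that a careless rewrite can damage.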
Best practices
1. Never let the LLM directly rewrite the whole binary document
Avoid workflows where the LLM is asked to “edit this Word file” by regenerating the entire file structure from scratch. That is where the risk of corruption is highest.
Better workflow:
Extract the text.
Have the LLM revise the text.
Reinsert the revised text into the document using a deterministic tool or controlled script.
Preserve the original styles, sections, headers, footers, tables, images, and numbering.
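The extraction step can be done without modifying the file at all. The sketch below reads paragraph text from word/document.xml using only the Python standard library. The function name extract_paragraphs is hypothetical, and the approach is deliberately simple: it concatenates the text runs of every paragraph it finds (including paragraphs inside tables) and makes no attempt to handle tracked changes or fields.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def extract_paragraphs(path):
    """Return the text of each paragraph in a .docx file, read-only."""
    with zipfile.ZipFile(path) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across one or more <w:t> runs.
        text = "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))
        paragraphs.append(text)
    return paragraphs
```

The list index of each paragraph gives the LLM and the reinsertion tool a shared address for the text. Writing the revised text back is best left to a dedicated document library rather than to hand-edited XML, since re-serializing the XML carelessly is itself a way to break the package.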
2. Use a copy, never the original
Before using an LLM-assisted workflow, create:
Original document
Working copy
Output copy
Keep the original untouched. Every AI edit should be reversible.
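One way to enforce this is to script the copy and record a checksum of the original, so you can later prove it was never modified. This is a sketch under assumptions: the function names and the ".working" naming scheme are illustrative choices, not a standard.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def make_working_copy(original, work_dir):
    """Copy `original` into `work_dir` and return (copy_path, original_digest)."""
    original = Path(original)
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: report.docx -> report.working.docx
    working = work_dir / f"{original.stem}.working{original.suffix}"
    shutil.copy2(original, working)
    return working, sha256_of(original)
```

At the end of the workflow, comparing sha256_of(original) with the recorded digest confirms that every AI edit landed on the copy.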
3. Use tracked changes or patch-based editing
The risky model is:
“Rewrite this whole document.”
The safer model is:
“Change paragraph 4 to this.”
“Replace this heading with that.”
“Insert this section after Section 2.”
“Do not modify tables, figures, captions, styles, headers, footers, or references.”
Patch-based edits reduce the chance of unintended structural damage.
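A patch can be as simple as "paragraph N currently says X; replace it with Y". The sketch below applies such patches to a list of paragraph strings and refuses any patch whose expected text does not match, which catches edits aimed at the wrong place. The Patch class and apply_patches function are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Patch:
    index: int        # zero-based paragraph position
    expected: str     # text the patch believes is currently there
    replacement: str  # text to put in its place


def apply_patches(paragraphs, patches):
    """Return a new paragraph list with the patches applied.

    Raises IndexError for an out-of-range index and ValueError when the
    current text differs from what the patch expected.
    """
    result = list(paragraphs)  # never mutate the caller's list
    for patch in patches:
        if not 0 <= patch.index < len(result):
            raise IndexError(f"no paragraph at index {patch.index}")
        if result[patch.index] != patch.expected:
            raise ValueError(f"paragraph {patch.index} does not match expected text")
        result[patch.index] = patch.replacement
    return result
```

Because every untouched paragraph is carried over unchanged, the blast radius of a bad LLM response is limited to the paragraphs it explicitly names.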
4. Lock down document structure
For Word documents, protect or preserve:
Styles
Section breaks
Headers and footers
Footnotes and endnotes
Tables
Captions
Cross-references
Table of contents fields
Numbered lists
Images and anchors
Bibliography/reference fields
If the LLM is only revising prose, it should not touch those elements.
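You can check that rule mechanically by comparing the package parts before and after the edit. The sketch below assumes a prose-only edit should change nothing except word/document.xml; the function names and the default allow-list are illustrative assumptions.

```python
import hashlib
import zipfile


def part_digests(path):
    """Map each part name in the package to the SHA-256 of its contents."""
    with zipfile.ZipFile(path) as package:
        return {
            name: hashlib.sha256(package.read(name)).hexdigest()
            for name in package.namelist()
        }


def unexpected_changes(before, after, allowed=("word/document.xml",)):
    """Return sorted names of parts that were added, removed, or altered
    and are not on the allow-list."""
    old, new = part_digests(before), part_digests(after)
    changed = {
        name for name in old.keys() | new.keys()
        if old.get(name) != new.get(name)
    }
    return sorted(changed - set(allowed))
```

An empty result means styles, numbering, headers, footers, and media were left byte-for-byte alone. Run the check on the tool's output before re-saving in Word, because a save in Word can legitimately rewrite other parts, such as the document properties.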
5. Avoid round-tripping through Markdown unless layout is simple
Markdown is useful for drafting, but it can destroy document fidelity when converted back to Word or PDF. It often loses:
Precise spacing
Custom styles
Complex tables
Image placement
Footnotes
Numbered heading hierarchy
Captions
Cross-references
Page breaks
Use Markdown only for simple drafts, not for polished legal, technical, proposal, patent, or formatted business documents.
6. Use templates
Create a clean Word template with approved:
Fonts
Margins
Heading styles
Caption styles
Table styles
Reference style
Footer/header layout
Then insert AI-generated text into that template. Do not let the AI invent formatting each time.
7. Validate the file after generation
After an AI-assisted document is produced, open and inspect it in the native application.
For Word, check:
Does the file open without repair warnings?
Are headings correct?
Are tables intact?
Are images still anchored properly?
Does the table of contents update?
Are references still present?
Are page breaks preserved?
Are numbered lists still sequential?
Then save it once manually from Word. Word rewrites the package when it saves, which often normalizes minor inconsistencies left by other tools.
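Some of this checking can be automated as a pre-filter before the manual inspection. The sketch below, using only the Python standard library, tests that the file is a readable ZIP, that two parts every .docx needs are present, and that each XML part is well-formed. The function name check_docx and the short required-parts list are assumptions for illustration, and passing these checks does not guarantee that Word will open the file without a repair prompt.

```python
import zipfile
import xml.etree.ElementTree as ET

# Minimal set of parts a .docx is expected to contain (illustrative, not exhaustive).
REQUIRED_PARTS = ("[Content_Types].xml", "word/document.xml")


def check_docx(path):
    """Return a list of problems found; an empty list means the basic checks passed."""
    problems = []
    try:
        with zipfile.ZipFile(path) as package:
            bad = package.testzip()  # first member with a bad CRC, or None
            if bad is not None:
                problems.append(f"CRC failure in {bad}")
            names = set(package.namelist())
            for part in REQUIRED_PARTS:
                if part not in names:
                    problems.append(f"missing part: {part}")
            for name in sorted(names):
                if name.endswith((".xml", ".rels")):
                    try:
                        ET.fromstring(package.read(name))
                    except ET.ParseError as exc:
                        problems.append(f"malformed XML in {name}: {exc}")
    except zipfile.BadZipFile as exc:
        problems.append(f"not a valid ZIP package: {exc}")
    return problems
```

A file that fails here is certainly broken; a file that passes still needs the visual checks listed above.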
8. Use PDF only as final output
Do not use PDF as the editing source unless there is no alternative. PDFs are display-oriented, not editing-oriented. Extracting and regenerating content from PDF often causes layout errors.
Preferred sequence:
Word source → controlled edits → Word review → PDF export
Not:
PDF → AI extraction → regenerated Word → PDF
9. Give the LLM strict boundaries
Use instructions like this:
“Revise only the body text. Do not change headings, numbering, tables, figures, captions, citations, page breaks, headers, footers, or styles. Return only the replacement text for the specified section.”
Or:
“Identify recommended edits, but do not rewrite the document. Provide a change list with section names, original text, revised text, and rationale.”
10. Use structured intermediate formats
For high-value documents, use a controlled editing format such as:
JSON change list
CSV edit register
XML patch
Tracked-change table
Section-by-section replacement text
This gives you auditability and prevents the AI from making invisible structural changes.
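As an example of the first option, you can ask the LLM to return a JSON array of changes and reject anything that does not fit the agreed shape before it gets near the document. The field names below (section, original, revised, rationale) mirror the change-list prompt in item 9; the function name parse_change_list and the strictness rules are assumptions for this sketch.

```python
import json

# Every change entry must have exactly these string fields.
REQUIRED_FIELDS = {"section", "original", "revised", "rationale"}


def parse_change_list(raw):
    """Parse and validate an LLM-produced JSON change list.

    Returns the list of change dicts, or raises ValueError if the text is
    not valid JSON or any entry has missing, extra, or non-string fields.
    """
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, list):
        raise ValueError("change list must be a JSON array")
    for i, entry in enumerate(data):
        if not isinstance(entry, dict):
            raise ValueError(f"entry {i} is not an object")
        missing = REQUIRED_FIELDS - entry.keys()
        extra = entry.keys() - REQUIRED_FIELDS
        if missing or extra:
            raise ValueError(
                f"entry {i}: missing {sorted(missing)}, unexpected {sorted(extra)}"
            )
        if not all(isinstance(entry[key], str) for key in REQUIRED_FIELDS):
            raise ValueError(f"entry {i}: all fields must be strings")
    return data
```

The validated list doubles as the audit trail: it can be stored next to the output copy so a reviewer can see exactly what was changed and why.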
A reliable workflow
Use this process:
- Keep the original document untouched.
- Extract only the target text.
- Ask the LLM to revise that text.
- Review the revised text.
- Insert it into a copy of the document.
- Preserve the original formatting and styles.
- Validate the file in Word, PowerPoint, Excel, or Acrobat.
- Export final PDF only after review.
For formal documents
For formal Word documents with IEEE references, headings, tables, and graphics, the safest instruction is:
“Use the existing Word document as a template. Preserve all styles, headings, numbering, tables, captions, references, headers, footers, and page layout. Replace only the specified prose sections. Do not regenerate the entire document from scratch.”
That one instruction prevents many failures.
Bottom line
LLMs corrupt documents when they are allowed to behave as file editors rather than content editors. The solution is to make the LLM produce controlled textual changes, while deterministic document tools handle the actual file structure.