iTagPDF: Towards Finally Automating PDF Accessibility

Peya Mowar, Aaron Steinfeld, Jeffrey P. Bigham · 2026 · Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI '26) · doi:10.1145/3772318.3790289

Summary

This CHI 2026 paper from Mowar, Steinfeld, and Bigham (Carnegie Mellon) presents iTagPDF, an automated system that tags academic research PDFs so that screen reader users can access them. The authors argue that PDF accessibility has remained a problem for over a decade because existing remediation tools (Adobe Acrobat, PAVE, Ally) work purely from the visual rendering of the PDF: they try to reconstruct semantic structure from pixels and bounding boxes, producing cluttered, error-prone tags that require tedious manual correction. Their central insight is that the semantic information needed for good tagging already exists in the original authoring source (e.g., LaTeX) but is discarded during PDF rendering. The authors present iTagPDF as the first system to combine both representations: it runs a fine-tuned YOLOv11 object detector and Tesseract OCR on the visual PDF, parses the LaTeX source into semantic "chunks," and then uses a GPT-4o agent to align the two views. The aligned output is used to produce accessibility tags (H1–H3, P, Figure, Table, List, Caption, Formula, Author, BibEntry, Artifact, Note, TH, TD, LI), correct the reading order, and embed content-specific metadata (table structure, figure alt text, raw LaTeX as alt text for formulas) back into the PDF via the iText Core SDK. The authors also contribute a new evaluation dataset of 40 academic PDFs (20 from CHI and 20 from ASSETS, totaling 554 pages and 10,000+ tagged segments) with custom accessibility-oriented metrics: Box Error Rate, Classification Error Rate, and Reading Order Error Rate.
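To make the alignment step concrete, here is a minimal sketch of pairing detected visual regions with LaTeX source chunks and deriving a tag for each region. It is an illustration only, not the authors' implementation: iTagPDF uses a GPT-4o agent for alignment, whereas this sketch swaps in plain fuzzy string matching from Python's standard library. The names VisualRegion, SourceChunk, CHUNK_KIND_TO_TAG, align, and the 0.6 threshold are all hypothetical. The fallback to Artifact for unmatched regions mirrors the behavior noted in the key findings.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher

# Hypothetical mapping from LaTeX chunk kinds to tag names from the summary.
CHUNK_KIND_TO_TAG = {
    "section": "H1",
    "subsection": "H2",
    "subsubsection": "H3",
    "paragraph": "P",
    "figure": "Figure",
    "table": "Table",
    "equation": "Formula",
    "caption": "Caption",
}


@dataclass
class VisualRegion:
    """A region found on the rendered page (stand-in for detector + OCR output)."""
    bbox: tuple      # (x0, y0, x1, y1) in page coordinates
    ocr_text: str


@dataclass
class SourceChunk:
    """A semantic unit parsed from the authoring source."""
    kind: str        # e.g. "section", "paragraph"
    text: str


def similarity(a: str, b: str) -> float:
    """Case-insensitive fuzzy similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def align(regions, chunks, threshold=0.6):
    """Greedily match each region to its most similar unused chunk.

    Assumption: string similarity replaces the LLM agent used in the paper.
    Regions with no sufficiently similar chunk are tagged as Artifact.
    """
    used = set()
    tagged = []
    for region in regions:
        best_idx, best_score = None, 0.0
        for i, chunk in enumerate(chunks):
            if i in used:
                continue
            score = similarity(region.ocr_text, chunk.text)
            if score > best_score:
                best_idx, best_score = i, score
        if best_idx is not None and best_score >= threshold:
            used.add(best_idx)
            tag = CHUNK_KIND_TO_TAG.get(chunks[best_idx].kind, "P")
            source_index = best_idx   # position in source gives reading order
        else:
            tag, source_index = "Artifact", None
        tagged.append({"bbox": region.bbox, "tag": tag, "source_index": source_index})
    return tagged
```

In this sketch, the source_index of each matched region could be used to sort regions into reading order, since the source already lists content in the intended sequence.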

Key findings

On the 40-paper evaluation dataset, iTagPDF achieved 95.28% bounding box accuracy, 95.96% classification accuracy, and 97.26% reading order accuracy. On a 25-paper subset that compared it against Adobe Acrobat's auto-tagging and the authors' own submitted PDFs, iTagPDF's normalized score was 95.19%, versus 87.03% for Acrobat and 81.53% for the authors. iTagPDF surpassed human authors on reading order (96.42% vs 88.62%), headings (91.58% vs 71.92%), lists with structure (92.30% vs 51.51%), and captions (82.92% vs 0%: neither Acrobat nor authors tagged captions at all, and authors submitted zero caption tags across both the CHI and ASSETS corpora). It was also the first open, reproducible method to tag table structure (TH/TD roles) automatically. Tagging was consistent across runs (normalized accuracy 94.06%, σ = 0.80) and generalized reasonably to venues outside the CHI and ASSETS evaluation set, such as ICASSP, NeurIPS, and Nature Physics, though <Author>, <Caption>, and <Formula> detection was weaker there. Failure modes included merging figures with their captions, mistakenly tagging non-English (Japanese) paragraphs as artifacts, and emitting redundant <Artifact> tags when no source chunk mapped to a region. Strikingly, Adobe Acrobat produced up to 10× more tags than needed by inflating the document with empty <Span> tags.
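The summary names the three metrics but does not give their formulas, so the following is only a plausible sketch of how such error rates could be computed, with each accuracy figure read as 100 × (1 − error rate). Every definition here is an assumption rather than the paper's: box error is taken as the share of reference boxes with no predicted box at IoU ≥ 0.5, classification error as the share of segments with a wrong tag, and reading order error as the share of segment pairs placed in the wrong relative order.

```python
from itertools import combinations


def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def box_error_rate(predicted, reference, threshold=0.5):
    """Assumed definition: share of reference boxes no prediction overlaps well."""
    if not reference:
        return 0.0
    missed = sum(all(iou(p, r) < threshold for p in predicted) for r in reference)
    return missed / len(reference)


def classification_error_rate(predicted_tags, reference_tags):
    """Assumed definition: share of aligned segments whose tag differs."""
    if len(predicted_tags) != len(reference_tags):
        raise ValueError("tag lists must be aligned segment by segment")
    if not reference_tags:
        return 0.0
    wrong = sum(p != r for p, r in zip(predicted_tags, reference_tags))
    return wrong / len(reference_tags)


def reading_order_error_rate(predicted_order, reference_order):
    """Assumed definition: share of segment pairs in the wrong relative order."""
    if set(predicted_order) != set(reference_order):
        raise ValueError("both orders must contain the same segment ids")
    position = {seg: i for i, seg in enumerate(predicted_order)}
    pairs = list(combinations(reference_order, 2))
    if not pairs:
        return 0.0
    inverted = sum(position[a] > position[b] for a, b in pairs)
    return inverted / len(pairs)


def accuracy_percent(error_rate):
    """Convert an error rate in [0, 1] to an accuracy percentage."""
    return 100.0 * (1.0 - error_rate)
```

A pairwise measure is used for reading order because a single misplaced segment then costs only the pairs it disturbs, instead of marking every later segment as wrong.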

Relevance

For accessibility practitioners and scholarly publishers, this work reframes the decade-long PDF accessibility problem: instead of "remediation" (reconstructing structure from pixels), the authors advocate "preservation," that is, keeping the semantic structure that already exists at authoring time. Practical takeaways: (1) current author-submitted accessibility tagging at top HCI venues is poor, even at ASSETS, so conferences cannot rely on author compliance alone; (2) automated tagging that combines source and visual representations is now accurate enough to exceed both Acrobat and manual author tagging on most criteria; (3) publishers with access to LaTeX source (arXiv, ACM Digital Library) could realistically run a tool like iTagPDF as a production step. Limitations: iTagPDF currently requires LaTeX source, so Word/InDesign workflows and older PDFs are out of scope; figure alt text is still LLM-generated and needs author review; and the system relies on a paid LLM (GPT-4o). The work also raises policy questions about when automation should replace author accountability.

Tags: PDF accessibility · tagged PDF · PDF remediation · automated accessibility · document layout analysis · LLM · vision-language models · LaTeX · alt text · reading order · scholarly publishing · automated testing

Standards referenced: PDF/UA · ACM Accessibility Guidelines