Measuring Text Simplification with the Crowd

Walter S. Lasecki, Luz Rello, Jeffrey P. Bigham · 2015 · Proceedings of the 12th International Web for All Conference (W4A) · doi:10.1145/2745555.2746658

Summary

This paper investigates whether non-expert crowd workers can reliably evaluate the simplicity of English text, addressing a key gap in text simplification research. Text simplification, which reduces the lexical and syntactic complexity of sentences while preserving their meaning, is critical for people with cognitive impairments, dyslexia, Down syndrome, autism, aphasia, or low literacy skills. Evaluating how well text has been simplified remains challenging, however: automated readability measures rely on surface features such as word and sentence length without understanding meaning, while expert human evaluators are expensive and often disagree with one another (inter-annotator agreement scores in prior work ranged from just 0.33 to 0.69, well below the 0.8 threshold considered reliable in computational linguistics).

The researchers created a dataset of 60 sentences drawn from the U.S. Federal Plain Language Guidelines, legal-genre text of the kind commonly found in official documents. Each sentence was incrementally simplified by applying one, two, or three Plain Language rules (such as using active voice, using pronouns to address readers directly, and omitting unnecessary words), creating five versions at different simplification levels. They recruited 250 workers from Amazon Mechanical Turk, who provided 2,500 individual ratings on a 7-point Likert scale of how simple each sentence was to read.

Key findings

Crowd workers could accurately distinguish between different levels of text simplification. When sentences underwent substantial changes (more than 10 word-level edits), crowd ratings showed a statistically significant 76.5% increase in perceived simplicity as more rules were applied (p<.05). Sentences with minor changes showed no significant difference (6.6% increase, p>.05), confirming that workers were responding to actual simplification rather than rating at random. The study found that approximately 25 workers (50% of the 50-worker pool) were enough to produce reliable results, with ratings converging monotonically toward a stable consensus as more responses were added. Even sampling just 25% of workers yielded results within 11.5% of the final collective answer. The crowd ratings aligned with established readability research showing that shorter sentences and shorter words are more readable for people with cognitive disabilities. The cost was approximately $0.50 per sentence rated, with workers completing tasks in under 90 seconds. This is cheaper and faster than expert evaluation, and the work can be parallelized across many sentences at once.
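
The convergence finding can be illustrated with a small subsampling experiment. The following is a minimal sketch, not the authors' analysis code: the function name subsample_deviation, the number of trials, and the ratings list are all assumptions made for illustration, and the ratings are made-up placeholders rather than data from the paper. It draws random subsets of a worker pool and reports how far, on average, the subset mean falls from the full-pool mean.

```python
import random
from statistics import mean

def subsample_deviation(ratings, fraction, trials=1000, seed=0):
    """Average relative gap between a random subsample's mean rating
    and the mean of the full worker pool (hypothetical helper)."""
    rng = random.Random(seed)
    full_mean = mean(ratings)
    # Number of workers to draw; always at least one.
    k = max(1, round(len(ratings) * fraction))
    gaps = []
    for _ in range(trials):
        sample = rng.sample(ratings, k)  # draw k workers without replacement
        gaps.append(abs(mean(sample) - full_mean) / full_mean)
    return mean(gaps)

# Placeholder 7-point Likert ratings for one sentence version.
# These values are invented for illustration, NOT taken from the paper.
ratings = [5, 6, 4, 5, 7, 5, 6, 3, 5, 6] * 5  # 50 hypothetical workers

for fraction in (0.25, 0.5, 1.0):
    gap = subsample_deviation(ratings, fraction)
    print(f"{fraction:.0%} of workers: mean gap {gap:.2%}")
```

With real ratings, the gap reported for each fraction is what would be compared against a tolerance such as the 11.5% figure above; the numbers this sketch prints say nothing about the study itself.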

Relevance

This research has direct implications for how accessibility practitioners evaluate content simplification efforts. Rather than relying solely on automated readability formulas (Flesch-Kincaid, Fog Index, SMOG) that miss nuances of meaning and comprehension, crowdsourced evaluation offers a human-centered alternative that captures whether simplification changes actually make text feel simpler to readers. The finding that non-expert crowd workers can reliably measure simplification levels is particularly valuable for organizations working to meet plain language requirements or WCAG cognitive accessibility guidelines. The study also highlights the limitations of binary "simplified/not simplified" classifications, showing that fine-grained simplicity ratings provide more useful feedback for iterating on content. One limitation is that the study tested only legal-domain text, and results may not generalize across all content types. The researchers also did not measure actual comprehension, only perceived simplicity.
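
To make the point about surface features concrete, here is a rough sketch of the Flesch-Kincaid grade level formula. The syllable counter is a crude vowel-run heuristic chosen as an assumption for brevity, the function names are hypothetical, and the example sentences are invented rather than taken from the study's dataset. Because the formula sees only word, sentence, and syllable counts, a scrambled sentence receives exactly the same score as the original.

```python
import re

def count_syllables(word):
    """Crude heuristic (an assumption): count runs of vowels, minimum one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    """Flesch-Kincaid grade level from word, sentence, and syllable counts."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)

# Invented example sentences, not from the paper's dataset.
original = ("Applicants must submit the documentation prior to the "
            "commencement of the evaluation period.")
simplified = "You must send us your papers before the review starts."
# Same words as `original` in reverse order: unreadable, but identical counts.
scrambled = " ".join(reversed(original.rstrip(".").split())) + "."

for label, text in [("original", original),
                    ("simplified", simplified),
                    ("scrambled", scrambled)]:
    print(label, round(flesch_kincaid_grade(text), 2))
```

The scrambled sentence matching the original's score is the kind of blind spot that human ratings, whether from experts or from the crowd, are meant to cover.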

Tags: text simplification · crowdsourcing · natural language processing · readability · cognitive accessibility · plain language

Standards referenced: Plain Language Guidelines