A Comparison of Features for Automatic Readability Assessment

Lijun Feng, Martin Jansche, Matt Huenerfauth, Noémie Elhadad · 2010 · Proceedings of the 23rd International Conference on Computational Linguistics (COLING '10), Posters · doi:10.5555/1944566.1944598

Summary

Feng et al. (2010) compare a wide range of text features for automatically predicting the grade level of reading material aimed at primary-school students. The motivation is twofold. Intrinsically, traditional readability formulas such as Flesch-Kincaid, Gunning FOG, and SMOG rely on shallow counts of syllables and sentence length and perform poorly on realistic corpora. Extrinsically, robust grade-level prediction is a building block for automatic text simplification systems that help readers with limited literacy, including people with intellectual disabilities, second-language learners, and children.

The authors treat readability as a multi-class classification task over a corpus of 1,433 Weekly Reader magazine articles labelled for grades 2 through 5. They systematically evaluate five families of features: (1) shallow features such as average sentence length, Flesch-Kincaid score, and Chall-Dale difficult-word rate; (2) language-modelling features based on perplexity scores from unigram through 5-gram models trained on text, POS tags, and word/POS pairs; (3) parsed syntactic features from the Charniak parser, including counts of noun phrases, verb phrases, prepositional phrases, and SBARs (subordinate clauses); (4) part-of-speech features covering nouns, verbs, adjectives, adverbs, and prepositions as content or function words; and (5) discourse features comprising entity density, lexical chains, coreference inference, and Barzilay-Lapata entity grids. Each feature set is evaluated with LIBSVM and Weka's logistic regression under 10-fold cross-validation, and combinations of feature sets are explored through group-wise greedy feature selection.
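
The shallow features in family (1) are the easiest to reproduce. The sketch below computes average sentence length and the standard Flesch-Kincaid grade level (0.39 × words per sentence + 11.8 × syllables per word − 15.59) for a plain-text passage. It is an illustration only, not the authors' implementation: the function names, the regex-based sentence and word splitting, and the vowel-group syllable counter are all simplifying assumptions.

```python
import re


def count_syllables(word: str) -> int:
    """Crude syllable estimate: one syllable per run of vowels.

    This heuristic is an assumption made for the sketch; it is not the
    syllable counter used in the paper.
    """
    vowel_groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(vowel_groups))


def shallow_features(text: str) -> dict:
    """Return a few shallow readability features for a passage."""
    # Naive sentence split on terminal punctuation (assumption).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Words are runs of letters, optionally with one internal apostrophe.
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")

    syllables = sum(count_syllables(w) for w in words)
    avg_sentence_length = len(words) / len(sentences)
    avg_syllables_per_word = syllables / len(words)

    # Standard Flesch-Kincaid grade-level formula.
    fk_grade = 0.39 * avg_sentence_length + 11.8 * avg_syllables_per_word - 15.59

    return {
        "avg_sentence_length": avg_sentence_length,
        "avg_syllables_per_word": avg_syllables_per_word,
        "flesch_kincaid_grade": fk_grade,
    }
```

Returning the components alongside the formula's output mirrors the distinction the summary draws: the same counts can be read off as a fixed grade estimate or passed to a classifier as separate features.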

Key findings

The paper's headline result is a classification accuracy of about 74% from a combination of features selected by group-wise add-one-best greedy search, roughly 11 percentage points above the 63% state of the art reported in prior work. Language-modelling features, especially those trained directly on the Weekly Reader corpus rather than on external corpora, had the highest individual predictive power (68.38% with LIBSVM). Among discourse features, entity density was the strongest single family (59.63%), outperforming lexical chain, coreference inference, and entity grid features; combining discourse subsets gave no benefit over entity density alone. Among POS features, nouns were the most predictive word class (58.15%), followed by prepositions. Among parsed syntactic features, verb-phrase and noun-phrase counts performed on par with their word-level POS counterparts. A striking negative result: used as a fixed formula, Flesch-Kincaid correctly predicted the grade of only 20 of 1,433 texts (1.4% accuracy), although treating its components as features in a logistic-regression model yielded above 50% accuracy. Among shallow features, average sentence length dominated: it was cheaper to compute and more useful than most syntactic features. Entity-density and POS-noun features were highly correlated, suggesting that the predictive signal of entity density is largely captured by noun counts.
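
The group-wise add-one-best greedy search behind the headline result can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: `greedy_group_selection` and the `score` callable are hypothetical names, `score` is assumed to return something like cross-validated accuracy for a list of feature names, and stopping as soon as no remaining group improves the score is an assumed stopping rule.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def greedy_group_selection(
    groups: Dict[str, Sequence[str]],
    score: Callable[[List[str]], float],
) -> Tuple[List[str], float]:
    """Add whole feature groups one at a time, keeping the best addition.

    groups maps a group name (e.g. "pos") to its feature names.
    score evaluates a flat list of feature names; higher is better.
    Returns the selected group names in order and the final score.
    """
    selected: List[str] = []
    best_score = float("-inf")
    remaining = set(groups)

    while remaining:
        # Score every candidate: the current selection plus one new group.
        candidates = []
        for name in sorted(remaining):  # sorted for deterministic tie-breaks
            features = [f for g in selected + [name] for f in groups[g]]
            candidates.append((score(features), name))

        cand_score, cand_name = max(candidates, key=lambda c: c[0])

        # Assumed stopping rule: quit when the best addition does not help.
        if cand_score <= best_score:
            break

        selected.append(cand_name)
        remaining.remove(cand_name)
        best_score = cand_score

    return selected, best_score
```

In practice `score` would wrap a classifier and a cross-validation loop. Selecting at the level of groups rather than individual features keeps the number of evaluations small, and it lets a redundant family (for example, one that is highly correlated with a family already selected) be left out as a whole.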

Relevance

For accessibility practitioners, this paper is a foundational reference for understanding how automatic readability assessment works and why off-the-shelf readability formulas often mislead. It matters for cognitive accessibility, plain-language programmes, and text-simplification pipelines that target readers with intellectual disabilities or low literacy: practitioners should not trust a single Flesch-Kincaid score as ground truth, and should recognise that average sentence length is a cheap and competitive metric in its own right. The paper also supports investing in NLP-based readability pipelines over heuristic formulas when graded difficulty decisions drive content adaptation (e.g., serving different versions of the same material to different readers). Limitations include a narrow grade 2 to 5 corpus from a single publisher, a focus on generic primary-school text rather than specific accessibility use cases, and an evaluation that measures classifier accuracy rather than the downstream experience of readers who rely on simplified content.