A Preview of my Readability Library
Written by DrNI
on Sunday, 14 December 2008
at 12:56
in Computational Linguistics, English Posts
Is there anything worse than putting off studying for your MA oral exam by writing Java code for your MA thesis? Anyway, here is a little preview of a Java library of readability measures. I'm planning to write a longer article about the sense and senselessness of readability measures for my CL Blog. To cut a long story short: readability measures (or rather, readability algorithms) take a text, split it into sentences, words and syllables, and apply some weird formula to the counts. In the end, you get a figure saying how easy or difficult the text is supposed to be to read (or understand). One of the most prominent measures is the Flesch-Kincaid Readability Test, which is supposed to say how many years of US education one needs in order to understand the given text.
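For the curious, here is a minimal sketch of the Flesch-Kincaid grade level as it is usually published: 0.39 times the average words per sentence, plus 11.8 times the average syllables per word, minus 15.59. The class and method names are made up for this post and are not the API of my library; the counts are assumed to come from whatever tokenizer and syllable counter you trust.

```java
// Hypothetical helper for illustration only, not part of the library.
final class FleschKincaid {

    /**
     * Flesch-Kincaid grade level from raw counts:
     * 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
     */
    static double gradeLevel(int words, int sentences, int syllables) {
        if (words <= 0 || sentences <= 0) {
            throw new IllegalArgumentException("need at least one word and one sentence");
        }
        double wordsPerSentence = (double) words / sentences;
        double syllablesPerWord = (double) syllables / words;
        return 0.39 * wordsPerSentence + 11.8 * syllablesPerWord - 15.59;
    }

    public static void main(String[] args) {
        // Same words and syllables, different sentence counts.
        System.out.println(gradeLevel(100, 5, 150));
        System.out.println(gradeLevel(100, 2, 150));
    }
}
```

With 100 words and 150 syllables, five sentences give a grade of about 9.9, while two sentences give about 21.6. The sentence count sits in a denominator, so getting it wrong moves the result a lot.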
Let's have a look at the screenshot of my demo. First of all, be aware that some or all of the measures might be wrong. As one can see, the given text supposedly takes almost 14 years of school education to understand. The text I took is a pirate story for kids from Neopedia, which some of my fellow students might know well because they are currently suffering through a named entity annotation task on that text. So why is this fairy-tale-like story so hard to read? A comparison with the output of this online tool revealed that the sentence counter I'm using cannot deal with the quotation marks used in direct speech, and the text contains lots of it. The Flesch-Kincaid formula punishes documents for long sentences: for the same number of words, the fewer sentences are counted, the longer the average sentence appears and the higher the score goes.
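To illustrate the kind of problem direct speech can cause, here is a rough sketch of two regex-based sentence counters. It is only a guess at one plausible failure mode, not a description of what Java Fathom actually does: the "naive" counter assumes a sentence ends when a terminator is directly followed by whitespace, so it misses sentences that end inside quotation marks. All names here are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch for illustration; not the code of Java Fathom or of my library.
final class SentenceCounter {

    // Terminator(s) directly followed by whitespace or end of text.
    private static final Pattern NAIVE_END =
            Pattern.compile("[.!?]+(?=\\s|$)");

    // Terminator(s), then optional closing quotes or brackets,
    // then whitespace or end of text.
    private static final Pattern QUOTE_AWARE_END =
            Pattern.compile("[.!?]+[\"'\u201D\u2019)\\]]*(?=\\s|$)");

    static int countNaive(String text) {
        return countMatches(NAIVE_END, text);
    }

    static int countSentences(String text) {
        return countMatches(QUOTE_AWARE_END, text);
    }

    // Counts non-overlapping matches; text without any terminator yields 0.
    private static int countMatches(Pattern pattern, String text) {
        Matcher matcher = pattern.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "\"Where is the treasure?\" The pirate looked around. \"Nobody knows.\"";
        System.out.println(countNaive(text));
        System.out.println(countSentences(text));
    }
}
```

On the sample text, the naive counter only finds the sentence in the middle, while the quote-aware one finds all three. Neither is a real solution: abbreviations such as "e.g." are still counted as sentence ends, and something like "Arr!" said the pirate. is counted as two sentences.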
The sentence counting part is currently based on Java Fathom, a port of Perl's Lingua::EN::Fathom module. The syllable counter also comes from the Fathom port. Apart from that, Java Fathom has a bug that prevents it from working at all. I contacted the maintainer, who keeps reacting with silence. So in order to be able to publish this library, I need to reinvent some wheels myself, because other people messed things up. (This is what usually happens when computer scientists try to do something with language.)
As some of my readers may have noticed, I reactivated the Computational Linguistics category here. I consider it to be the CL blather dump from now on. After all, this post isn't enough of a post for my CL Blog.
Stay tuned on both blogs. If everything works out as I hope, I'll pass the exam next week and publish the open-sourced readability library sometime in January.




