Niels Ott

Computational Linguist

Hohenheimer Verständlichkeitsindex

Saturday, September 26, 2009 • Category: Automatic Mind


Intro

Tomorrow we are going to have elections to the Bundestag here in Germany. Democracies come in all kinds of flavors. In the German flavor, the Bundestag is the parliament superordinate to all federal states (Bundesländer). The Bundestag elections are the most important ones in Germany. Traditionally, parties present manifestos. Wikipedia does not miss the point by claiming that »in recent decades the status of electoral manifestos has diminished somewhat due to a significant tendency for winning parties to, following the election, either ignore, indefinitely delay, or even outright reject manifesto policies which were popular with the public.« Manifestos are promises that are seldom kept. Even worse, they do not even seem to be written for a public audience. At least this is what the results of an investigation based on content analysis and readability measures suggest. In this blog post I try to show how researchers from Hohenheim University and a private company in Ulm are using readability to assess the understandability of manifestos.

Electoral Manifestos, Quantitatively…

So first of all, what were these results mentioned in the intro? Let me quickly summarize what the researchers working with Professor Frank Brettschneider from Hohenheim University are presenting online together with CommunicationLab, a company from Ulm. The results the media are fascinated by are primarily the ones shown in this figure:

Electoral Manifestos of German parties as scored by the Hohenheim Comprehensibility Index.
Source: Ergebnis-Präsentation Wahlprogramm-Check 2009, translated into English by Niels Ott.

They defined a Hohenheim Comprehensibility Index (Hohenheimer Verständlichkeitsindex). The figure shows how well the electoral manifestos of the major German parties perform on that index. These figures are the outcome of the quantitative analysis conducted by the researchers. The inherent dream of all readability research lies in these results: one single and simple figure for scoring a text on a scale. It is somewhat unfortunate that all manifestos are rather hard to understand. The manifesto of Die Linke (the left-wing party) in particular is quite close to a doctoral dissertation in political science, which is certainly not the appropriate level of difficulty for reaching the masses. Interestingly, the winner in terms of this index, Die Grünen (The Greens), presents a manifesto that is too long to read (about 52,000 words), while the loser, Die Linke, comes up with the shortest one at about 20,000 words.

Conducting such an analysis earned the researchers a remarkable amount of media attention, considering that this is only a single study. Various online sites of renowned newspapers and blogs added their interpretations depending on political opinions and tastes. (A small selection in German: Welt Online, Zeit Online, Focus Online, TV report by ZDF)

… and Qualitatively

The Hohenheim people and those from CommunicationLab are well aware that there is more to text difficulty than counting the occurrences of traditional indicators. Having conducted a manual examination by experts, they claim that they can sustain the quantitative results. Briefly summarized, the manifestos have been written using too much technical language and uncommon foreign words without explanation. Some criteria exceed the scope of pure readability: the one they call Wording seems to punish texts for inappropriate wording, and another considers good and bad text layout.

Yet another Approach to approaching Language

Communication studies as an academic field looks at language from a different angle than (computational) linguistics does. Yet in the end, language remains language. Until a few weeks ago I did not even know of the existence of this field. Interestingly enough, the methodology of content analysis comes in a flavor called computational content analysis that uses some of the same technologies as computational linguists use and develop. CommunicationLab in Ulm therefore seems to be a spin-off company from this field that just accidentally deals with things I am also doing as a computational linguist, namely readability and automatic assessment of text difficulty.

Computational content analysis will be a further topic for me to explore in the future. In the past I encountered a number of ways of dealing with language. The purist computer scientist may say that information is a sequence of symbols from a defined alphabet. In information retrieval, there is not much linguistics behind the processing. In computational linguistics, some people make things so linguistically precise that nothing works out in the end. Where on the scale between naivety and sophistication concerning language will content analysis turn out to be placed?

So how does the Hohenheimer Verständlichkeitsindex work?

According to personal communication with Jan Kercher from Hohenheim University, their formula is still a work in progress. They are in a phase of tuning and testing it, aiming to publish scientific work about it. For now they only reveal the ingredients and the general procedure. There is no full recipe for baking that cake yet. So here we go with their list of variables used in the Hohenheimer Verständlichkeitsindex:

  • Readability measures:
    • Amstad Formula
    • Wiener Sachtext Formula
    • Simple Measure of Gobbledygook (SMOG)
    • Läsbarhetsindex (LIX)
  • Other parameters:
    • Average sentence length
    • Average word length
    • Proportion of words with more than 6 characters
    • Proportion of embedded sentences
    • Proportion of sentences containing more than 20 words
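To make the surface-based ingredients more concrete, here is a minimal sketch in Python. It is not the Hohenheim implementation, which has not been published. The function names, the naive sentence splitter and the letter-run tokenizer are my own assumptions. The sketch computes four of the "other parameters" plus LIX, using the commonly cited definition of LIX (average sentence length plus the percentage of words with more than 6 characters). The proportion of embedded sentences is left out because it needs a syntactic parser, and the syllable-based formulas (Amstad, Wiener Sachtext Formula, SMOG) are left out because they need a syllable counter.

```python
import re


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Words are runs of letters, umlauts included; digits and punctuation are dropped."""
    return re.findall(r"[^\W\d_]+", sentence)


def surface_parameters(text):
    """Compute simple surface measures for a text (hypothetical helper)."""
    sentences = [tokenize(s) for s in split_sentences(text)]
    sentences = [s for s in sentences if s]  # ignore sentences without any word
    words = [w for s in sentences for w in s]
    if not words:
        raise ValueError("text contains no words")

    n_sentences = len(sentences)
    n_words = len(words)
    avg_sentence_length = n_words / n_sentences
    share_long_words = sum(1 for w in words if len(w) > 6) / n_words

    return {
        "avg_sentence_length": avg_sentence_length,
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "share_long_words": share_long_words,
        "share_long_sentences": sum(1 for s in sentences if len(s) > 20) / n_sentences,
        # LIX: words per sentence plus percentage of long words
        "lix": avg_sentence_length + 100 * share_long_words,
    }


if __name__ == "__main__":
    sample = "Die Partei verspricht Steuersenkungen. Das ist gut."
    for name, value in surface_parameters(sample).items():
        print(f"{name}: {value:.2f}")
```

A real system would need a proper sentence splitter and tokenizer for German, since abbreviations and ordinal numbers such as "26." break the naive approach used here.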

Apart from the syntax-based measure (embedded sentences), these other parameters are similar to the basic ingredients in any readability measure published since Rudolf Flesch made readability popular in the 1940s. The researchers compiled a corpus of hard texts using the abstracts of dissertations in political science. Furthermore, they compiled a corpus of easy texts from the politics pages of the German tabloid Bild. Every single variable listed above was then scaled to a value between 0 (very bad) and 10 (very good). The scaled values were then averaged separately within each of the two groups, the readability measures and the other parameters (see above). The sum of these two averages makes the actual comprehensibility index.
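The general procedure described above can be sketched in a few lines of Python. Please read this as one possible interpretation, not as the actual formula: how exactly the variables are scaled has not been revealed. I assume a linear mapping in which the value measured on the hard corpus maps to 0 and the value measured on the easy corpus maps to 10, with clipping outside that range. The names `scale_0_to_10` and `comprehensibility_index` are hypothetical. Under this reading the index ranges from 0 to 20, because it is the sum of two averages that each range from 0 to 10.

```python
from statistics import mean


def scale_0_to_10(value, hard_ref, easy_ref):
    """Map a raw value linearly so that hard_ref -> 0 and easy_ref -> 10.

    Assumption: linear interpolation with clipping. This works in both
    directions, i.e. for measures where higher means easier and for
    measures where higher means harder.
    """
    if hard_ref == easy_ref:
        raise ValueError("reference values must differ")
    score = 10 * (value - hard_ref) / (easy_ref - hard_ref)
    return max(0.0, min(10.0, score))


def comprehensibility_index(values, references, formula_names, parameter_names):
    """Sum of the two group averages of the scaled variables.

    values:     raw value per variable for the text to be scored
    references: (hard_ref, easy_ref) per variable, taken from the two corpora
    """
    scaled = {
        name: scale_0_to_10(values[name], *references[name]) for name in values
    }
    formula_avg = mean(scaled[name] for name in formula_names)
    parameter_avg = mean(scaled[name] for name in parameter_names)
    return formula_avg + parameter_avg


if __name__ == "__main__":
    # All numbers below are invented for illustration only.
    values = {"lix": 40, "avg_sentence_length": 15}
    references = {"lix": (60, 20), "avg_sentence_length": (25, 10)}
    index = comprehensibility_index(
        values, references, ["lix"], ["avg_sentence_length"]
    )
    print(f"index: {index:.2f}")
```

Whatever the real scaling looks like, the two reference corpora are what tie the index to German texts on politics: a text from a different genre would be scored against reference points that were never meant for it.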

Now what's the difference from all those readability formulas out there, including the ones which I mentioned in previous posts and which are computed in the Information Retrieval for Language Learning prototype? For me the main difference seems to be the fact that the Hohenheimer Verständlichkeitsindex is designed exclusively for use with German texts about politics. It is likely that it will perform quite well if used solely with texts from that genre. I am looking forward to reading their publications.

Outro

A new readability measure for German texts on politics is emerging from communication research. It will be interesting to see the final product and its impact. The creators of the new formula from Hohenheim University and CommunicationLab made a smart move in applying their methods of content analysis to political manifestos since this assures them a remarkable amount of public interest. The plain and simple scores produced by a readability measure do their job once more: there is a significant media response. While this may make other researchers jealous, only the upcoming publications will show what exactly they did and how to evaluate it.


1 Comment

  1. Having looked at the chart, I'd like to think that the ability to write readable manifestos correlates inversely with ideological stubbornness.
