<?xml version="1.0" encoding="utf-8" ?>

<rss version="2.0" 
   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
   xmlns:admin="http://webns.net/mvcb/"
   xmlns:dc="http://purl.org/dc/elements/1.1/"
   xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
   xmlns:wfw="http://wellformedweb.org/CommentAPI/"
   xmlns:content="http://purl.org/rss/1.0/modules/content/"
   
	 xmlns:podcast='http://ipodder.sourceforge.net/docs/podcast.html'
>
<channel>
    <title>Niels Ott - Automatic Mind</title>
    <link>http://www.drni.de/niels/s9y/</link>
    <description>Computational Linguist</description>
    <dc:language>en</dc:language>
    <generator>Serendipity 1.4.1 - http://www.s9y.org/</generator>
    <pubDate>Tue, 23 Feb 2010 23:20:12 GMT</pubDate>

    <image>
        <url>http://www.drni.de/niels/s9y/templates/default/img/s9y_banner_small.png</url>
        <title>RSS: Niels Ott - Automatic Mind - Computational Linguist</title>
        <link>http://www.drni.de/niels/s9y/</link>
        <width>100</width>
        <height>21</height>
    </image>

<item>
    <title>Named Entity list from German Wikipedia</title>
    <link>http://www.drni.de/niels/s9y/archives/16-Named-Entity-list-from-German-Wikipedia.html</link>
            <category>Automatic Mind</category>
    
    <comments>http://www.drni.de/niels/s9y/archives/16-Named-Entity-list-from-German-Wikipedia.html#comments</comments>
    <wfw:comment>http://www.drni.de/niels/s9y/wfwcomment.php?cid=16</wfw:comment>

    <slash:comments>1</slash:comments>
    <wfw:commentRss>http://www.drni.de/niels/s9y/rss.php?version=2.0&amp;type=comments&amp;cid=16</wfw:commentRss>
    

    <author>nospam@example.com (Niels Ott)</author>
    <content:encoded>
    &lt;p&gt;In 2008, I developed an anagram solver that is still presented online in &lt;a href=&quot;http://www.drni.de/blog/archives/502-Trichterwoerter-und-andere-bescheuerte-Raetsel-im-TV.html&quot;&gt;some&lt;/a&gt; &lt;a href=&quot;http://www.drni.de/blog/archives/524-Noch-mehr-Trichterwoerter.html&quot;&gt;German&lt;/a&gt; &lt;a href=&quot;http://www.drni.de/blog/archives/698-Iranbuschweg-Der-Wahnsinn-nimmt-kein-Ende.html&quot;&gt;posts&lt;/a&gt; of my other blog. It actually is an anagram solver, but it is presented as a riddle solver for a silly type of TV quiz. To cut a long story short: never, ever call the numbers shown on the screen. If you just want to know the solution, use the linked pages and turn off your TV right after.&lt;/p&gt;

&lt;p&gt;So what&#039;s interesting about anagram solvers and these riddles? Well, one needs a fairly large word list. I took the one from the Ispell dictionary, the one I have online on the page of &lt;a href=&quot;http://www.drni.de/niels/s9y/pages/bananasplit.html&quot;&gt;BananaSplit&lt;/a&gt;. I soon found out that these riddles made heavy use of celebrity names. Yet my word list back then contained only a few named entities. I solved this issue by harvesting Wikipedia. What do all humans have in common? Right, they are born at some point. And of course, everybody has to die. Now Wikipedia has these births and deaths lists in the entry for each year of the calendar. Back in May 2008, I downloaded all pages from year 1000 to year 2008 and simply extracted all the persons listed there as born or deceased. This gave me a list of 51,353 person names from German Wikipedia.&lt;/p&gt;
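
&lt;p&gt;To illustrate why the size of the word list matters, here is a minimal sketch of a word-list-based anagram lookup. It is not the code of the solver mentioned above; the function names and the tiny example list are assumptions made purely for illustration. Words are indexed by their sorted letters, so a name that is missing from the list can never be found.&lt;/p&gt;

```python
from collections import defaultdict


def signature(word):
    # Hypothetical helper: two strings are anagrams of each other
    # exactly when their sorted, lower-cased letters are identical.
    return "".join(sorted(word.lower().replace(" ", "")))


def build_index(words):
    # Map each letter signature to all list entries that share it.
    index = defaultdict(list)
    for word in words:
        index[signature(word)].append(word)
    return index


def solve(letters, index):
    # Return every known word that uses exactly the given letters.
    return index.get(signature(letters), [])


# Tiny illustrative word list; a real one would hold many thousands of entries.
example_index = build_index(["listen", "silent", "Elvis", "lives"])
print(solve("enlist", example_index))
```

&lt;p&gt;With such an index, adding the harvested person names simply means passing a longer list to the hypothetical build_index function.&lt;/p&gt;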

&lt;p&gt;Recently a colleague asked me if he could have this list. So why not simply put it online? Here we go.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;http://www.drni.de/niels/n3files/cl-blog/wp-person-names/08-may/wikipedia-persons-may08.csv.bz2&quot;&gt;Person name list&lt;/a&gt; from Wikipedia, CSV (bz2 compressed, 0.55 MB)&amp;#160;&amp;ndash; downloaded in May 2008.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.drni.de/niels/n3files/cl-blog/wp-person-names/08-may/extract-persons.pl&quot;&gt;Perl script&lt;/a&gt; used for extracting the named entities, in case you want to do this yourself.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I hope this is useful for some people out there. Feel free to let me know what you&#039;re using it for or what improvements could be made. You can use the comment box below or write me an &lt;a href=&quot;http://www.drni.de/niels/s9y/pages/contact-me.html&quot;&gt;e-mail&lt;/a&gt;.&lt;/p&gt;  
    </content:encoded>

    <pubDate>Wed,  3 Feb 2010 11:11:00 +0100</pubDate>
    <guid isPermaLink="false">http://www.drni.de/niels/s9y/archives/16-guid.html</guid>
    
</item>
<item>
    <title>Trouble with the Örgele</title>
    <link>http://www.drni.de/niels/s9y/archives/15-AErger-mit-der-OErgele.html</link>
            <category>Automatic Mind</category>
    
    <comments>http://www.drni.de/niels/s9y/archives/15-AErger-mit-der-OErgele.html#comments</comments>
    <wfw:comment>http://www.drni.de/niels/s9y/wfwcomment.php?cid=15</wfw:comment>

    <slash:comments>4</slash:comments>
    <wfw:commentRss>http://www.drni.de/niels/s9y/rss.php?version=2.0&amp;type=comments&amp;cid=15</wfw:commentRss>
    

    <author>nospam@example.com (Niels Ott)</author>
    <content:encoded>
    &lt;div class=&quot;drniBlogDecoImage&quot;&gt;&lt;img src=&quot;http://drni.de/niels/n3files/cl-blog/oergele1.jpg&quot;&gt;&lt;/div&gt;&lt;p&gt;&lt;em&gt;This article was originally written in German. It simply seemed odd to formulate complaints about a southern German newspaper article dealing with a southern German dialect in English. Stay tuned, more general stuff will come in English as usual in the future.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;»From Lorch comes the surprising news that there is a Swabian word for the drill chuck key. It is &lt;em&gt;Ärgele&lt;/em&gt;.« Thus begins another installment of the column &lt;em&gt;Schwäbisch auf Anfrage&lt;/em&gt; (Swabian on request) by Henning Petershagen, printed as usual in the Saturday magazine of the Südwest Presse, on 9 January 2010 to be precise. Unfortunately, the publisher does not offer this debatable piece online, which is why I took the risk of &lt;a href=&quot;http://drni.de/niels/n3files/cl-blog/2010-01-09-suedwest-presse-magazin-schwaebisch-aergele.jpg&quot;&gt;making a copy available&lt;/a&gt;. This post is not about attacking the dead-tree media either; rather, reading the column triggered my &lt;a href=&quot;http://www.iaas.uni-bremen.de/sprachblog/&quot;&gt;Stefanowitsch reflex&lt;/a&gt;: when people talk about language in public and overlook the basics of linguistics, someone simply has to comment on it. And a blog post is just the thing for that. But first things first&amp;hellip;&lt;/p&gt; &lt;p&gt;First, the column reports on a reader&amp;#160;&amp;ndash; a non-Swabian&amp;#160;&amp;ndash; who is said to have discovered the word &lt;em&gt;Ärgele&lt;/em&gt;. He now wants to know where it comes from and what it is all about. It is of course understandable that a non-Swabian is not too familiar with this, and eventually Petershagen finds the right explanation: &lt;em&gt;Ärgele&lt;/em&gt; is a diminutive of &lt;em&gt;Orgel&lt;/em&gt; (organ). He fishes around and finds other things that are an Örgele: the key that was used to tighten the binding on older ice skates. Eventually he arrives at the barrel organ. All of this may well be correct, but what does not fit at all is the distinction between &lt;em&gt;Ärgele&lt;/em&gt; and &lt;em&gt;Örgele&lt;/em&gt;. 
Yet the explanation is quite simple once one has dealt with phonetics and phonology (second semester of linguistics in Tübingen).&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Ärgele&lt;/em&gt; and &lt;em&gt;Örgele&lt;/em&gt; are exactly the same word. That &lt;em&gt;Örgele&lt;/em&gt; has the additional meaning of a drill chuck key for some speakers is a different story. What matters here is how written and spoken language map onto each other. Then there is the &lt;a href=&quot;http://de.wikipedia.org/wiki/Phoneminventar&quot;&gt;phoneme inventory&lt;/a&gt;: the inventory is a description of the phonemes that a language has. And here one simply has to know that the Standard German sounds for Ü ([ʏ] and [yː]) and Ö ([œ] and [øː]) do not exist in Swabian. This may of course be changing for younger speakers at present, but it certainly holds for the NORMs (&lt;em&gt;non-mobile older rural males&lt;/em&gt;, a term coined by Chambers and Trudgill in their book &lt;em&gt;Dialectology&lt;/em&gt;, Cambridge Univ. Press 1998).&lt;/p&gt;

&lt;p&gt;What a Swabian does on encountering an Ö in the written language depends on the environment. For simplicity, I will leave out the transcriptions from here on and assume that the spellings are read with a Standard German pronunciation. In general, a Swabian will pronounce a written Ö like an E. Take the place name &lt;a href=&quot;http://de.wikipedia.org/wiki/Mössingen&quot;&gt;&lt;em&gt;Mössingen&lt;/em&gt;&lt;/a&gt;, which becomes &lt;em&gt;Messenga&lt;/em&gt; (where the final -a is to be understood as a short, &lt;a href=&quot;http://de.wikipedia.org/wiki/Schwa&quot;&gt;schwa&lt;/a&gt;-like sound). Better known, and initially astonishing for non-Swabians, is surely the simple word &lt;em&gt;Ehl&lt;/em&gt;, which is just ordinary Öl (oil).&lt;/p&gt;

&lt;p&gt;So why does the Örgele, to the ears of a non-Swabian (see above), become not an &lt;em&gt;Ergele&lt;/em&gt; but an &lt;em&gt;Ärgele&lt;/em&gt;? The answer: &lt;a href=&quot;http://de.wikipedia.org/wiki/Assimilation_%28Phonologie%29&quot;&gt;assimilation&lt;/a&gt;, that is, the mutual influence of sounds in a particular environment. Anyone can observe this for themselves with a minimal pair: say the words &lt;em&gt;Ehre&lt;/em&gt; (honor) and &lt;em&gt;Ehe&lt;/em&gt; (marriage) aloud, and it turns out that in the first case the E tends towards an Ä because it is influenced by the following R. The neutral H in &lt;em&gt;Ehe&lt;/em&gt;, by contrast, leaves the E as it is. The R in Swabian comes in several variants, in some areas even as a trilled R. Common, however, is a &lt;a href=&quot;http://de.wikipedia.org/wiki/Pharyngalisierung&quot;&gt;pharyngealized&lt;/a&gt; R; one could say the Swabian ›swallows the R‹&amp;#160;&amp;ndash; it is articulated far back. This pushes a written Ö that has already become an E even further towards Ä, and there we are at &lt;em&gt;Ärgele&lt;/em&gt;. (See also: Markus Hiller, &lt;em&gt;Regressive Pharyngalisierung in Stuttgarter Schwäbischen als C-V-Interaktion&lt;/em&gt;, Linguistische Berichte (155), 1995)&lt;/p&gt;

&lt;p&gt;The Ärgele nevertheless remains the Örgele, for it is not a different word but merely the natural Swabian pronunciation of one and the same term. Petershagen&#039;s informant, to whom he hands the last paragraph of this installment of the column in a quotation, arrives at the same conclusion. The remarks about things to crank that can be called Örgele are interesting, but unfortunately they are irrelevant to explaining the pronunciation.&lt;/p&gt;

&lt;div id=&quot;drniPostFootnoteSec&quot;&gt;&lt;hr id=&quot;PostFootNote&quot; /&gt;
Image credit: &lt;a href=&quot;http://www.flickr.com/photos/dpup/3805549423/&quot;&gt;Darryl Coe&lt;/a&gt;, photographed by dpup, &lt;a href=&quot;http://creativecommons.org/licenses/by-nc-sa/2.0/deed.de&quot;&gt;CC&amp;#160;BY-NC-SA&lt;/a&gt;.&lt;/div&gt;
 
    </content:encoded>

    <pubDate>Tue, 26 Jan 2010 20:51:00 +0100</pubDate>
    <guid isPermaLink="false">http://www.drni.de/niels/s9y/archives/15-guid.html</guid>
    
</item>
<item>
    <title>Hohenheimer Verständlichkeitsindex</title>
    <link>http://www.drni.de/niels/s9y/archives/13-Hohenheimer-Verstaendlichkeitsindex.html</link>
            <category>Automatic Mind</category>
    
    <comments>http://www.drni.de/niels/s9y/archives/13-Hohenheimer-Verstaendlichkeitsindex.html#comments</comments>
    <wfw:comment>http://www.drni.de/niels/s9y/wfwcomment.php?cid=13</wfw:comment>

    <slash:comments>1</slash:comments>
    <wfw:commentRss>http://www.drni.de/niels/s9y/rss.php?version=2.0&amp;type=comments&amp;cid=13</wfw:commentRss>
    

    <author>nospam@example.com (Niels Ott)</author>
    <content:encoded>
    &lt;h3&gt;Intro&lt;/h3&gt;

&lt;p&gt;Tomorrow we are going to have elections to the Bundestag here in Germany. Democracies come in all kinds of flavors. In the German flavor, the Bundestag is the parliament superordinate to all federal states (Bundesländer). The Bundestag elections are the most important ones in Germany. Traditionally, parties present manifestos. Wikipedia does not miss the point by &lt;a href=&quot;http://en.wikipedia.org/wiki/Manifesto&quot;&gt;claiming&lt;/a&gt; that »in recent decades the status of electoral manifestos has diminished somewhat due to a significant tendency for winning parties to, following the election, either ignore, indefinitely delay, or even outright reject manifesto policies which were popular with the public.« Manifestos are promises that are seldom kept. Even worse, they do not even seem to be written for a public audience. At least this is what the results of an investigation based on content analysis and readability measures suggest. In this blog post I try to present how researchers from Hohenheim University and a private company in Ulm are using readability to assess the understandability of manifestos.&lt;/p&gt;
 &lt;h3&gt;Electoral Manifestos, Quantitatively&amp;hellip;&lt;/h3&gt;

&lt;p&gt;So first of all, what were these results mentioned in the intro? Let me quickly summarize what the researchers around Professor Frank Brettschneider from Hohenheim University are &lt;a href=&quot;https://komm.uni-hohenheim.de/wahlprogramm-check.html&quot;&gt;presenting online&lt;/a&gt; together with CommunicationLab, a company from Ulm. The results the media are fascinated by are primarily the ones shown in this figure:
&lt;/p&gt;
&lt;div  class=&quot;drniBlogFigure&quot;&gt;&lt;object  data=&quot;http://www.drni.de/niels/n3files/cl-blog/hohenheim-elections-2009.svg&quot; width=&quot;554&quot; height=&quot;283&quot;
type=&quot;image/svg+xml&quot;
codebase=&quot;http://www.adobe.com/svg/viewer/install/&quot;&gt;
&lt;a href=&quot;http://www.drni.de/niels/n3files/cl-blog/hohenheim-elections-2009.svg&quot;&gt;&lt;img
src=&quot;http://www.drni.de/niels/n3files/cl-blog/hohenheim-elections-2009.png&quot; alt=&quot;&quot; /&gt;&lt;/a&gt;
&lt;/object&gt;
&lt;cite class=&quot;drniBlogFigure&quot;&gt;
Electoral Manifestos of German parties as scored by the Hohenheim Comprehensibility Index.&lt;br /&gt;
Source: &lt;a href=&quot;https://komm.uni-hohenheim.de/fileadmin/einrichtungen/komm/PDFs/Komm/Wahlprogramm-Check/Wahlprogramm-Check_Btg-Wahl_2009.pdf&quot;&gt;Ergebnis-Präsentation Wahlprogramm-Check 2009&lt;/a&gt;, transferred to English by Niels Ott.&lt;/cite&gt;
&lt;/div&gt;

&lt;p&gt;They defined a Hohenheim Comprehensibility Index (&lt;em&gt;Hohenheimer Verständlichkeitsindex&lt;/em&gt;). The figure shows how well the electoral manifestos of the major German parties perform on that index. These scores are the outcome of the quantitative analysis conducted by the researchers. The inherent dream of all readability research lies in these results: one single and simple figure for scoring a text on a scale. It is somewhat unfortunate that all manifestos are rather hard to understand. The one of &lt;em&gt;Die Linke&lt;/em&gt; (the left-wing party) in particular is quite close to a doctoral dissertation in political science, which is certainly not the appropriate level of difficulty for reaching the masses. Interestingly, the winner in terms of this index, &lt;em&gt;Die Grünen&lt;/em&gt; (The Greens), presents a manifesto that is too long to read (about 52,000 words), while the loser &lt;em&gt;Die Linke&lt;/em&gt; comes up with the shortest one at about 20,000 words.&lt;/p&gt;

&lt;p&gt;Conducting such an analysis earned the researchers considerable media attention, given that this is only a single study. Various online sites of renowned newspapers and blogs added their interpretations depending on political opinions and tastes. (A small selection in German: &lt;a href=&quot;http://www.welt.de/politik/deutschland/article4162711/Programme-fast-aller-Parteien-voellig-unverstaendlich.html&quot;&gt;Welt Online&lt;/a&gt;, &lt;a href=&quot;http://www.zeit.de/online/2009/28/parteien-wahlprogramme&quot;&gt;Zeit Online&lt;/a&gt;, &lt;a href=&quot;http://www.focus.de/politik/deutschland/wahlen-2009/bundestagswahl/tid-14769/wahlprogramme-die-linken-sind-die-koenige-der-schachtelsaetze_aid_413931.html&quot;&gt;Focus Online&lt;/a&gt;, &lt;a href=&quot;http://www.youtube.com/watch?v=9vL8NijZJZs&quot;&gt;TV report by ZDF&lt;/a&gt;)&lt;/p&gt;

&lt;h3&gt;&amp;hellip; and Qualitatively&lt;/h3&gt;

&lt;p&gt;The Hohenheim people and those from CommunicationLab are well aware that there is more to text difficulty than counting the occurrences of traditional indicators. Having conducted a manual examination by experts, they claim that they can sustain the quantitative results. Briefly summarized, the manifestos have been written using too much technical language as well as using uncommon foreign words without explanation. Some criteria exceed the scope of pure readability, such as the one they call &lt;em&gt;Wording&lt;/em&gt;, which seems to punish texts for inappropriate wording, or the consideration of good and bad text layout.&lt;/p&gt;

&lt;h3&gt;Yet another Approach to approaching Language&lt;/h3&gt;

&lt;p&gt;Communication studies as an academic field looks at language from a different angle than (computational) linguistics does. Yet in the end, language remains language. Until a few weeks ago I did not even know of the existence of this field. Interestingly enough, the methodology of &lt;a href=&quot;http://en.wikipedia.org/wiki/Content_analysis&quot;&gt;content analysis&lt;/a&gt; comes in a flavor called &lt;em&gt;computational content analysis&lt;/em&gt; that uses some of the same technologies as computational linguists use and develop. CommunicationLab in Ulm therefore seems to be a spin-off company from this field that just accidentally deals with things I am also doing as a computational linguist, namely readability and automatic assessment of text difficulty.&lt;/p&gt;

&lt;p&gt;Computational content analysis will be a further topic for me to explore in the future. In the past I encountered a number of ways of dealing with language. The purist computer scientist may say that information is a sequence of symbols from a defined alphabet. In information retrieval, there is not a great deal of linguistics behind the processing. In computational linguistics, some people make things so linguistically precise that nothing works out in the end. Where on the scale between naivety and sophistication concerning language will content analysis turn out to be placed?&lt;/p&gt;

&lt;h3&gt;So how does the Hohenheimer Verständlichkeitsindex work?&lt;/h3&gt;

&lt;p&gt;As reported in personal communication with &lt;a href=&quot;https://komm.uni-hohenheim.de/kercher.html&quot;&gt;Jan Kercher&lt;/a&gt; from Hohenheim University, their formula is still a work in progress. They are in a phase of tuning and testing it, aiming to publish scientific work about it. For now they only reveal the &lt;a href=&quot;https://komm.uni-hohenheim.de/fileadmin/einrichtungen/komm/PDFs/Komm/Wahlprogramm-Check/Wahlprogramm-Check_Btg-Wahl_2009.pdf&quot;&gt;ingredients&lt;/a&gt; and the general procedure. There is no full recipe for baking that cake yet. So here we go with their list of variables used in the Hohenheimer Verständlichkeitsindex:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Readability measures:
  &lt;ul&gt;
     &lt;li&gt;Amstad Formula&lt;/li&gt;
     &lt;li&gt;Wiener Sachtext Formula&lt;/li&gt;
    &lt;li&gt;Simple Measure of Gobbledygook (SMOG)&lt;/li&gt;
    &lt;li&gt;Läsbarhetsindex (LIX)&lt;/li&gt;
  &lt;/ul&gt;&lt;/li&gt;
   &lt;li&gt;Other parameters:
   &lt;ul&gt;
   &lt;li&gt;Average sentence length&lt;/li&gt;
   &lt;li&gt;Average word length&lt;/li&gt;
   &lt;li&gt;Proportion of words with more than 6 characters&lt;/li&gt;
   &lt;li&gt;Proportion of embedded sentences&lt;/li&gt;
   &lt;li&gt;Proportion of sentences containing more than 20 words&lt;/li&gt;
   &lt;/ul&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Apart from the syntax-based measures (embedded sentences), these &lt;em&gt;other parameters&lt;/em&gt; are similar to the basic ingredients in any readability measure published since &lt;a href=&quot;http://en.wikipedia.org/wiki/Rudolf_Flesch&quot;&gt;Rudolf Flesch&lt;/a&gt; made readability popular in the 1940s. The researchers compiled a corpus of hard texts using the abstracts of dissertations in political sciences. Furthermore, they compiled a corpus of easy texts from the politics pages of the German tabloid &lt;em&gt;Bild&lt;/em&gt;. Every single variable listed above was then scaled to a value between 0 (very bad) and 10 (very good). The averages of these values were computed separately for the readability measures and for the other parameters (see above). The sum of these two averages yields the actual comprehensibility index, which therefore ranges from 0 to 20.&lt;/p&gt;
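
&lt;p&gt;Since the actual formula is unpublished, the following is only a sketch of the general procedure described above, not the researchers&#039; method. The linear scaling between two reference values (one taken from the hard corpus, one from the easy corpus), the clipping to the range from 0 to 10, and all function names are assumptions made for illustration.&lt;/p&gt;

```python
def scale(value, hard_ref, easy_ref):
    # ASSUMPTION: linear scaling where the hard-corpus reference maps to 0
    # (very bad) and the easy-corpus reference maps to 10 (very good).
    # The published material does not say how the scaling is really done.
    if hard_ref == easy_ref:
        raise ValueError("reference values must differ")
    score = 10.0 * (value - hard_ref) / (easy_ref - hard_ref)
    return min(10.0, max(0.0, score))  # clip to the 0..10 range


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def comprehensibility_index(readability_scores, other_scores):
    # Average each group of scaled variables, then add the two averages.
    return mean(readability_scores) + mean(other_scores)


# Made-up numbers: an average sentence length of 20 words, with assumed
# reference values of 30 (hard corpus) and 10 (easy corpus).
print(scale(20, 30, 10))
print(comprehensibility_index([5.0, 7.0], [4.0, 6.0]))
```

&lt;p&gt;Because both averages lie between 0 and 10, an index built this way lies between 0 and 20.&lt;/p&gt;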

&lt;p&gt;Now what&#039;s the difference from all those readability formulas out there, including the ones which I mentioned in previous posts and which are computed in the &lt;a href=&quot;http://www.drni.de/niels/s9y/archives/12-Text-Difficulty-and-Information-Retrieval.html&quot;&gt;Information Retrieval for Language Learning&lt;/a&gt; prototype? For me the main difference seems to be the fact that the Hohenheimer Verständlichkeitsindex is designed exclusively for use with German texts about politics. It is likely that it will perform quite well if used solely with texts from that genre. I am looking forward to reading their publications.&lt;/p&gt;

&lt;h3&gt;Outro&lt;/h3&gt;

&lt;p&gt;A new readability measure for German texts on politics is emerging from communication research. It will be interesting to see the final product and its impact. The creators of the new formula from Hohenheim University and CommunicationLab made a smart move in applying their methods of content analysis to political manifestos since this assures them a remarkable amount of public interest. The plain and simple scores produced by a readability measure do their stuff once more: there is a significant media response. While this may make other researchers jealous, only the upcoming publications will show what exactly they did and how to evaluate it.&lt;/p&gt; 
    </content:encoded>

    <pubDate>Sat, 26 Sep 2009 18:37:00 +0200</pubDate>
    <guid isPermaLink="false">http://www.drni.de/niels/s9y/archives/13-guid.html</guid>
    
</item>
<item>
    <title>Text Difficulty and Information Retrieval</title>
    <link>http://www.drni.de/niels/s9y/archives/12-Text-Difficulty-and-Information-Retrieval.html</link>
            <category>Automatic Mind</category>
    
    <comments>http://www.drni.de/niels/s9y/archives/12-Text-Difficulty-and-Information-Retrieval.html#comments</comments>
    <wfw:comment>http://www.drni.de/niels/s9y/wfwcomment.php?cid=12</wfw:comment>

    <slash:comments>2</slash:comments>
    <wfw:commentRss>http://www.drni.de/niels/s9y/rss.php?version=2.0&amp;type=comments&amp;cid=12</wfw:commentRss>
    

    <author>nospam@example.com (Niels Ott)</author>
    <content:encoded>
    &lt;div  class=&quot;drniBlogDecoFigure&quot;&gt;&lt;object  data=&quot;http://www.drni.de/niels/n3files/cl-blog/book-magnify.svg&quot; width=&quot;225&quot; height=&quot;210&quot;
type=&quot;image/svg+xml&quot;
codebase=&quot;http://www.adobe.com/svg/viewer/install/&quot;&gt;
&lt;a href=&quot;http://www.drni.de/niels/n3files/cl-blog/book-magnify.svg&quot;&gt;&lt;img
src=&quot;http://www.drni.de/niels/n3files/cl-blog/book-magnify.png&quot; alt=&quot;&quot; /&gt;&lt;/a&gt;
&lt;/object&gt;
&lt;/div&gt;&lt;h3&gt;Intro&lt;/h3&gt;

&lt;p&gt;It has been a long pause here on Automatic Mind. After finishing my Master&#039;s project and thesis, it took me some time to adjust to my new situation as a researcher here at Tübingen University. Meanwhile some things went on in the readability corner. The tool for computing readability formulas that I demonstrated as a Java applet in an &lt;a href=&quot;http://www.drni.de/niels/s9y/archives/10-Simple-Readability-Formulas-And-Boring-Preprocessing.html&quot;&gt;earlier post&lt;/a&gt; is now &lt;a href=&quot;http://www.drni.de/niels/s9y/pages/phantom.html&quot;&gt;freely available as a Java library&lt;/a&gt;&amp;ndash;including the applet and a standalone demo GUI. Some bugs have been squashed and all formulas have been cross-checked with the corresponding original publications. In this post I will focus on what one can do with those readability formulas in information retrieval. This is a brief summary of topics from my &lt;a href=&quot;http://drni.de/zap/ma-thesis&quot;&gt;MA Thesis&lt;/a&gt; entitled &lt;em&gt;Information Retrieval for Language Learning: An Exploration of Text Difficulty Measures&lt;/em&gt;. The practical part of my thesis lives on as the Information Retrieval for Language Learning (IR4LL) project, which also features an &lt;a href=&quot;http://drni.de/zap/ir4ll&quot;&gt;online demo and web site&lt;/a&gt;.&lt;/p&gt; &lt;h3&gt;Yet another Search Engine?&lt;/h3&gt;

&lt;p&gt;»There is Google, so what do you want?« Apart from the fact that it is cool and nerdy to be able to say that one has developed one&#039;s own search engine, there&#039;s more to it. Google and its competitors are very good at retrieving web pages of interest. But they can of course not guarantee that the texts on the returned pages are easy to read. However, it is crucial to language learners and teachers gathering readings to have material that is at a certain level of difficulty. Otherwise students will either be bored or overwhelmed. It is hard to say whether or not Google could do this if they were interested. They usually focus on statistical methods and it is unlikely that they would do deep natural language processing&amp;ndash; it is simply too expensive in terms of processing power. So here we go with a new search engine that limits itself to only a few web sites containing promising readings at several levels. At least this is the plan for the future; the &lt;a href=&quot;http://drni.de/zap/ir4ll&quot;&gt;current prototype&lt;/a&gt; shows that this is possible. Everything is work in progress.&lt;/p&gt;

&lt;h3&gt;Readability Measures and Beyond&lt;/h3&gt;

&lt;p&gt;A brief roundup on readability measures in general and on &lt;a href=&quot;http://www.drni.de/niels/s9y/archives/10-Simple-Readability-Formulas-And-Boring-Preprocessing.html&quot;&gt;my previous post&lt;/a&gt;: readability measures or formulas try to compute a single value stating how difficult a text is to read. Many formulas are supposed to yield numbers on the scale of &lt;a href=&quot;http://en.wikipedia.org/wiki/Education_in_the_United_States#School_grades&quot;&gt;U.S. grade levels&lt;/a&gt;. Most formulas use the average sentence length and the average word length as variables. Word length is measured in syllables or in characters. While the formulas look as if they contained a lot of magic numbers, their constants are in fact usually fitted so that the variables match texts with a known difficulty level.
To illustrate this, here is the Flesch-Kincaid formula which is supposed to yield levels on the grade level scale:&lt;/p&gt;

&lt;center&gt;
&lt;big&gt;Flesch-Kincaid = -15.59 + 11.8 &amp;times; &lt;i&gt;AWL&lt;sub&gt;&lt;small&gt;s&lt;/small&gt;&lt;/sub&gt;&lt;/i&gt;
+ 0.39 &amp;times; &lt;i&gt;ASL&lt;/i&gt;&lt;br /&gt;&lt;/big&gt;

where&lt;br /&gt;
&lt;i&gt;AWL&lt;sub&gt;&lt;small&gt;s&lt;/small&gt;&lt;/sub&gt;&lt;/i&gt; is the average word length counted in syllables and &lt;br /&gt;
&lt;i&gt;ASL&lt;/i&gt; is the average sentence length counted in words.
&lt;/center&gt;
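
&lt;p&gt;As a rough sketch, the formula above can be computed as follows. The sentence splitter, the word pattern and the vowel-group syllable counter are crude assumptions made for illustration; they are not the preprocessing used in the Phantom library, and real syllable counting is considerably harder.&lt;/p&gt;

```python
import re


def count_syllables(word):
    # ASSUMPTION: approximate syllables by groups of consecutive vowels.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_kincaid(text):
    # ASSUMPTION: sentences end at '.', '!' or '?'; words are runs of letters.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text needs at least one sentence and one word")
    asl = len(words) / len(sentences)  # average sentence length in words
    awl_s = sum(count_syllables(w) for w in words) / len(words)  # syllables per word
    return -15.59 + 11.8 * awl_s + 0.39 * asl


print(flesch_kincaid("The cat sat. The dog ran."))
```

&lt;p&gt;Very short, simple sentences such as the ones in this example yield a value below zero, which shows that the constants are fitted to ordinary running text rather than to edge cases.&lt;/p&gt;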

&lt;p&gt;To play with readability measures, check out the &lt;a href=&quot;http://www.drni.de/niels/n3files/phantom/latest.jnlp&quot; title=&quot;Tell your browser to open this with javaws!&quot;&gt;Java Webstart of the Phantom Demo GUI&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Unfortunately, the world of language is not as simple as that. Sentence length depends on the domain. It probably produces a stronger classification for novels vs. technical manuals than for easy vs. hard texts. Word length also reveals some flaws: the general assumption that long words are harder (or even rarer, if we agree with Zipf&#039;s &lt;a href=&quot;http://www.worldcat.org/oclc/250028603&amp;referer=brief_results&amp;lang=en&quot;&gt;&lt;em&gt;The Psycho-biology of Language&lt;/em&gt;&lt;/a&gt;) is called into question by frequent long words such as &lt;em&gt;beautiful&lt;/em&gt; or &lt;em&gt;absolutely&lt;/em&gt;. While the current IR4LL implementation strongly focuses on readability measures, there are things to be explored beyond those.&lt;/p&gt;

&lt;h3&gt;Vocabulary and Grammar&lt;/h3&gt;


&lt;p&gt;For future research, there are two basic strands which I want to follow with respect to Information Retrieval for Language Learning (IR4LL):&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;measures of text difficulty based on vocabulary lists, and&lt;/li&gt;
&lt;li&gt;syntax-based measures of text difficulty.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Thinking back to one&#039;s own foreign language classes in school, vocab drill is what everybody had to do. There are different styles of vocabulary learning, but there is no doubt that new words must be learned actively. This also means that missing vocabulary makes a text harder to read. Word lists are relatively easy to deal with for computational linguists. However, the contents of those lists must be well-chosen, so this might turn out to be a challenge. Furthermore, vocabulary is again domain-specific. If a learner is a fantasy literature nerd, he or she will probably be able to read a novel of that genre that even native speakers will have a hard time with.&lt;/p&gt;

&lt;p&gt;Syntactic complexity seems to be one of the more promising things to look at. 
Simple phrases linked with &lt;em&gt;and&lt;/em&gt; or commas are probably much easier to understand than deeply nested subordinate clauses. There are a couple of existing approaches which I&#039;m planning to integrate into the IR4LL prototype. A positive side effect of that will be the possibility to directly query for linguistic forms. One could query for texts containing a lot of gerunds, or a lot of simple past, and so on. Since most tenses are taught separately, this could reveal nice real-life reading materials for classroom use. Furthermore, the &lt;a href=&quot;http://drni.de/zap/werti&quot;&gt;WERTi system&lt;/a&gt; (an automatic intelligent workbook) interacts with IR4LL already in its latest prototype version.
A syntax-aware IR4LL system could greatly improve the usability of WERTi by finding better-suited readings.&lt;/p&gt;

&lt;p&gt;At the time of writing, many of these thoughts are left to future research, which I hope we will be able to conduct here at the linguistics department of Tübingen University.&lt;/p&gt;

&lt;h3&gt;Combining Measures into Simpler Difficulty Levels&lt;/h3&gt;

&lt;p&gt;There is yet another challenge to IR4LL: most users will not be able to specify their language proficiency level in a detailed way. A self-assessment such as »yeah, I&#039;m rather weak at future in the past, but I don&#039;t have any trouble with deeply structured sentences« is unlikely. Therefore, IR4LL aims to combine several measures (text difficulty, vocab-based, syntax-based) into single templates. I call these query models because they are used as part of the search engine query. It would be great to have query models that actually classify texts into a well-known scale such as &lt;a href=&quot;http://en.wikipedia.org/wiki/Common_European_Framework_of_Reference_for_Languages&quot;&gt;CEF levels&lt;/a&gt; or into the stages of a foreign language teaching curriculum.&lt;/p&gt;

&lt;h3&gt;Outro&lt;/h3&gt;

&lt;p&gt;With not much related work out there, my MA thesis project contributes to the small field of IR4LL. The thesis discusses ways of measuring text difficulty, and the implementation part provides an extensible framework with a running prototype. Future plans include the integration of a web crawler and the refinement of the text categorization into well-established difficulty levels. Once this has been achieved, the search engine will be of great use to language teachers and learners.
Will boring school book texts finally be a thing of the past? If we manage and if there is enough easy text on the web, they will.&lt;/p&gt;

&lt;div id=&quot;drniPostFootnoteSec&quot;&gt;&lt;hr id=&quot;PostFootNote&quot; /&gt;
Graphics taken from &lt;a href=&quot;http://www.openclipart.org/&quot;&gt;Open Clip Art Library&lt;/a&gt;, modified by Niels Ott.&lt;/div&gt; 
    </content:encoded>

    <pubDate>Thu, 20 Aug 2009 11:21:00 +0200</pubDate>
    <guid isPermaLink="false">http://www.drni.de/niels/s9y/archives/12-guid.html</guid>
    
</item>
<item>
    <title>CL Blogs and a New Name</title>
    <link>http://www.drni.de/niels/s9y/archives/11-CL-Blogs-and-a-New-Name.html</link>
            <category>Automatic Mind</category>
    
    <comments>http://www.drni.de/niels/s9y/archives/11-CL-Blogs-and-a-New-Name.html#comments</comments>
    <wfw:comment>http://www.drni.de/niels/s9y/wfwcomment.php?cid=11</wfw:comment>

    <slash:comments>2</slash:comments>
    <wfw:commentRss>http://www.drni.de/niels/s9y/rss.php?version=2.0&amp;type=comments&amp;cid=11</wfw:commentRss>
    

    <author>nospam@example.com (Niels Ott)</author>
    <content:encoded>
    &lt;div  class=&quot;drniBlogDecoFigure&quot;&gt;&lt;object  data=&quot;http://www.drni.de/niels/n3files/cl-blog/brainputer.svg&quot; width=&quot;225&quot; height=&quot;138&quot;
type=&quot;image/svg+xml&quot;
codebase=&quot;http://www.adobe.com/svg/viewer/install/&quot; /&gt;
&lt;a href=&quot;http://www.drni.de/niels/n3files/cl-blog/brainputer.svg&quot;&gt;&lt;img
src=&quot;http://www.drni.de/niels/n3files/cl-blog/brainputer.png&quot; alt=&quot;&quot; /&gt;&lt;/a&gt;
&lt;/object&gt;
&lt;/div&gt;&lt;p&gt;Marveling at Jason Adams&#039; &lt;a href=&quot;http://mendicantbug.com/2009/01/24/computational-linguistics-blogs/&quot;&gt;collection of computational linguistics blogs&lt;/a&gt;, I noticed that &lt;em&gt;CL Blog&lt;/em&gt; is a rather dull name for a blog. It somehow felt like naming a newspaper &lt;em&gt;Newspaper&lt;/em&gt;. Back then I decided that this blog needs a new name. I just renamed it to &lt;em&gt;Automatic Mind&lt;/em&gt;. The term is actually related to &lt;a href=&quot;http://en.wikipedia.org/wiki/Dual_process_theory#Duplex_model&quot;&gt;Dual Process Theory&lt;/a&gt; and refers to the fact that we can simultaneously walk and talk, or perform other tasks of which one is conscious and the other one subconscious. Then again, it also refers to computational linguistics. The human mind can process language. The computer maybe can&amp;#160;&amp;ndash; a little bit. What we need or dream of is an automatic mind. &lt;/p&gt;

&lt;p&gt;Currently I am consumed by working on my Master&#039;s Thesis, so I rarely find time to read blogs, let alone write serious posts. Please stay subscribed.&lt;/p&gt;

&lt;div id=&quot;drniPostFootnoteSec&quot;&gt;&lt;hr id=&quot;PostFootNote&quot; /&gt;
Graphics taken from &lt;a href=&quot;http://www.openclipart.org/&quot;&gt;Open Clip Art Library&lt;/a&gt;, modified by Niels Ott.&lt;/div&gt;  
    </content:encoded>

    <pubDate>Mon, 16 Feb 2009 16:05:00 +0100</pubDate>
    <guid isPermaLink="false">http://www.drni.de/niels/s9y/archives/11-guid.html</guid>
    
</item>
<item>
    <title>Simple Readability Formulas And Boring Preprocessing</title>
    <link>http://www.drni.de/niels/s9y/archives/10-Simple-Readability-Formulas-And-Boring-Preprocessing.html</link>
            <category>Automatic Mind</category>
    
    <comments>http://www.drni.de/niels/s9y/archives/10-Simple-Readability-Formulas-And-Boring-Preprocessing.html#comments</comments>
    <wfw:comment>http://www.drni.de/niels/s9y/wfwcomment.php?cid=10</wfw:comment>

    <slash:comments>5</slash:comments>
    <wfw:commentRss>http://www.drni.de/niels/s9y/rss.php?version=2.0&amp;type=comments&amp;cid=10</wfw:commentRss>
    

    <author>nospam@example.com (Niels Ott)</author>
    <content:encoded>
    &lt;div  class=&quot;drniBlogDecoFigure&quot;&gt;&lt;object  data=&quot;http://www.drni.de/niels/n3files/cl-blog/book-scales.svg&quot; width=&quot;222&quot; height=&quot;326&quot;
type=&quot;image/svg+xml&quot;
codebase=&quot;http://www.adobe.com/svg/viewer/install/&quot; /&gt;
&lt;a href=&quot;http://www.drni.de/niels/n3files/cl-blog/book-scales.svg.svg&quot;&gt;&lt;img
src=&quot;http://www.drni.de/niels/n3files/cl-blog/book-scales.svg.png&quot; alt=&quot;&quot; /&gt;&lt;/a&gt;
&lt;/object&gt;
&lt;/div&gt;&lt;h3&gt;Intro&lt;/h3&gt;

&lt;p&gt;Readability formulas date back to the 1920s. They come in countless shapes and flavors, all sharing one common dream of their makers: to have a simple mathematical means of determining the reading difficulty of a given text. Is this text suitable as a reading for 4th-graders? Just stuff it into the formula and you will know which grade-level it fits. Of course, people put up warning signs telling the naive users out there what to do and what not to do with these formulas. But don&#039;t these formulas resemble the big dream of all natural language processing (NLP)? After all, all we want to have is something smart and simple that does the job of dealing with real world language. In this blog post, I will give a basic introduction to readability measures and I will point out in some detail that ›boring‹ preprocessing steps such as tokenization and sentence splitting are often underestimated. An interactive demo for computing readability scores is included.&lt;/p&gt; &lt;h3&gt;Two Example Formulas&lt;/h3&gt;

&lt;p&gt;One of the most popular formulas is the &lt;a href=&quot;http://en.wikipedia.org/wiki/Flesch-Kincaid&quot;&gt;Flesch Reading Ease&lt;/a&gt;, introduced by Rudolf Flesch in 1948 in his article &lt;em&gt;A New Readability Yardstick&lt;/em&gt;. A reading ease of
90.0 to 100.0 indicates that the given text is very easy to read. Comic books, for example, fall into this range.
The range of 0 to 30 indicates texts that are very hard to read. According to Flesch, scientific publications score in this range.
The formula by Flesch looks relatively plain and simple, apart from some funny magic numbers:&lt;/p&gt;

&lt;center&gt;
&lt;big&gt;Reading Ease = 206.835 - 0.846 &amp;times; &lt;i&gt;WL&lt;/i&gt; - 1.015 &amp;times; &lt;i&gt;SL&lt;/i&gt;&lt;br /&gt;&lt;/big&gt;
where&lt;br /&gt;
&lt;i&gt;WL&lt;/i&gt; = the number of syllables per 100 words (word length)&lt;br /&gt;
&lt;i&gt;SL&lt;/i&gt; = the average sentence length in words
&lt;/center&gt;

&lt;p&gt;The formulation &lt;em&gt;per 100 words&lt;/em&gt; indicates a fact common to many early formulas: as computations had to be done manually, many authors advised users to work on small samples such as 100 words. Some variants of this advice suggest taking a small sample each from the beginning, the middle, and the end of a text.
Kincaid and colleagues later adapted the formula to yield grade levels of U.S. education. The general idea and the linguistic analysis remained the same; only the magic numbers were adapted.&lt;/p&gt;
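&lt;p&gt;As a minimal sketch, the Reading Ease formula given above can be written down in a few lines of Python. The function name and the choice of raw counts as parameters are mine; how words, sentences and syllables get counted is deliberately left out here, because that is exactly the hard part.&lt;/p&gt;

&lt;pre&gt;
```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease from raw counts of a text or sample."""
    if not (words > 0 and sentences > 0):
        raise ValueError("need at least one word and one sentence")
    wl = 100.0 * syllables / words  # WL: syllables per 100 words
    sl = words / sentences          # SL: average sentence length in words
    return 206.835 - 0.846 * wl - 1.015 * sl


# A made-up 100-word sample with 10 sentences and 130 syllables.
score = flesch_reading_ease(words=100, sentences=10, syllables=130)
```
&lt;/pre&gt;

&lt;p&gt;For this invented sample the score comes out at about 86.7, just below the »very easy« band.&lt;/p&gt;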

&lt;p&gt;A less popular measure is the &lt;em&gt;Läsbarhetsindex&lt;/em&gt; (Readability Index, LIX) introduced by Carl-Hugo Björnsson in 1968. ›Less popular‹ here means less popular in the English-speaking part of the world. For Swedish and Danish, LIX seems to be widely used. It is unclear where the formula given in the &lt;a href=&quot;http://sv.wikipedia.org/wiki/LIX&quot;&gt;Swedish Wikipedia article&lt;/a&gt; is taken from. The German translation &lt;em&gt;Lesbarkeit mit LIX&lt;/em&gt; (Readability with LIX) mentions several versions of one formula with differing magic numbers adapted for German and Swedish each. I need to do further research on this. The formula is commonly given as follows:&lt;/p&gt;

&lt;center&gt;
&lt;big&gt;LIX = &lt;i&gt;W&lt;/i&gt; / &lt;i&gt;P&lt;/i&gt; + (&lt;i&gt;L&lt;/i&gt; &amp;times; 100) / &lt;i&gt;W&lt;/i&gt;&lt;br /&gt;&lt;/big&gt;
where&lt;br /&gt;
&lt;i&gt;P&lt;/i&gt; = number of periods in the text or sample (lazy version of number of sentences)&lt;br /&gt;
&lt;i&gt;W&lt;/i&gt; = number of words in the text or sample&lt;br /&gt;
&lt;i&gt;L&lt;/i&gt; = number of long words (more than 6 characters) in the text or sample
&lt;/center&gt;

&lt;p&gt;The interpretation of LIX values ranges from 25 (very easy) to 65 (very difficult). But what is more important is that LIX does not require syllable counting. Syllable counting was found to be tedious in 1968, and the intended audience, people involved in language teaching, probably had neither access to computers nor the knowledge to operate them. Nowadays, the problem is not computational power but the lack of accurate analysis. Part of this issue is discussed below.&lt;/p&gt;
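&lt;p&gt;The LIX formula as given above can be sketched in Python as follows. The tokenization is my own assumption for this sketch: a word is any run of letters, so hyphenated words are split and numbers are ignored. A different tokenizer will give different values.&lt;/p&gt;

&lt;pre&gt;
```python
import re

# Assumption for this sketch: a word is a run of letters (Unicode-aware).
WORD = re.compile(r"[^\W\d_]+")


def lix(text):
    """LIX = W / P + (L * 100) / W, counted on plain text."""
    words = WORD.findall(text)
    periods = text.count(".")  # P: the lazy stand-in for sentences
    if not words or periods == 0:
        raise ValueError("need at least one word and one period")
    long_words = sum(1 for word in words if len(word) > 6)  # L
    return len(words) / periods + 100.0 * long_words / len(words)


sample = "The cat sat. The elephant remembered everything."
value = lix(sample)
```
&lt;/pre&gt;

&lt;p&gt;The sample has 7 words, 2 periods and 3 long words, so the result is 7 / 2 + 300 / 7, which is roughly 46.4.&lt;/p&gt;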

&lt;p&gt;Most readability formulas look like the ones above. Of course, there are formulas that include more intelligent analyses such as word frequency lists, or even syntactic analyses (sentence structure). It seems that most readability formulas do not stand on solid theoretical ground. But then again, who cares about these assumptions as long as these formulas actually work? Most authors carefully restrict the validity of their formulas to a certain language and even to certain text types and profiles of the readers addressed. For example, the FORCAST formula introduced in 1973 by Caylor and colleagues is to be used only for U.S. Army technical documents that are read by young adult male readers.&lt;/p&gt;

&lt;h3&gt;Readability Demo&lt;/h3&gt;

&lt;p&gt;If you have Java installed on your system, you can get a feeling for a number of readability measures by using the little program below. Some ideas for input: &lt;a href=&quot;http://www.neopets.com/neopedia.phtml?neopedia_id=123&amp;criteria=&quot;&gt;Texts for children&lt;/a&gt;, &lt;a href=&quot;http://en.wikipedia.org/wiki/Beer&quot;&gt;Normal English Wikipedia articles&lt;/a&gt; vs. &lt;a href=&quot;http://simple.wikipedia.org/wiki/Beer&quot;&gt;Simple English Wikipedia articles&lt;/a&gt;, &lt;a href=&quot;http://portal.acm.org/citation.cfm?id=990837&quot;&gt;abstracts of scientific papers&lt;/a&gt;.
 (Use the copy-paste keyboard shortcuts in the text area of the program.)
&lt;/p&gt;

&lt;center&gt;&lt;applet align=&quot;center&quot; width=&quot;540&quot; height=&quot;478&quot;  code=&quot;de/drni/readability/demo/gui/Applet.class&quot; archive=&quot;http://www.drni.de/niels/n3files/cl-blog/phantom-applet-beta-0.0.2.jar&quot; &gt;
&lt;param name=&quot;text&quot; value=&quot;Readability formulas date back to the 1920s. They come in countless shapes and flavors, all sharing one common dream of their makers: to have a simple mathematical means of determining the reading difficulty of a given text. Is this text suitable as a reading for 4th-graders? Just stuff it into the formula and you will know which grade-level it fits. Of course, people put up warning signs telling the naive users out there what to do and what not to do with these formulas. But don&#039;t these formulas resemble the big dream of all natural language processing (NLP)? After all, all we want to have is something smart and simple that does the job of dealing with real world language. In this blog post, I will give a basic introduction to readability measures and I will point out in some detail that ›boring‹ preprocessing steps such as tokenization and sentence splitting are often underestimated. An interactive demo for computing readability scores is included.&quot;&gt;
&lt;p&gt;&lt;small&gt;(Apparently you do not have the Java plugin available in your web browser. If you do have Java installed, you can still &lt;a href=&quot;http://www.drni.de/niels/n3files/cl-blog/phantom-applet-beta-0.0.2.jnlp&quot;&gt;run the demo application using Java WebStart&lt;/a&gt;. Your browser might ask you what to do with the file: tell it to open the file with &lt;code&gt;javaws&lt;/code&gt;.)&lt;/small&gt;&lt;/p&gt;
 &lt;/applet&gt;
&lt;br /&gt;&lt;small&gt;&lt;strong&gt;(This program is an old, probably incorrect version.)&lt;/strong&gt;&lt;/small&gt;
&lt;/center&gt;

&lt;p&gt;The program is actually a preview of my readability library called &lt;em&gt;Phantom Readability Library&lt;/em&gt;, which will be available from the &lt;a href=&quot;http://www.drni.de/niels/s9y/pages/software-projects.html&quot;&gt;software projects section&lt;/a&gt; of my web page soon. Be aware that some measures might be incorrect. I am currently gathering the publications behind those formulas in order to check each of them for correctness.&lt;/p&gt;

&lt;h3&gt;Tokenization, Counting Sentences and Syllables&lt;/h3&gt;

&lt;p&gt;At the beginning of my readability adventure, I planned to simply use &lt;a href=&quot;http://www.representqueens.com/fathom/&quot;&gt;Java Fathom&lt;/a&gt;, a port of Perl&#039;s &lt;a href=&quot;http://search.cpan.org/~kimryan/Lingua-EN-Fathom-1.12/lib/Lingua/EN/Fathom.pm&quot;&gt;Lingua::EN::Fathom&lt;/a&gt; available from &lt;a href=&quot;http://en.wikipedia.org/wiki/CPAN&quot;&gt;CPAN&lt;/a&gt;. I played with some texts, and &lt;a href=&quot;http://www.neopets.com/neopedia.phtml?neopedia_id=123&amp;criteria=&quot;&gt;a story for children&lt;/a&gt; came out at a Flesch-Kincaid grade level above 13. Clearly, something must have gone wrong. After a short journey into the code, I found out that the sentence counter could not deal with direct speech as used in that story. As mentioned above, most formulas penalize texts for having long sentences. With the issue fixed, the grade level for the same text now computes to 2.6 (in the current version of the program mentioned above).&lt;/p&gt;
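&lt;p&gt;The following Python sketch shows how direct speech can trip up a sentence counter. It is not the code from Fathom or from my library; both patterns are simplified assumptions made up for this illustration. The naive pattern only accepts terminal punctuation that is directly followed by whitespace or the end of the text, while the second one tolerates a closing quotation mark in between.&lt;/p&gt;

&lt;pre&gt;
```python
import re

# Naive rule: punctuation must be followed directly by whitespace or end of text.
NAIVE_END = re.compile(r"[.!?](?=\s|$)")
# Quote-aware rule: one closing quotation mark may follow the punctuation.
QUOTE_AWARE_END = re.compile(r"[.!?][\"'\u201d]?(?=\s|$)")


def count_sentences(text, pattern):
    """Count sentence ends according to the given pattern."""
    return len(pattern.findall(text))


story = '"Where are you?" she asked. "I am here." "Come!"'
naive_count = count_sentences(story, NAIVE_END)
quote_aware_count = count_sentences(story, QUOTE_AWARE_END)
```
&lt;/pre&gt;

&lt;p&gt;On this made-up snippet the naive pattern finds a single sentence end, the quote-aware one finds four. Whether a quoted question followed by »she asked« should count as one sentence or two is debatable, but either way the naive count makes the average sentence look several times longer than it is.&lt;/p&gt;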

&lt;p&gt;A former student colleague kindly provided me with her version of a Perl-based syllable counter, which I then ported to Java. She reports over 96% correct results on a large test data set, which is fairly good for a rule-based approach that operates directly on English spelling. With that tool available, I compared my results to those of other tools, mostly those &lt;a href=&quot;http://www.online-utility.org/english/readability_test_and_improve.jsp&quot;&gt;available online&lt;/a&gt;. They gave different results on measures that involve syllable counting. I found out that an average syllables-per-word ratio of 1.5 (other tools) vs. 1.2 (my program) affects the readability measures a lot.&lt;/p&gt;
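&lt;p&gt;To put a number on »a lot«, here is a small worked example in Python using the Flesch Reading Ease formula from above. Only the word-length term is affected by the syllable counter, so the sentence-length term cancels out when comparing two tools on the same text. The names in the code are mine.&lt;/p&gt;

&lt;pre&gt;
```python
WL_COEFFICIENT = 0.846  # weight of syllables per 100 words in Reading Ease


def word_length_term(syllables_per_word):
    """Points subtracted from the Reading Ease score for word length."""
    return WL_COEFFICIENT * 100.0 * syllables_per_word


# Same text, two syllable counters: 1.5 vs. 1.2 syllables per word.
gap = word_length_term(1.5) - word_length_term(1.2)
```
&lt;/pre&gt;

&lt;p&gt;The gap is 84.6 &amp;times; 0.3, i.e. roughly 25 points on a scale that spans only about 100 points in practice.&lt;/p&gt;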

&lt;p&gt;So what&#039;s the message? What I am trying to say here is that preprocessing accuracy does matter. The example with the sentence counting shows that it even matters a lot. And this is what happens to me all the time: after having implemented any analysis component beyond tokenization and sentence splitting, I always find out that due to some errors during preprocessing, all later steps fail to a large extent. Preprocessing is the most underestimated thing in NLP! People tend to think about it as a solved problem, but in fact and in practice, it is still one of the biggest challenges we face.&lt;/p&gt;

&lt;p&gt;Sometimes, as with the Fathom module, it is just people not being imaginative enough. Having &lt;code&gt;[a-zA-Z]&lt;/code&gt; as the only legal characters of a word reveals an anglocentric computer scientist&#039;s view on a world without any other language than English and without foreign words being used.&lt;/p&gt;
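&lt;p&gt;A quick Python sketch shows what an ASCII-only word pattern does to non-English words. The letters-only alternative below is just one possible replacement that I chose for this illustration, not what any of the mentioned tools actually use.&lt;/p&gt;

&lt;pre&gt;
```python
import re

ASCII_WORD = re.compile(r"[a-zA-Z]+")
# Any run of Unicode letters: word characters minus digits and underscore.
UNICODE_WORD = re.compile(r"[^\W\d_]+")

sample = "L\u00e4sbarhetsindex and T\u00fcbingen"  # Läsbarhetsindex and Tübingen

ascii_tokens = ASCII_WORD.findall(sample)
unicode_tokens = UNICODE_WORD.findall(sample)
```
&lt;/pre&gt;

&lt;p&gt;The ASCII pattern yields five tokens (&lt;code&gt;L&lt;/code&gt;, &lt;code&gt;sbarhetsindex&lt;/code&gt;, &lt;code&gt;and&lt;/code&gt;, &lt;code&gt;T&lt;/code&gt;, &lt;code&gt;bingen&lt;/code&gt;), the letters-only pattern yields the expected three. Every word count, word length and syllable count downstream inherits this error.&lt;/p&gt;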

&lt;h3&gt;The Use and Abuse of Readability Formulas&lt;/h3&gt;

&lt;p&gt;So what are people doing with these formulas? First of all, there are offers of commercial tools with remarkable prices. That does not mean these products are actually selling, though. William H. DuBay writes in his accessible overview &lt;a href=&quot;http://www.impact-information.com/impactinfo/readability02.pdf&quot;&gt;&lt;em&gt;The Principles of Readability&lt;/em&gt;&lt;/a&gt; that the average adult citizen of the U.S. reads at the 7th grade level. A text written at the 10th grade level will not be understood by 80% of the U.S. population. In genres where a broad audience is addressed, such as manuals, health care, or government information, checking the texts with readability formulas can give hints on where to improve the communication. Apart from that, publishers may want to decrease the reading difficulty of their books or newspapers in order to reach a larger number of customers. However, they might scare away readers with higher reading proficiency, as reading a lot of text below one&#039;s level can be boring or even exhausting.&lt;/p&gt;

&lt;p&gt;Language teachers could in theory use readability formulas to judge whether or not a text is suitable for their students.&lt;/p&gt;

&lt;p&gt;One issue concerning readability formulas is that one must not ›write to the formula‹. Quite naturally, one would trick the formula by e.g. using shorter words or shorter sentences. This leads to texts with shorter sentences that are not necessarily easier. Readability formulas simply do not work the other way round.&lt;/p&gt;

&lt;p&gt;A common misconception may be that these formulas are supposed to be an exact means of judging the reading difficulty of a text. However, they do not produce much more than ballpark figures. Their exactness heavily depends on the suitability of the text for the formula in use. Furthermore, the implementations in computer programs may differ widely, depending on the quality of preprocessing as discussed above and the linguistic conceptions of their makers. Christian Watson discusses the differences in the results of several online tools in his &lt;a href=&quot;http://www.smileycat.com/miaow/archives/000875.php&quot;&gt;Smiley Cat Web Design Blog&lt;/a&gt;. Last but not least, it is possible to write a difficult text with easy words and easy sentences, simply because it deals with a topic hardly anyone is familiar with. This issue is partly addressed by formulas using vocabulary frequency lists&amp;hellip; another story to write a blog post about.&lt;/p&gt;


&lt;h3&gt;Outro&lt;/h3&gt;

&lt;p&gt;Readability formulas are one of the things in computational linguistics that work well under certain conditions without requiring complete and complex analyses of language. However, these are rather shallow heuristics that may fail for a large number of texts. If they are used in the wrong context, they will almost certainly fail. What makes them attractive is their simplicity&amp;#160;&amp;ndash; which at the same time bears the danger of being blinded by its beauty, leading to the overuse or abuse of the formulas. From the computational perspective, these formulas are fragile because they depend on preprocessing analyses such as sentence splitting and syllable counting. The success of any analysis component stands and falls with preprocessing. We as CL people should not forget about that simple fact. The helpfulness of readability measures in computer-aided language learning (CALL) will be the subject of further analysis and discussion in my master&#039;s thesis.&lt;/p&gt;

&lt;div id=&quot;drniPostFootnoteSec&quot;&gt;&lt;hr id=&quot;PostFootNote&quot; /&gt;
Graphics taken from &lt;a href=&quot;http://www.openclipart.org/&quot;&gt;Open Clip Art Library&lt;/a&gt;, modified by Niels Ott.&lt;/div&gt; 
    </content:encoded>

    <pubDate>Fri, 23 Jan 2009 14:57:00 +0100</pubDate>
    <guid isPermaLink="false">http://www.drni.de/niels/s9y/archives/10-guid.html</guid>
    
</item>
<item>
    <title>The USES Issue</title>
    <link>http://www.drni.de/niels/s9y/archives/5-The-USES-Issue.html</link>
            <category>Automatic Mind</category>
    
    <comments>http://www.drni.de/niels/s9y/archives/5-The-USES-Issue.html#comments</comments>
    <wfw:comment>http://www.drni.de/niels/s9y/wfwcomment.php?cid=5</wfw:comment>

    <slash:comments>13</slash:comments>
    <wfw:commentRss>http://www.drni.de/niels/s9y/rss.php?version=2.0&amp;type=comments&amp;cid=5</wfw:commentRss>
    

    <author>nospam@example.com (Niels Ott)</author>
    <content:encoded>
    &lt;div  class=&quot;drniBlogDecoFigure&quot;&gt;&lt;object  data=&quot;http://www.drni.de/niels/n3files/cl-blog/trash-diskette.svg&quot; width=&quot;202&quot; height=&quot;258&quot;
type=&quot;image/svg+xml&quot;
codebase=&quot;http://www.adobe.com/svg/viewer/install/&quot; /&gt;
&lt;a href=&quot;http://www.drni.de/niels/n3files/cl-blog/trash-diskette.svg&quot;&gt;&lt;img
src=&quot;http://www.drni.de/niels/n3files/cl-blog/trash-diskette.png&quot; alt=&quot;&quot; /&gt;&lt;/a&gt;
&lt;/object&gt;
&lt;/div&gt;&lt;h3&gt;Intro&lt;/h3&gt;

&lt;p&gt;It is hard to term the phenomenon without offending someone. Good names would be &lt;em&gt;Scienceware&lt;/em&gt;, or &lt;em&gt;Guruware&lt;/em&gt;, or even better &lt;em&gt;Scientistware&lt;/em&gt;. They are all taken by companies or other institutions that presumably all do a way too good job to provide a name for a negative aspect. So let me call it USES for &lt;em&gt;Unsustainable Software Emerging from Science&lt;/em&gt;. This blog post shall shed some light onto the issues of USES and onto possible reasons.&lt;/p&gt; &lt;h3&gt;What USES is All About&lt;/h3&gt;

&lt;p&gt;As a computational linguist, I work with specialized software each and every day, be it part-of-speech taggers, tools to explore corpora or treebanks, or simply software development tools such as compilers or even integrated development environments. But one type of software stands out: &lt;em&gt;Unsustainable Software Emerging from Science&lt;/em&gt; (USES). Common features of this type of program include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Usually developed for a very special and highly sophisticated purpose that is only understood within the field.&lt;/li&gt;
&lt;li&gt;Developed by a single person who is a major expert within this field.&lt;/li&gt;
&lt;li&gt;The major expert developing the software is often an expert in neither software design nor software architecture.&lt;/li&gt;
&lt;li&gt;Examples of file formats are rare and there is little to no documentation about the software.&lt;/li&gt;
&lt;li&gt;The software reacts unexpectedly to certain types of input, e.g. it ignores syntax mistakes in grammar files and then malfunctions without telling users why.&lt;/li&gt;
&lt;li&gt;The software is often not completely finished and includes some missing bridges at the end of some roads without any warning signs.&lt;/li&gt;
&lt;li&gt;The only person who knows how to work with it is the major expert in the field who is not the major expert in writing usable software.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Since I do not intend to offend anybody, let me give an anonymous example. In a paper about parsing noun phrases with a certain parser, it is written that a single day was spent on writing the slightly over 100 grammar rules in use. A number of source code examples garnish the publication, and it seems to describe a mighty piece of software that performs well and is easy to operate. But this opinion can change rapidly once one starts using the software. The grammar parser is not robust enough to deal with missing semicolons. Sometimes it notices them, reporting an error 20 or 30 lines before or after; at other times it just ignores the issue and interprets something the grammar writer did not intend at all&amp;#160;&amp;ndash; without saying so.&lt;/p&gt;

&lt;p&gt;I am sorry to say that this behavior&amp;#160;&amp;ndash; which is only one example of bad application behavior&amp;#160;&amp;ndash; is shared by a number of applications I have been using so far. In all cases, documentation was sparse and I spent a number of days or weeks on trial and error procedures.&lt;/p&gt;

&lt;h3&gt;The Lack of Documentation&lt;/h3&gt;

&lt;p&gt;Why is there a lack of documentation of USES? Here are my hypotheses: science as a system rewards publications, be they books or&amp;#160;&amp;ndash; probably even more important for most authors&amp;#160;&amp;ndash; papers accepted at conferences or by journal boards. In papers, people report on the great insights they gained. Of course, these insights were gained by employing USES. Which is alright. However, there are three things that are not rewarded by this system: 1) the free availability of the described software to other researchers, 2) the free availability of the data required for the described experiments, such as corpora, grammars, or other computational resources, 3) the existence and free availability of reasonably good documentation for the described software.&lt;/p&gt;

&lt;p&gt;Authors should be encouraged to consider these three points. If they are not fulfilled, other researchers can neither confirm nor refute the results published in the corresponding papers. And confirming or refuting results is, I hope not only in my opinion, what a large part of the science business should be about.&lt;/p&gt;

&lt;h3&gt;The Lack of Quality&lt;/h3&gt;

&lt;p&gt;With quality I refer to the engineering part of software: it must be stable, usable and not too complicated to install and maintain. I am not referring to the actual purpose of the software. The LaTeX typesetting system is a good example: its output is regarded as some of the best-typeset books and papers out there&amp;#160;&amp;ndash; but most people writing books and papers might find its programming-like user interface simply not usable at all. Imagine Microsoft Word being »user-friendly« in the same way: it simply would not sell.&lt;/p&gt;

&lt;p&gt;But why is this so hard? One possible answer: it takes time. Plenty of time. Designing a graphical user interface (GUI) is said to take 60% of a project&#039;s time and therefore costs. Now USES usually does not include GUIs, but every programmer knows how tedious it is to catch all errors and produce intelligent error messages for the user. Again, meaningful error messages and a good user interface are not rewarded in the world of publication-based science. They steal researchers&#039; time, and so researchers tend to do without them.&lt;/p&gt;
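&lt;p&gt;To illustrate what a meaningful error message could look like for the missing-semicolon case described earlier, here is a minimal Python sketch. The rule format (&lt;code&gt;HEAD -&gt; BODY;&lt;/code&gt;, one rule per line, &lt;code&gt;#&lt;/code&gt; for comments) is invented for this example and has nothing to do with the parser from the anonymous paper; &lt;code&gt;GrammarSyntaxError&lt;/code&gt; and &lt;code&gt;load_rules&lt;/code&gt; are hypothetical names.&lt;/p&gt;

&lt;pre&gt;
```python
class GrammarSyntaxError(Exception):
    """Raised with the number of the line on which a rule is malformed."""


def load_rules(text):
    """Parse rules of the invented form 'HEAD -> BODY;', one per line."""
    rules = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        if not stripped.endswith(";"):
            raise GrammarSyntaxError(
                f"line {number}: missing ';' at end of rule: {stripped!r}"
            )
        head, arrow, body = stripped[:-1].partition("->")
        if not arrow or not head.strip() or not body.strip():
            raise GrammarSyntaxError(f"line {number}: expected 'HEAD -> BODY;'")
        rules.append((head.strip(), body.split()))
    return rules


try:
    load_rules("NP -> Det N;\nNP -> N\nPP -> P NP;")
    message = ""
except GrammarSyntaxError as error:
    message = str(error)
```
&lt;/pre&gt;

&lt;p&gt;The point is not the parsing itself but that the loader refuses to guess: it stops at the first malformed rule and names the line where the problem actually is.&lt;/p&gt;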

&lt;h3&gt;The Lack of Design and Architecture&lt;/h3&gt;

&lt;p&gt;There is an even more important type of quality: the quality of the engine under the hood, the properties of those parts that the driver of the car does not even know exist.&lt;/p&gt;

&lt;p&gt;Knowledge of software design is something that many programmers in science must do without. Even worse, it is regarded as superfluous overhead. Its absence explains the lack of what software designers call &lt;em&gt;the -ilities&lt;/em&gt;, denoting properties of code such as reusability, scalability, manageability, reliability, sustainability,&amp;#160;&amp;hellip; These are features one typically finds in enterprise software, and features that are not honored by the system of publications. An important side effect of software design is that it allows the coordination of software development in a team. Without software design, this can be quite hard, depending on the size of the project and the team. Without a team, software becomes as idiosyncratic as USES tends to be.&lt;/p&gt;

&lt;p&gt;One step further from software design, one finds software architecture. It is usually pattern-based. A pattern sketches a common problem and its solution. One could see it as a template solution or recipe for a given problem. These patterns are documented well. Using them can ease communication among developers. If a comment in the source code of a program reads »Using the Observer-Pattern here«, anybody with knowledge of software architecture needs no further explanation of what is going on. This can simplify the development in a team or the takeover by a new maintainer of the project.&lt;/p&gt;
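&lt;p&gt;For readers who have not met the Observer pattern, here is a deliberately tiny Python sketch. The class name &lt;code&gt;Subject&lt;/code&gt; follows the usual textbook terminology; the corpus loader scenario and the event strings are made up for illustration.&lt;/p&gt;

&lt;pre&gt;
```python
class Subject:
    """Keeps a list of observers and tells each of them about events."""

    def __init__(self):
        self._observers = []

    def attach(self, observer):
        """Register a callable that wants to hear about events."""
        self._observers.append(observer)

    def notify(self, event):
        """Pass the event to every registered observer, in order."""
        for observer in self._observers:
            observer(event)


# Using the Observer pattern here: the loader knows nothing about its listeners.
log = []
corpus_loader = Subject()
corpus_loader.attach(log.append)
corpus_loader.attach(lambda event: log.append(event.upper()))
corpus_loader.notify("file parsed")
```
&lt;/pre&gt;

&lt;p&gt;After the call to &lt;code&gt;notify&lt;/code&gt;, the log holds both the original message and its upper-case copy. The subject never needs to know who is listening, which is why the one-line comment is enough for anyone who knows the pattern.&lt;/p&gt;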

&lt;h3&gt;The Lack of Completeness and Maintenance&lt;/h3&gt;

&lt;p&gt;Most USES is either incomplete or many years old. If it has been written in rather  machine-oriented programming languages such as C or C++, it is often hard or impossible to get USES running on an up-to-date operating system. Why is this? So far I bombarded the system of publications with criticism. But there is another issue: science often works project-based. A project proposal is written, hopefully a grant is given, and then the project is worked on. At some point, the project is over. This usually happens way before the USES product has reached a state of completeness. Researchers are then forced to move on to other projects and the old program lies there somewhere on the Web server, becomes old and grows gray hair and is rendered unusable by time bringing changes to computer platforms and file formats. Bugs are detected by users but they are not documented on a central bug tracking system and as the development period is over, nobody will ever fix them.&lt;/p&gt;

&lt;h3&gt;Outro&lt;/h3&gt;

&lt;p&gt;In this blog post I have described the issues of software written by scientists. This is not to offend programmers out there, but the problem must be addressed. Good quality software is likely to spark interest in your work among other researchers and students. It is likely to improve the gain of knowledge in computational scientific disciplines in general, as it enables real reviews. Furthermore, good quality software has the potential of supporting good teaching instead of leaving students sitting madly frustrated in computer rooms.&lt;/p&gt;

&lt;p&gt;One question remains: how can we reward people in science for avoiding USES?&lt;/p&gt;

&lt;div id=&quot;drniPostFootnoteSec&quot;&gt;&lt;hr id=&quot;PostFootNote&quot; /&gt;
Graphics taken from &lt;a href=&quot;http://www.openclipart.org/&quot;&gt;Open Clip Art Library&lt;/a&gt;, modified by Niels Ott.&lt;/div&gt;

&lt;h3&gt;Addenda&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;2008-12-05:&lt;/strong&gt; Jochen Leidner pointed me to a readable article that discusses the same issue with a lot more analytic expertise. Read &lt;a href=&quot;http://www.d.umn.edu/~tpederse/Pubs/pedersen-last-word-2008.pdf&quot;&gt;&lt;em&gt;Empiricism is Not a Matter of Faith&lt;/em&gt;&lt;/a&gt; by Ted Pedersen.&lt;/li&gt;
&lt;/ul&gt; 
    </content:encoded>

    <pubDate>Tue,  2 Dec 2008 17:31:00 +0100</pubDate>
    <guid isPermaLink="false">http://www.drni.de/niels/s9y/archives/5-guid.html</guid>
    
</item>
<item>
    <title>Explaining Linguistics with Physics</title>
    <link>http://www.drni.de/niels/s9y/archives/7-Explaining-Linguistics-with-Physics.html</link>
            <category>Automatic Mind</category>
    
    <comments>http://www.drni.de/niels/s9y/archives/7-Explaining-Linguistics-with-Physics.html#comments</comments>
    <wfw:comment>http://www.drni.de/niels/s9y/wfwcomment.php?cid=7</wfw:comment>

    <slash:comments>5</slash:comments>
    <wfw:commentRss>http://www.drni.de/niels/s9y/rss.php?version=2.0&amp;type=comments&amp;cid=7</wfw:commentRss>
    

    <author>nospam@example.com (Niels Ott)</author>
    <content:encoded>
    &lt;div  class=&quot;drniBlogDecoFigure&quot;&gt;&lt;object  data=&quot;http://www.drni.de/niels/n3files/cl-blog/appletom.svg&quot; width=&quot;200&quot; height=&quot;207&quot;
type=&quot;image/svg+xml&quot;
codebase=&quot;http://www.adobe.com/svg/viewer/install/&quot; /&gt;
&lt;a href=&quot;http://www.drni.de/niels/n3files/cl-blog/appletom.svg&quot;&gt;&lt;img
src=&quot;http://www.drni.de/niels/n3files/cl-blog/appletom.png&quot; alt=&quot;&quot; /&gt;&lt;/a&gt;
&lt;/object&gt;
&lt;/div&gt;&lt;h3&gt;Intro&lt;/h3&gt;

&lt;p&gt;Recently, a student of languages asked me what linguistics is. She was studying at the MA level, though still in the old German &lt;em&gt;Magister&lt;/em&gt; system, and her major subject was&amp;#160;&amp;ndash; as far as I recall&amp;#160;&amp;ndash; German, which includes some obligatory linguistics courses at our university here. A simple question, but the answer is not. I have been struggling for years to find a short, easy-to-understand, and not-too-wrong explanation of what computational linguistics is about, so how about linguistics? I tried an explanation by analogy with physics, which I would like to present for discussion here.&lt;/p&gt; &lt;h3&gt;The World of Models&lt;/h3&gt;

&lt;p&gt;In physics as the common man may imagine it, people build models. These models describe how phenomena work in nature and how one can predict the outcome of a process or experiment. However, they do not describe what these phenomena actually are. Take light. In physics, one can use particle theory to describe the behavior of light. Or wave theory. Which one to choose? That depends on the scenario. For predicting the diffraction behavior of a light beam, one would use wave theory. For explaining how solar cells work, one needs a description of the photovoltaic effect and therefore particle theory. This perspective of &lt;em&gt;wave-particle duality&lt;/em&gt; can be extended to all matter and energy. Still, it does not state what light, matter, or energy really are. It helps physicists to describe (and predict) phenomena in nature. These descriptions will always be incomplete.&lt;/p&gt;

&lt;p&gt;Linguists are doing a very similar thing. They are trying to find models that describe human language. In morphology, they have a model that describes how to build plural word forms from base forms (lemmas). In syntax, there are formal descriptions that model the combination of words in a way that produces well-formed (grammatical) sentences. Yet no claim is made about what language actually is. And yet again, there are several models explaining different facets of a given phenomenon. And again, all models will be incomplete. Some researchers are working on the connection between their models and the processing of language in the human brain. Others do not care about this connection as long as their models describe a certain phenomenon correctly.&lt;/p&gt;

&lt;h3&gt;Outro&lt;/h3&gt;

&lt;p&gt;Now I am neither a physicist nor a real linguist. For the former, having taken science as a major at a German &lt;em&gt;Gymnasium&lt;/em&gt; (secondary school) is no qualification at all, and for the latter, the basic courses I took during my BA studies are only a small piece of one. What do people think about the above idea for an explanation? One objection that immediately came to mind is that most people do not know enough about physics to make the connection, which leaves them with more questions than they had before. Simplifying the physics model to, say, gravity and the falling of Newton's apple would lose the plurality of models for explaining one and the same phenomenon.&lt;/p&gt;

&lt;p&gt;Why should we be concerned with this explanation business anyway? Well, I think we need it, be it to explain to our friends why we are doing these crazy things, or to explain to sponsors why they should give us money for it. And while we discuss an explanation of linguistics that has mass appeal, others might take up the challenge of finding a similar thing for computational linguistics.&lt;/p&gt;

&lt;div id=&quot;drniPostFootnoteSec&quot;&gt;&lt;hr id=&quot;PostFootNote&quot; /&gt;
Graphics taken from &lt;a href=&quot;http://www.openclipart.org/&quot;&gt;Open Clip Art Library&lt;/a&gt;, modified by Niels Ott.&lt;/div&gt;
 
    </content:encoded>

    <pubDate>Wed, 12 Nov 2008 10:07:00 +0100</pubDate>
    <guid isPermaLink="false">http://www.drni.de/niels/s9y/archives/7-guid.html</guid>
    
</item>
<item>
    <title>Retrieving CL Publications Quickly</title>
    <link>http://www.drni.de/niels/s9y/archives/6-Retrieving-CL-Publications-Quickly.html</link>
            <category>Automatic Mind</category>
    
    <comments>http://www.drni.de/niels/s9y/archives/6-Retrieving-CL-Publications-Quickly.html#comments</comments>
    <wfw:comment>http://www.drni.de/niels/s9y/wfwcomment.php?cid=6</wfw:comment>

    <slash:comments>0</slash:comments>
    <wfw:commentRss>http://www.drni.de/niels/s9y/rss.php?version=2.0&amp;type=comments&amp;cid=6</wfw:commentRss>
    

    <author>nospam@example.com (Niels Ott)</author>
    <content:encoded>
    &lt;p&gt;There are plenty of journals, the library catalogue is huge, and time is short. In the 1990s one would have thought about a meta search engine. Now in 2008 we have Google doing it for us. How often have you googled the title of a paper you just found cited in another paper? I did so quite often and it never gave me the desired paper itself. Until I created my own Google Custom Search, &lt;a href=&quot;http://www.google.com/coop/cse?cx=000302137693213122178:ycoqke386yy&quot;&gt;Publications in Computational Linguistics&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;This search engine looks at only a small but important part of the web. I tried to include sites that host scientific papers as PDFs. If you feel that an important site is missing, let me know. If you are using iGoogle, you can install &lt;a href=&quot;http://fusion.google.com/add?moduleurl=http%3A//www.google.com/coop/api/000302137693213122178/cse/ycoqke386yy/gadget&quot;&gt;this widget&lt;/a&gt; to have the specific search box on your start page.&lt;/p&gt;

&lt;p&gt;The intended use of &lt;a href=&quot;http://www.google.com/coop/cse?cx=000302137693213122178:ycoqke386yy&quot;&gt;Publications in Computational Linguistics&lt;/a&gt; is that the user enters the full title of the desired paper into the query box. Give it a try right away:&lt;/p&gt;


&lt;center&gt;&lt;form action=&quot;http://www.google.com/cse&quot; id=&quot;cse-search-box&quot;&gt;
  &lt;div&gt;
    &lt;input type=&quot;hidden&quot; name=&quot;cx&quot; value=&quot;000302137693213122178:ycoqke386yy&quot; /&gt;
    &lt;input type=&quot;hidden&quot; name=&quot;ie&quot; value=&quot;UTF-8&quot; /&gt;
    &lt;input type=&quot;text&quot; name=&quot;q&quot; size=&quot;31&quot; /&gt;
    &lt;input type=&quot;submit&quot; name=&quot;sa&quot; value=&quot;Search&quot; /&gt;
  &lt;/div&gt;
&lt;/form&gt;
&lt;script type=&quot;text/javascript&quot; src=&quot;http://www.google.com/coop/cse/brand?form=cse-search-box&amp;lang=en&quot;&gt;&lt;/script&gt;&lt;/center&gt;

&lt;p&gt;For those recognizing that this search engine is an &lt;a href=&quot;http://www.drni.de/blog/archives/523-Eine-Suchmaschine-fuer-Publikationen-im-Bereich-der-Computerlinguistik.html&quot;&gt;old story&lt;/a&gt; that has now been updated and translated into English: of course, you are right.&lt;/p&gt;

&lt;p&gt;Concerning other blogging activities right here in this blog: there are some things going on in the internal drafts section. I am still undecided about how formal all of this should be and what the target audience may be like. So stay tuned via &lt;a href=&quot;http://www.drni.de/niels/s9y/feeds/index.rss2&quot;&gt;RSS&lt;/a&gt;, it may be worth it.&lt;/p&gt; 
    </content:encoded>

    <pubDate>Wed, 15 Oct 2008 13:18:00 +0200</pubDate>
    <guid isPermaLink="false">http://www.drni.de/niels/s9y/archives/6-guid.html</guid>
    
</item>
<item>
    <title>An Overhaul and a Brand New Start</title>
    <link>http://www.drni.de/niels/s9y/archives/4-An-Overhaul-and-a-Brand-New-Start.html</link>
            <category>Automatic Mind</category>
    
    <comments>http://www.drni.de/niels/s9y/archives/4-An-Overhaul-and-a-Brand-New-Start.html#comments</comments>
    <wfw:comment>http://www.drni.de/niels/s9y/wfwcomment.php?cid=4</wfw:comment>

    <slash:comments>0</slash:comments>
    <wfw:commentRss>http://www.drni.de/niels/s9y/rss.php?version=2.0&amp;type=comments&amp;cid=4</wfw:commentRss>
    

    <author>nospam@example.com (Niels Ott)</author>
    <content:encoded>
    &lt;p&gt;Welcome to my completely overhauled webpage. If you have been here before, some of the content will still be very familiar. The design is entirely new and still a work in progress. But what about this new subtitle, »Me and Myself and CL«? I decided to split my activities and content on the web into a private part and something like a professional one. As a result, this very webpage is concerned with me and myself as a computational linguist. And as a result of this result, there is a brand new weblog which I simply call »CL Blog«. You are reading this blog right now.&lt;/p&gt;

&lt;p&gt;The new blog will deal with CL issues only. My &lt;a href=&quot;http://www.drni.de/blog/&quot;&gt;old German blog&lt;/a&gt; will remain the blather dustbin for my private activities&amp;#160;&amp;ndash; CL excluded. Here, I am planning to write much less frequently than on the other blog but with a much stronger focus on CL topics, mostly from my experience as a student, part-time student assistant in the field, and (hopefully) prospective scientist.&lt;/p&gt;

&lt;p&gt;Feel free to subscribe using the RSS links at the very bottom of the page. Stay tuned for more computational linguistics!&lt;/p&gt;  
    </content:encoded>

    <pubDate>Fri, 12 Sep 2008 20:06:07 +0200</pubDate>
    <guid isPermaLink="false">http://www.drni.de/niels/s9y/archives/4-guid.html</guid>
    
</item>

</channel>
</rss>