Niels Ott

Computational Linguist

Automatic Mind

Named Entity list from German Wikipedia

Wednesday, February 3. 2010 • Category: Automatic Mind • Comment (1) • Trackbacks (0)

In 2008, I developed an anagram solver that is still presented online in some German posts of my other blog. It actually is an anagram solver, but it is presented as a riddle solver for a stupid type of TV quiz. To cut a long story short: never ever call the numbers on the screen. If you just want to know the solution, use the linked pages and turn off your TV right after.
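For readers curious about the mechanics, here is a minimal sketch of one common way to solve anagrams. It is not the code behind the solver mentioned above; the function names and the toy word list are made up for illustration. The idea is to index every word list entry by its sorted letters, so that all anagrams of an entry share the same key.

```python
from collections import defaultdict


def signature(text):
    """Return the sorted letters of text, ignoring case, spaces and punctuation."""
    return "".join(sorted(ch for ch in text.lower() if ch.isalpha()))


def build_index(entries):
    """Map each letter signature to all word list entries sharing it."""
    index = defaultdict(list)
    for entry in entries:
        index[signature(entry)].append(entry)
    return index


def solve(scrambled, index):
    """Return every entry that uses exactly the letters of scrambled."""
    return sorted(index.get(signature(scrambled), []))


# Toy word list (illustrative only); a real solver needs a large list.
WORD_LIST = ["Lager", "Regal", "Nebel", "Leben", "Albert Einstein"]
INDEX = build_index(WORD_LIST)

print(solve("ALGER", INDEX))
print(solve("nietsnie trebla", INDEX))
```

Because the lookup is a single dictionary access, the quality of the results depends almost entirely on what is in the word list.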

So what's interesting about anagram solvers and these riddles? Well, one needs a fairly large word list. I took the one from the Ispell dictionary, which I also have online on the BananaSplit page. I soon found out that these riddles made heavy use of celebrity names. Yet my word list back then contained only a few named entities. I solved this issue by harvesting Wikipedia. What do all humans have in common? Right, they are born at some point. And of course, everybody has to die. Now Wikipedia has these births and deaths lists in the entry for each year of the calendar. Back in May 2008, I downloaded all pages from year 1000 to year 2008 and simply extracted all the persons listed there as born or deceased. This gave me a list of 51,353 person names from German Wikipedia.

Recently a colleague asked me if he could have this list. So why not simply put it online? Here we go.

  • Person name list from Wikipedia, CSV (bz2 compressed, 0.55 MB) – downloaded in May 2008.
  • Perl script used for extracting the named entities, in case you want to do this yourself.
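If you would rather not read Perl, the extraction idea can be sketched in Python roughly as follows. This is not a port of the script linked above. It assumes that a year page is available as wikitext, that the relevant sections are headed "Geboren" and "Gestorben", and that each bullet links to the person, possibly preceded by a date link. All of these are assumptions about the page layout, and the sample text is invented.

```python
import re

# Assumed section titles on German year pages; the real pages may differ.
SECTION_TITLES = {"Geboren", "Gestorben"}

# Level-2 headings only, so that sub-headings stay inside their section.
HEADING = re.compile(r"^==\s*([^=].*?)\s*==\s*$")
# Captures the link target of [[Target]], [[Target|label]] or [[Target#part]].
WIKI_LINK = re.compile(r"\[\[([^\]|#]+)(?:[|#][^\]]*)?\]\]")
# Links such as [[3. Februar]] or [[1879]] point to dates, not persons.
DATE_LINK = re.compile(r"^(\d{1,2}\.\s+\w+|\d{1,4})$")


def extract_persons(wikitext):
    """Collect the first non-date link of every bullet in the assumed sections."""
    persons = []
    in_section = False
    for line in wikitext.splitlines():
        heading = HEADING.match(line)
        if heading:
            in_section = heading.group(1) in SECTION_TITLES
            continue
        if not (in_section and line.startswith("*")):
            continue
        for target in WIKI_LINK.findall(line):
            target = target.strip()
            if not DATE_LINK.match(target):
                persons.append(target)
                break
    return persons


# Invented sample in the assumed layout, with placeholder names.
SAMPLE = """\
== Ereignisse ==
* [[Beispielstadt]] wird gegründet.

== Geboren ==
* [[14. März]]: [[Max Mustermann]], deutscher Physiker
* [[Erika Mustermann|Mustermann, Erika]], deutsche Chemikerin

== Gestorben ==
* [[5. November]]: [[Otto Normalverbraucher]], deutscher Schauspieler
"""

print(extract_persons(SAMPLE))
```

Taking only the first non-date link per bullet is a deliberate simplification: it skips links to professions or places that follow the name, but it will also pick up the wrong link whenever a bullet does not start with the person.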

I hope this is useful for some people out there. Feel free to let me know what you're using it for or what improvements could be made. You can use the comment box below or write me an e-mail.

Ärger mit der Örgele

Tuesday, January 26. 2010 • Category: Automatic Mind • Comments (4) • Trackbacks (0)

Apologies to my English speaking readers for the rest of this article being written in German. It simply seems odd to formulate complaints about a southern German newspaper article dealing with a southern German dialect in English. Stay tuned, more general stuff will come in English as usual in the future.

»From Lorch, the surprising news reaches us that there is a Swabian word for the drill chuck key. It is Ärgele.« So begins another installment of the column Schwäbisch auf Anfrage by Henning Petershagen, printed as usual in the Saturday magazine of the Südwest Presse, more precisely on January 9, 2010. Unfortunately, the publisher does not offer this debatable article online, which is why I took the risk of providing a copy. This post is not about attacking the dead-tree media either. Rather, reading the column triggered my Stefanowitsch reflex: when people comment on language in public and overlook the basics of linguistics, somebody has to comment on it. And a blog post is just the right place for that. But first things first…

Continue reading "Ärger mit der Örgele"

Hohenheimer Verständlichkeitsindex

Saturday, September 26. 2009 • Category: Automatic Mind • Comment (1) • Trackbacks (2)

Intro

Tomorrow we are going to have elections to the Bundestag here in Germany. Democracies come in all kinds of flavors. In the German flavor, the Bundestag is the parliament superordinate to all federal states (Bundesländer). The Bundestag elections are the most important ones in Germany. Traditionally, parties present manifestos. Wikipedia does not miss the point by claiming that »in recent decades the status of electoral manifestos has diminished somewhat due to a significant tendency for winning parties to, following the election, either ignore, indefinitely delay, or even outright reject manifesto policies which were popular with the public.« Manifestos are promises that are seldom kept. Even worse, they do not even seem to be written for a public audience. At least this is what the results of an investigation based on content analysis and readability measures suggest. In this blog post I try to present how researchers from Hohenheim University and a private company in Ulm are using readability to assess the understandability of manifestos.

Continue reading "Hohenheimer Verständlichkeitsindex"

Text Difficulty and Information Retrieval

Thursday, August 20. 2009 • Category: Automatic Mind • Comments (2) • Trackbacks (2)

Intro

It has been a long pause here on Automatic Mind. After finishing my Master's project and Thesis, it took me some time to adjust to my new situation as a researcher here at Tübingen University. Meanwhile, some things have happened in the readability corner. The tool for computing readability formulas that I demonstrated as a Java applet in an earlier post is now freely available as a Java library, including the applet and a standalone demo GUI. Some bugs have been squashed and all formulas have been cross-checked with the corresponding original publications. In this post I will focus on what one can do with those readability formulas in information retrieval. This is a brief summary of topics from my MA Thesis entitled Information Retrieval for Language Learning: An Exploration of Text Difficulty Measures. The practical part of my thesis lives on as the Information Retrieval for Language Learning (IR4LL) project, which also features an online demo and a web site.

Continue reading "Text Difficulty and Information Retrieval"

CL Blogs and a New Name

Monday, February 16. 2009 • Category: Automatic Mind • Comments (2) • Trackbacks (0)

Marveling at Jason Adam's collection of computational linguistics blogs, I noticed that CL Blog is a rather dull name for a blog. It somehow felt like naming a newspaper Newspaper. Back then I decided that this blog needs a new name. I just renamed it to Automatic Mind. The term actually is related to Dual Process Theory and refers to the fact that we can simultaneously walk and talk, or perform other tasks of which one is conscious and the other one subconscious. Then again, it also refers to computational linguistics. The human mind can process language. The computer maybe can – a little bit. What we need or dream of is an automatic mind.

Currently I am consumed by working on my Master's Thesis, so I rarely find time to read blogs, let alone write serious posts. Please stay subscribed.


Graphics taken from Open Clip Art Library, modified by Niels Ott.

Simple Readability Formulas And Boring Preprocessing

Friday, January 23. 2009 • Category: Automatic Mind • Comments (5) • Trackback (1)

Intro

Readability formulas date back to the 1920s. They come in countless shapes and flavors, all sharing one common dream of their makers: to have a simple mathematical means of determining the reading difficulty of a given text. Is this text suitable as a reading for 4th-graders? Just stuff it into the formula and you will know which grade level it fits. Of course, people put up warning signs telling the naive users out there what to do and what not to do with these formulas. But don't these formulas resemble the big dream of all natural language processing (NLP)? After all, all we want to have is something smart and simple that does the job of dealing with real world language. In this blog post, I will give a basic introduction to readability measures and I will point out in some detail that ›boring‹ preprocessing steps such as tokenization and sentence splitting are often underestimated. An interactive demo for computing readability scores is included.
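To give a first impression of why preprocessing matters, here is a minimal sketch. It is not taken from the interactive demo or from any library. It uses the Automated Readability Index as an example formula because that formula only needs character, word and sentence counts; the naive splitter, the tokenizer and the sample text are assumptions made up for illustration.

```python
import re


def naive_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace: the naive approach."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokens(text):
    """Treat every run of letters, digits and apostrophes as one word."""
    return re.findall(r"[A-Za-z0-9']+", text)


def ari(n_chars, n_words, n_sentences):
    """Automated Readability Index from raw counts."""
    return 4.71 * n_chars / n_words + 0.5 * n_words / n_sentences - 21.43


# Invented sample: two sentences for a human reader.
TEXT = "Dr. Smith met Mr. Jones at 5 p.m. in Boston. They talked for hours."

words = tokens(TEXT)
n_chars = sum(len(w) for w in words)
n_naive = len(naive_sentences(TEXT))

print("sentences found by the naive splitter:", n_naive)
print("ARI with naive sentence count:", ari(n_chars, len(words), n_naive))
print("ARI with the true count of 2: ", ari(n_chars, len(words), 2))
```

The abbreviations make the naive splitter see more sentences than there are, and the tokenizer breaks "p.m." into two words, so the same formula yields noticeably different scores depending on nothing but the preprocessing.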

Continue reading "Simple Readability Formulas And Boring Preprocessing"

The USES Issue

Tuesday, December 2. 2008 • Category: Automatic Mind • Comments (13) • Trackbacks (0)

Intro

It is hard to name the phenomenon without offending someone. Good names would be Scienceware, or Guruware, or even better Scientistware. They are all taken by companies or other institutions that presumably all do far too good a job to lend their name to a negative phenomenon. So let me call it USES for Unsustainable Software Emerging from Science. This blog post shall shed some light on the issues of USES and on possible reasons.

Continue reading "The USES Issue"

Explaining Linguistics with Physics

Wednesday, November 12. 2008 • Category: Automatic Mind • Comments (5) • Trackbacks (0)

Intro

Recently, I was asked by a student of languages what linguistics is. She was a student on the MA level, yet in the old German Magister system, and her major subject was – as far as I recall – German, which includes some obligatory courses on linguistics at our University here. A simple question, but not so the answer. I have been struggling for years now to find a short, easy to understand and not too wrong explanation of what computational linguistics is about. Now how about linguistics? I tried it with a physics explanation which I would like to present for discussion here.

Continue reading "Explaining Linguistics with Physics"

Retrieving CL Publications Quickly

Wednesday, October 15. 2008 • Category: Automatic Mind • Comments (0) • Trackbacks (0)

There are plenty of journals, the library catalogue is huge, and time is short. In the 90s one would have thought about a meta search engine. Now in 2008 we have Google doing it for us. How often did you google the title of a paper you just found cited in another paper? I did so quite often and it never gave me the desired paper as such. Until I created my own Google Custom Search, Publications in Computational Linguistics.

Continue reading "Retrieving CL Publications Quickly"

An Overhaul and a Brand New Start

Friday, September 12. 2008 • Category: Automatic Mind • Comments (0) • Trackbacks (0)

Welcome to my completely overhauled webpage. If you have been here before, some of the contents will still be very familiar. The design is entirely new and still a work in progress. But what about this new subtitle, »Me and Myself and CL«? I decided to split my activities and content on the web into private and something like professional. As a result, this very webpage is concerned with me and myself as a computational linguist. And as a result of this result, there is a brand new weblog which I simply call »CL Blog«. You are reading this blog right now.

The new blog will deal with CL issues only. My old German blog will remain the blather dustbin for my private activities – CL excluded. Here, I am planning to write much less frequently than on the other blog but with a much stronger focus on CL topics, mostly from my experience as a student, part-time student assistant in the field, and prospective scientist (hopefully).

Feel free to subscribe using the RSS links at the very bottom of the page. Stay tuned for more computational linguistics!