Niels Ott

Computational Linguist
BananaSplit

Dictionary-based Compound Splitter for German

BananaSplit is a compound splitter for German that uses a dictionary resource. The dictionary can be a simple word list, a word list with POS values, or an XML-based dictionary. The original version was able to use GermaNet as a dictionary. This is useful in applications that rely on GermaNet anyway: no additional lexicon needs to be generated and held in memory. This was also the original purpose of BananaSplit: it served as the compound splitter for a tool called BananaRelation.

BananaRelation cannot be published here as it makes heavy use of unpublished code by EML Research, Heidelberg. BananaSplit can either be used as a standalone application or it can be integrated into other Java programs (as a library).

This program emerged from the seminar Lexical Semantic Processing in NLP (winter term 2005/2006) taught by Iryna Gurevych at the Seminar für Sprachwissenschaft, Tübingen. Both BananaSplit and BananaRelation were introduced to the seminar participants on 17 December 2005.

The key algorithm for compound splitting is based on Langer (1998). The program was used in Müller and Gurevych (2006). Please note that the program splits compounds into two parts only. Details are given in the documents linked below.
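
To give a rough idea of what splitting a compound into exactly two parts against a dictionary can look like, here is a minimal sketch. It is neither Langer's (1998) algorithm nor BananaSplit's actual code: the class name BinarySplitSketch, the list of linking elements, and the minimum part length are all assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative only: a naive binary splitter, NOT BananaSplit's implementation.
final class BinarySplitSketch {

    // Assumed linking elements (Fugenelemente) that may sit between the parts.
    private static final String[] LINKERS = {"", "s", "es", "n", "en"};

    // Assumed minimum length of each part; an arbitrary threshold.
    private static final int MIN_PART = 3;

    // Returns every {modifier, head} pair whose two parts are both in the lexicon.
    // The lexicon is expected to contain lower-cased lemmas.
    static List<String[]> split(String word, Set<String> lexicon) {
        String w = word.toLowerCase(Locale.GERMAN);
        List<String[]> result = new ArrayList<>();
        for (int i = MIN_PART; i <= w.length() - MIN_PART; i++) {
            String head = w.substring(i);
            if (!lexicon.contains(head)) {
                continue; // the right-hand part must be a known word
            }
            String left = w.substring(0, i);
            for (String linker : LINKERS) {
                if (!left.endsWith(linker)) {
                    continue;
                }
                // Strip the linking element and look up what remains.
                String modifier = left.substring(0, left.length() - linker.length());
                if (modifier.length() >= MIN_PART && lexicon.contains(modifier)) {
                    result.add(new String[] {modifier, head});
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> lexicon = Set.of("arbeit", "amt", "haus", "boot");
        for (String word : new String[] {"Arbeitsamt", "Hausboot"}) {
            for (String[] parts : split(word, lexicon)) {
                System.out.println(word + " = " + parts[0] + " + " + parts[1]);
            }
        }
    }
}
```

With the toy lexicon in main, "Arbeitsamt" is analysed as "arbeit" + "amt" (dropping the linking "s") and "Hausboot" as "haus" + "boot". A real splitter also has to rank competing analyses, which this sketch does not attempt.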

Theoretical Background

As mentioned earlier, this program emerged from a seminar, so its theoretical background was presented there as a talk. Additionally, there is a homework paper describing the evaluation process and results. Neither work fulfills the requirements of a scientific publication, but both will give you an idea of the concepts behind the tool.

  • Slides of the talk »Measuring Semantic Relatedness of German Compounds using GermaNet«, given during winter term 2005/2006.
  • The homework paper entitled »Evaluation of the BananaSplit Compound Splitter«.

Apart from the resources named above, the sources contain JavaDoc comments that explain many of the technical details.

Preview

Take a look at this screen shot.

Dictionary Resources: GermaNet and ispell

As mentioned above, the original version of BananaSplit used GermaNet via the GermaNet API. While using GermaNet as a dictionary makes sense in many situations, the API is not open source in the sense of free speech. Therefore, from release 0.3.1 on, BananaSplit no longer supports GermaNet. However, there is good news: rumour has it that a new GermaNet API is currently under construction. Assuming this new API will be open source, GermaNet support will soon be back in BananaSplit.

BananaSplit supports the vertical dictionary format. This means you have one lemma per line and optionally a \t character followed by the POS. Additionally, the same information can be encoded in the XML format used by java.util.Properties. A lemma dictionary based on the igerman98 ispell dictionary by Björn Jacke is available in this format (see below).
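
The two formats described above can be illustrated with a small, self-contained sketch. It is not BananaSplit's own loader: the class name DictionaryFormatSketch and its methods are made up for this example, the POS value "NN" is only a placeholder, and storing the lemma as the key and the POS as the value of each XML entry is an assumption about how the Properties file is laid out. The only library call relied on for the XML part is the standard java.util.Properties.loadFromXML.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

// Illustrative only: hypothetical readers for the two dictionary formats.
final class DictionaryFormatSketch {

    // Vertical format: one lemma per line, optionally followed by TAB and a POS.
    // Lemmas without a POS are mapped to the empty string.
    static Map<String, String> readVertical(Reader source) {
        Map<String, String> lexicon = new LinkedHashMap<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue; // skip blank lines
                }
                int tab = line.indexOf('\t');
                String lemma = tab < 0 ? line : line.substring(0, tab);
                String pos = tab < 0 ? "" : line.substring(tab + 1);
                lexicon.put(lemma.trim(), pos.trim());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lexicon;
    }

    // java.util.Properties XML format; ASSUMES key = lemma and value = POS.
    static Map<String, String> readPropertiesXml(InputStream source) {
        Properties props = new Properties();
        try {
            props.loadFromXML(source);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<String, String> lexicon = new LinkedHashMap<>();
        for (String lemma : props.stringPropertyNames()) {
            lexicon.put(lemma, props.getProperty(lemma));
        }
        return lexicon;
    }

    public static void main(String[] args) {
        String vertical = "Haus\tNN\nBoot\n";
        String xml = String.join("\n",
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>",
            "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">",
            "<properties>",
            "  <entry key=\"Haus\">NN</entry>",
            "  <entry key=\"Boot\"></entry>",
            "</properties>");

        System.out.println(readVertical(new StringReader(vertical)));
        System.out.println(readPropertiesXml(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8))));
    }
}
```

Both readers return the same lemma-to-POS map for equivalent input, which is the sense in which the XML file encodes "the same information" as the vertical list.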

Using and Testing

Run
java -jar banana-split-0.4.0.jar igerman98_all.xml
on the command line. This will automatically load the BananaSplit test program allowing you to type in words directly.

Download Programs, Sources and Resources

Please be aware that, since release 0.4.0, this program is released under the Apache License, Version 2.0.

Version History

References

Langer, Stefan (1998). Zur Morphologie und Semantik von Nominalkomposita. In Tagungsband der 4. Konferenz zur Verarbeitung natürlicher Sprache, KONVENS, pp. 83-97. [Citeseer]

Müller, Christof and Gurevych, Iryna (2006). Exploring the Potential of Semantic Relatedness in Information Retrieval. In Martin Schaaf and Klaus-Dieter Althoff (eds.), LWA 2006 Lernen - Wissensentdeckung - Adaptivität, 9.-11.10.2006 in Hildesheim, Hildesheimer Informatikberichte, Universität Hildesheim, pp. 126-131. [Abstract & Paper]

Posted by Niels Ott • 2009-02-19