lucene-java-user mailing list archives

From Pasquale Imbemba
Subject Re: Analysis/tokenization of compound words
Date Sat, 23 Sep 2006 08:18:39 GMT
Hi Otis,

I am completing my bachelor thesis at the Free University of Bolzano.
My project is exactly about what you need: a word splitter for German
compound words. Raffaella Bernardi, who is reading in CC, is my
supervisor.
As someone on the Lucene mailing list has already suggested, I have used
the lexicon of German nouns extracted from Morphy. As for the
splitting algorithm, I have used the one Maarten de Rijke and Christof
Monz published in /Shallow Morphological Analysis in Monolingual
Information Retrieval for Dutch, German and Italian/.
I did some testing and made minor improvements to it (as I needed to
"adjust" it for the solution I was working on) and could send you my
thesis paper
(actually still in draft state), which contains statistical data on 
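
To make the idea of lexicon-based splitting concrete, here is a minimal
sketch in Python. It is neither the thesis code nor the exact published
algorithm, only an illustration under stated assumptions: the tiny
LEXICON, the LINKERS tuple, the min_len cutoff and the function name
split_compound are all hypothetical.

```python
# Hypothetical toy lexicon; a real one would be loaded from a noun list
# such as the Morphy-derived lexicon mentioned above.
LEXICON = {"haus", "tür", "schlüssel", "arbeit", "amt"}

# Assumed linking elements between compound parts. Real German has more
# (for example "es", "n", "en"); only the empty linker and "s" are tried.
LINKERS = ("", "s")


def split_compound(word, lexicon=LEXICON, min_len=3):
    """Return the lexicon words that make up `word`, or None if there is
    no full decomposition and the word itself is not in the lexicon."""
    word = word.lower()
    # Try every prefix, shortest first, and recurse on the remainder.
    for i in range(min_len, len(word) - min_len + 1):
        head = word[:i]
        if head not in lexicon:
            continue
        rest = word[i:]
        for linker in LINKERS:
            if not rest.startswith(linker):
                continue
            tail = rest[len(linker):]
            if len(tail) < min_len:
                continue
            parts = split_compound(tail, lexicon, min_len)
            if parts is not None:
                return [head] + parts
    # No split found: accept the word as a whole if the lexicon knows it.
    return [word] if word in lexicon else None
```

With the toy lexicon, split_compound("Haustürschlüssel") gives
["haus", "tür", "schlüssel"] and split_compound("Arbeitsamt") gives
["arbeit", "amt"] (the linking "s" is dropped). Note that this sketch
always prefers a split over a whole-word match and takes the first
decomposition it finds, so with a large lexicon it can oversplit.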

Let me know

Otis Gospodnetic wrote:
> Hi,
> How do people typically analyze/tokenize text with compounds (e.g. German)?  I took a
> look at GermanAnalyzer hoping to see how one can deal with that, but it turns out
> GermanAnalyzer doesn't treat compounds in any special way at all.
> One way to go about this is to have a word dictionary and a tokenizer that processes
> input one character at a time, looking for a word match in the dictionary after each
> processed character.  Then, CompoundWordLikeThis could be broken down into multiple
> tokens/words and returned as a set of tokens at the same position.  However, somehow
> this doesn't strike me as a very smart and fast approach.
> What are some better approaches?
> If anyone has implemented anything that deals with this problem, I'd love to hear about
> Thanks,
> Otis
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:
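
The dictionary scan described in the quoted message can be sketched as
follows. This is a deliberately naive Python illustration, not Lucene
code: the function name compound_tokens, the min_len cutoff and the
(term, position_increment) tuples are assumptions made for the example.
A position increment of 0 is how Lucene represents a token that sits at
the same position as the previous one.

```python
def compound_tokens(word, lexicon, min_len=3):
    """Return (term, position_increment) pairs for one input word.

    The original word advances the position by 1; every lexicon word
    found inside it is stacked at the same position (increment 0).
    """
    word = word.lower()
    tokens = [(word, 1)]
    # Brute-force scan over all substrings of at least min_len characters.
    for start in range(len(word)):
        for end in range(start + min_len, len(word) + 1):
            part = word[start:end]
            if part != word and part in lexicon:
                tokens.append((part, 0))
    return tokens
```

For example, compound_tokens("Haustür", {"haus", "tür"}) returns
[("haustür", 1), ("haus", 0), ("tür", 0)]. The scan is quadratic in the
word length and accepts any embedded lexicon word, including ones that
are not real components of the compound, which is one reason a recursive
split that must cover the whole word tends to be more precise.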

"As far as the laws of mathematics refer to reality, they are not certain, as far as they
are certain, they do not refer to reality."

(Albert Einstein)

