lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Otis Gospodnetic <>
Subject Re: Analysis/tokenization of compound words
Date Sat, 23 Sep 2006 19:33:39 GMT
Thanks for the pointers, Pasquale!


----- Original Message ----
From: Pasquale Imbemba <>
Sent: Saturday, September 23, 2006 4:24:16 AM
Subject: Re: Analysis/tokenization of compound words


I forgot to mention that I make use of Lucene for noun retrieval from 
the lexicon.


Pasquale Imbemba ha scritto:
> Hi Otis,
> I am completing my bachelor thesis at the Free University of Bolzano 
> ( My project is exactly about what you need: a word 
> splitter for German compound words. Raffaella Bernardi who is reading 
> in CC is my supervisor.
> As some from the lucene mailing list has already suggested, I have 
> used the lexicon of German nouns extracted from Morphy 
> ( As for 
> the splitting algorithm, I have used the one Maaten De Rijke and 
> Christof Monz have published in /Shallow Morphological Analysis in 
> Monolingual
> Information Retrieval for Dutch, German and Italian /(website here 
> <>, document here 
> <>). 
> I did some testing and minor improvement on it (as I needed to 
> "adjust" it for the solution I was working on) and could send you my 
> thesis paper (actually still in draft state), which contains 
> statistical data on correctness.
> Let me know
> Pasquale
> Otis Gospodnetic ha scritto:
>> Hi,
>> How do people typically analyze/tokenize text with compounds (e.g. 
>> German)?  I took a look at GermanAnalyzer hoping to see how one can 
>> deal with that, but it turns out GermanAnalyzer doesn't treat 
>> compounds in any special way at all.
>> One way to go about this is to have a word dictionary and a tokenizer 
>> that processes input one character at a time, looking for a word 
>> match in the dictionary after each processed characters.  Then, 
>> CompoundWordLikeThis could be broken down into multiple tokens/words 
>> and returned at a set of tokens at the same position.  However, 
>> somehow this doesn't strike me as a very smart and fast approach.
>> What are some better approaches?
>> If anyone has implemented anything that deals with this problem, I'd 
>> love to hear about it.
>> Thanks,
>> Otis
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> For additional commands, e-mail:

"As far as the laws of mathematics refer to reality, they are not certain, as far as they
are certain, they do not refer to reality."

(Albert Einstein)

To unsubscribe, e-mail:
For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message