lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: Analysis/tokenization of compound words
Date Sat, 23 Sep 2006 19:33:39 GMT
Thanks for the pointers, Pasquale!

Otis

----- Original Message ----
From: Pasquale Imbemba <p.imbemba@gmail.com>
To: java-user@lucene.apache.org
Sent: Saturday, September 23, 2006 4:24:16 AM
Subject: Re: Analysis/tokenization of compound words

Otis,

I forgot to mention that I make use of Lucene for noun retrieval from 
the lexicon.

Pasquale

Pasquale Imbemba ha scritto:
> Hi Otis,
>
> I am completing my bachelor thesis at the Free University of Bolzano 
> (www.unibz.it). My project is exactly about what you need: a word 
> splitter for German compound words. Raffaella Bernardi who is reading 
> in CC is my supervisor.
> As some from the lucene mailing list has already suggested, I have 
> used the lexicon of German nouns extracted from Morphy 
> (http://www.wolfganglezius.de/doku.php?id=public:cl:morphy). As for 
> the splitting algorithm, I have used the one Maaten De Rijke and 
> Christof Monz have published in /Shallow Morphological Analysis in 
> Monolingual
> Information Retrieval for Dutch, German and Italian /(website here 
> <http://www.dcs.qmul.ac.uk/%7Echristof/>, document here 
> <http://www.dcs.qmul.ac.uk/%7Echristof/publications/clef-2001-post.pdf>). 
> I did some testing and minor improvement on it (as I needed to 
> "adjust" it for the solution I was working on) and could send you my 
> thesis paper (actually still in draft state), which contains 
> statistical data on correctness.
>
> Let me know
> Pasquale
>
> Otis Gospodnetic ha scritto:
>> Hi,
>>
>> How do people typically analyze/tokenize text with compounds (e.g. 
>> German)?  I took a look at GermanAnalyzer hoping to see how one can 
>> deal with that, but it turns out GermanAnalyzer doesn't treat 
>> compounds in any special way at all.
>>
>> One way to go about this is to have a word dictionary and a tokenizer 
>> that processes input one character at a time, looking for a word 
>> match in the dictionary after each processed characters.  Then, 
>> CompoundWordLikeThis could be broken down into multiple tokens/words 
>> and returned at a set of tokens at the same position.  However, 
>> somehow this doesn't strike me as a very smart and fast approach.
>> What are some better approaches?
>> If anyone has implemented anything that deals with this problem, I'd 
>> love to hear about it.
>>
>> Thanks,
>> Otis
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>>   
>

-- 
"As far as the laws of mathematics refer to reality, they are not certain, as far as they
are certain, they do not refer to reality."

(Albert Einstein)


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org





---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message