Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Received-SPF: unknown (idunn.apache.osuosl.org: domain gmail.com does not
 designate 85.33.2.18 as permitted sender)
Message-ID: <4514EF30.60907@gmail.com>
Date: Sat, 23 Sep 2006 10:24:16 +0200
From: Pasquale Imbemba <p.imbemba@gmail.com>
User-Agent: Mozilla Thunderbird 1.5.0.7 (Windows/20060909)
MIME-Version: 1.0
To: java-user@lucene.apache.org
Subject: Re: Analysis/tokenization of compound words
References: <20060919162155.56039.qmail@web50313.mail.yahoo.com>
 <4514EDDF.50308@gmail.com>
In-Reply-To: <4514EDDF.50308@gmail.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

Otis,

I forgot to mention that I make use of Lucene for noun retrieval from 
the lexicon.
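
To give a rough idea of what lexicon-driven splitting looks like, here is a
minimal sketch. It is only an illustration under stated assumptions, not the
thesis code and not the published algorithm verbatim: the class name
CompoundSplitter, the toy lexicon, the minimum part length of 3 and the
handling of a linking "s" are all hypothetical choices made for this example.
The lexicon is a plain in-memory Set here; in a real setup the lookup could be
backed by a Lucene index over the Morphy nouns instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class CompoundSplitter {
    // Assumption: parts shorter than this are never accepted as words.
    private static final int MIN_PART = 3;

    // Assumption: lower-cased nouns, e.g. loaded from the Morphy lexicon.
    private final Set<String> lexicon;

    public CompoundSplitter(Set<String> lexicon) {
        this.lexicon = lexicon;
    }

    /**
     * Returns the lexicon words that make up the given word, or a list
     * containing only the lower-cased word if no full split is found.
     */
    public List<String> split(String word) {
        String lower = word.toLowerCase(Locale.GERMAN);
        List<String> parts = splitRecursive(lower);
        if (parts.isEmpty()) {
            parts.add(lower);
        }
        return parts;
    }

    // Returns an empty list when 'rest' cannot be covered by lexicon words.
    private List<String> splitRecursive(String rest) {
        List<String> result = new ArrayList<>();
        if (lexicon.contains(rest)) {
            result.add(rest); // a known word is kept whole
            return result;
        }
        // Try the longest candidate prefix first, then shorter ones.
        for (int i = rest.length() - MIN_PART; i >= MIN_PART; i--) {
            if (!lexicon.contains(rest.substring(0, i))) {
                continue;
            }
            String tail = rest.substring(i);
            List<String> tailParts = splitRecursive(tail);
            // Assumption: a single linking "s" may sit between two parts.
            if (tailParts.isEmpty() && tail.startsWith("s") && tail.length() > 1) {
                tailParts = splitRecursive(tail.substring(1));
            }
            if (!tailParts.isEmpty()) {
                result.add(rest.substring(0, i));
                result.addAll(tailParts);
                return result;
            }
        }
        return result; // empty: no split found
    }
}
```

With a lexicon containing "arbeit" and "amt", split("Arbeitsamt") yields
[arbeit, amt]. In an analyzer, the parts could then be emitted as extra tokens
at the same position as the original compound, as suggested in the original
question.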

Pasquale

Pasquale Imbemba wrote:
> Hi Otis,
>
> I am completing my bachelor thesis at the Free University of Bolzano 
> (www.unibz.it). My project is exactly about what you need: a word 
> splitter for German compound words. Raffaella Bernardi, who is reading 
> in CC, is my supervisor.
> As someone on the Lucene mailing list has already suggested, I have 
> used the lexicon of German nouns extracted from Morphy 
> (http://www.wolfganglezius.de/doku.php?id=public:cl:morphy). As for 
> the splitting algorithm, I have used the one Maarten de Rijke and 
> Christof Monz published in /Shallow Morphological Analysis in 
> Monolingual Information Retrieval for Dutch, German and Italian/ 
> (website here <http://www.dcs.qmul.ac.uk/%7Echristof/>, document here 
> <http://www.dcs.qmul.ac.uk/%7Echristof/publications/clef-2001-post.pdf>). 
> I tested it and made minor improvements (as I needed to adjust it for 
> the solution I was working on), and I could send you my thesis paper 
> (still in draft state), which contains statistical data on 
> correctness.
>
> Let me know
> Pasquale
>
> Otis Gospodnetic wrote:
>> Hi,
>>
>> How do people typically analyze/tokenize text with compounds (e.g. 
>> German)?  I took a look at GermanAnalyzer hoping to see how one can 
>> deal with that, but it turns out GermanAnalyzer doesn't treat 
>> compounds in any special way at all.
>>
>> One way to go about this is to have a word dictionary and a tokenizer 
>> that processes input one character at a time, looking for a word 
>> match in the dictionary after each processed character.  Then, 
>> CompoundWordLikeThis could be broken down into multiple tokens/words 
>> and returned as a set of tokens at the same position.  However, 
>> somehow this doesn't strike me as a very smart and fast approach.
>> What are some better approaches?
>> If anyone has implemented anything that deals with this problem, I'd 
>> love to hear about it.
>>
>> Thanks,
>> Otis
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>>   
>

-- 
"As far as the laws of mathematics refer to reality, they are not certain, as far as they are certain, they do not refer to reality."

(Albert Einstein)
