lucene-java-user mailing list archives

From Pasquale Imbemba <p.imbe...@gmail.com>
Subject Re: Analysis/tokenization of compound words
Date Sat, 23 Sep 2006 08:18:39 GMT
Hi Otis,

I am completing my bachelor thesis at the Free University of Bolzano 
(www.unibz.it). My project is exactly about what you need: a word 
splitter for German compound words. Raffaella Bernardi, who is copied on 
this message, is my supervisor.
As someone on the Lucene mailing list has already suggested, I have used 
the lexicon of German nouns extracted from Morphy 
(http://www.wolfganglezius.de/doku.php?id=public:cl:morphy). As for the 
splitting algorithm, I have used the one that Maarten de Rijke and 
Christof Monz published in "Shallow Morphological Analysis in Monolingual 
Information Retrieval for Dutch, German and Italian" (website: 
<http://www.dcs.qmul.ac.uk/%7Echristof/>, paper: 
<http://www.dcs.qmul.ac.uk/%7Echristof/publications/clef-2001-post.pdf>). 
I tested it and made minor improvements (I needed to adapt it to the 
solution I was working on). I could send you my thesis (still in draft 
state), which contains statistical data on the correctness of the 
splitting.
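
For illustration only, lexicon-based recursive splitting can be sketched 
roughly as below. This is not the code from the thesis and not the exact 
algorithm from the paper: the class name CompoundSplitter, the minimum 
part length of 3, the shortest-head-first search order, and the handling 
of a linking "s" are all assumptions made for the sketch. The lexicon is 
expected to hold lower-cased nouns.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Illustrative sketch: splits a compound into parts found in a noun lexicon. */
final class CompoundSplitter {

    // Assumption: parts shorter than this are never considered.
    private static final int MIN_PART_LENGTH = 3;

    private CompoundSplitter() {}

    /**
     * Returns the lower-cased parts of the word if it decomposes completely
     * into lexicon entries; otherwise returns the lower-cased word unsplit.
     */
    static List<String> split(String word, Set<String> lexicon) {
        String w = word.toLowerCase(Locale.ROOT);
        List<String> parts = decompose(w, lexicon);
        return parts.isEmpty() ? Collections.singletonList(w) : parts;
    }

    /** Returns an empty list when w is neither decomposable nor in the lexicon. */
    private static List<String> decompose(String w, Set<String> lexicon) {
        // Try every split point that leaves both sides long enough.
        for (int i = MIN_PART_LENGTH; i <= w.length() - MIN_PART_LENGTH; i++) {
            String head = w.substring(0, i);
            if (!lexicon.contains(head)) {
                continue;
            }
            String rest = w.substring(i);
            List<String> tail = decompose(rest, lexicon);
            // If the remainder fails, retry without a linking "s" (Arbeit-s-amt).
            if (tail.isEmpty() && rest.startsWith("s") && rest.length() > MIN_PART_LENGTH) {
                tail = decompose(rest.substring(1), lexicon);
            }
            if (!tail.isEmpty()) {
                List<String> result = new ArrayList<>();
                result.add(head);
                result.addAll(tail);
                return result;
            }
        }
        // No split worked: accept the word itself only if the lexicon knows it.
        if (lexicon.contains(w)) {
            return Collections.singletonList(w);
        }
        return Collections.emptyList();
    }
}
```

With a lexicon containing arbeit, amt, kinder, garten and platz, 
split("Arbeitsamt", lexicon) gives [arbeit, amt] and 
split("Kindergartenplatz", lexicon) gives [kinder, garten, platz]; a word 
with no lexicon match comes back unsplit. Trying the shortest head first 
can over-split when the lexicon contains short nouns, so a real 
implementation would need some way of ranking competing splits.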

Let me know
Pasquale

Otis Gospodnetic wrote:
> Hi,
>
> How do people typically analyze/tokenize text with compounds (e.g.
> German)?  I took a look at GermanAnalyzer hoping to see how one can deal
> with that, but it turns out GermanAnalyzer doesn't treat compounds in any
> special way at all.
>
> One way to go about this is to have a word dictionary and a tokenizer
> that processes input one character at a time, looking for a word match in
> the dictionary after each processed character.  Then,
> CompoundWordLikeThis could be broken down into multiple tokens/words and
> returned as a set of tokens at the same position.  However, somehow this
> doesn't strike me as a very smart and fast approach.
> What are some better approaches?
> If anyone has implemented anything that deals with this problem, I'd love
> to hear about it.
>
> Thanks,
> Otis
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
>   

-- 
"As far as the laws of mathematics refer to reality, they are not certain, as far as they
are certain, they do not refer to reality."

(Albert Einstein)



