lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Marvin Humphrey <mar...@rectangular.com>
Subject Re: Analysis/tokenization of compound words
Date Tue, 19 Sep 2006 16:52:42 GMT

On Sep 19, 2006, at 9:21 AM, Otis Gospodnetic wrote:

> How do people typically analyze/tokenize text with compounds (e.g.  
> German)?  I took a look at GermanAnalyzer hoping to see how one can  
> deal with that, but it turns out GermanAnalyzer doesn't treat  
> compounds in any special way at all.
>
> One way to go about this is to have a word dictionary and a  
> tokenizer that processes input one character at a time, looking for  
> a word match in the dictionary after each processed characters.   
> Then, CompoundWordLikeThis could be broken down into multiple  
> tokens/words and returned at a set of tokens at the same position.   
> However, somehow this doesn't strike me as a very smart and fast  
> approach.

This came up on the KinoSearch list a few weeks ago, and best  
solution I could think of used essentially the same algorithm you  
describe.

During the discussion, we found this:

http://www.glue.umd.edu/~oard/courses/708a/fall01/838/P2/

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message