lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Marvin Humphrey <>
Subject Re: Analysis/tokenization of compound words
Date Tue, 19 Sep 2006 16:52:42 GMT

On Sep 19, 2006, at 9:21 AM, Otis Gospodnetic wrote:

> How do people typically analyze/tokenize text with compounds (e.g.  
> German)?  I took a look at GermanAnalyzer hoping to see how one can  
> deal with that, but it turns out GermanAnalyzer doesn't treat  
> compounds in any special way at all.
> One way to go about this is to have a word dictionary and a  
> tokenizer that processes input one character at a time, looking for  
> a word match in the dictionary after each processed characters.   
> Then, CompoundWordLikeThis could be broken down into multiple  
> tokens/words and returned at a set of tokens at the same position.   
> However, somehow this doesn't strike me as a very smart and fast  
> approach.

This came up on the KinoSearch list a few weeks ago, and best  
solution I could think of used essentially the same algorithm you  

During the discussion, we found this:

Marvin Humphrey
Rectangular Research

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message