lucene-java-user mailing list archives

From Marvin Humphrey <>
Subject Re: Analysis/tokenization of compound words
Date Sat, 23 Sep 2006 15:14:49 GMT

On Sep 20, 2006, at 12:07 AM, Daniel Naber wrote:

> Writing a decomposer is difficult as you need both a large dictionary
> *without* compounds and a set of rules to avoid splitting at too many
> positions.

Conceptually, how different is the problem of decompounding German
from tokenizing languages such as Thai and Japanese, where "words"
are not separated by spaces and a single word may span multiple characters
with no marked boundary?
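For concreteness, the dictionary-driven splitting Daniel describes can be sketched roughly as below. This is only an illustrative sketch, not Lucene code: the word list and the example compound are made up, and a real decomposer would also need rules for linking morphemes (e.g. the German "-s-" or "-n-") and a way to prefer splits with fewer parts, which is exactly where "splitting at too many positions" becomes a problem.

```python
# Hypothetical sketch of greedy dictionary-based decompounding.
# `dictionary` is assumed to hold simple (non-compound) words, lowercase.
def split_compound(word, dictionary, min_len=3):
    """Return a list of dictionary parts covering `word`, or None."""
    if word in dictionary:
        return [word]
    # Try every prefix of at least `min_len` characters; recurse on the rest.
    for i in range(min_len, len(word) - min_len + 1):
        head, tail = word[:i], word[i:]
        if head in dictionary:
            rest = split_compound(tail, dictionary, min_len)
            if rest is not None:
                return [head] + rest
    return None  # no complete cover found

words = {"saft", "flasche", "orange"}   # toy word list
print(split_compound("saftflasche", words))   # ['saft', 'flasche']
print(split_compound("qwertz", words))        # None
```

The greedy recursion returns the first cover it finds; a production splitter would score candidate splits (fewest parts, longest parts, or frequency-weighted) rather than accept the first one.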

Marvin Humphrey
Rectangular Research
