lucene-java-user mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: Analysis/tokenization of compound words
Date Sat, 23 Sep 2006 19:25:39 GMT
Yes, I think it's the same thing - word segmentation - http://www.google.com/search?q=word+segmentation

You may get the same ad(word) as I did: the Basistech folks from Cambridge, MA
have various interesting products, including some that deal with CJK (I'm not
sure whether they actually do word segmentation or just n-gram the input).
Guess who their biggest customer is?  Hint: it starts with the letter G.
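
For concreteness, a minimal sketch of what "just n-gram the input" can mean
in practice: overlapping character bigrams, which is essentially what
Lucene's CJKAnalyzer emits for CJK runs. Plain Java with no Lucene
dependency; the class and method names are made up for illustration, and it
ignores surrogate pairs for brevity.

import java.util.ArrayList;
import java.util.List;

public class CjkBigramSketch {

    // Emit overlapping character bigrams, e.g. "東京都" -> [東京, 京都].
    // Bigramming sidesteps word segmentation entirely: matching happens
    // on overlapping two-character tokens instead of on real words.
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("東京都")); // [東京, 京都]
    }
}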

Otis


----- Original Message ----
From: Marvin Humphrey <marvin@rectangular.com>
To: java-user@lucene.apache.org
Sent: Saturday, September 23, 2006 11:14:49 AM
Subject: Re: Analysis/tokenization of compound words


On Sep 20, 2006, at 12:07 AM, Daniel Naber wrote:

> Writing a decomposer is difficult as you need both a large dictionary
> *without* compounds and a set of rules to avoid splitting at too many
> positions.

Conceptually, how different is the problem of decompounding German  
from tokenizing languages such as Thai and Japanese, where "words"  
are not separated by spaces and may consist of multiple characters?
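
For the German side of that comparison, here is a minimal sketch of the
dictionary-plus-rules decompounding Daniel describes: a longest-match-first
split with backtracking over a dictionary of non-compound stems, plus one
crude length rule. Plain Java; the toy dictionary and all names are
illustrative, not anyone's actual implementation.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DecompoundSketch {

    // Toy dictionary of non-compound stems; a real one would be large.
    static final Set<String> DICT = new HashSet<>(
            Arrays.asList("donau", "dampf", "schiff", "fahrt"));

    // One crude "rule" to avoid splitting at too many positions:
    // never accept a part shorter than three characters.
    static final int MIN_PART = 3;

    // Longest-match-first split with backtracking; returns null when
    // the word cannot be fully covered by dictionary entries.
    static List<String> split(String word) {
        if (word.isEmpty()) return new ArrayList<>();
        for (int end = word.length(); end >= MIN_PART; end--) {
            String head = word.substring(0, end);
            if (DICT.contains(head)) {
                List<String> rest = split(word.substring(end));
                if (rest != null) {
                    rest.add(0, head);
                    return rest;
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Prints [donau, dampf, schiff, fahrt]
        System.out.println(split("donaudampfschifffahrt"));
    }
}

A greedy splitter like this still over-splits real German (linking elements
such as the Fugen-s, for instance), which is exactly where the extra rules
Daniel mentions come in.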

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

