lucene-java-user mailing list archives

From Otis Gospodnetic
Subject Analysis/tokenization of compound words
Date Tue, 19 Sep 2006 16:21:55 GMT

How do people typically analyze/tokenize text with compounds (e.g. German)?  I took a look
at GermanAnalyzer hoping to see how one can deal with that, but it turns out GermanAnalyzer
doesn't treat compounds in any special way at all.

One way to go about this is to have a word dictionary and a tokenizer that processes input
one character at a time, looking for a word match in the dictionary after each processed character.
Then, CompoundWordLikeThis could be broken down into multiple tokens/words and returned as
a set of tokens at the same position.  However, this doesn't strike me as a very smart
or fast approach.
What are some better approaches?
If anyone has implemented anything that deals with this problem, I'd love to hear about it.
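
To make the dictionary idea above concrete, here is a rough sketch in plain Java. It is not
Lucene code and is not tied to the TokenStream API: the class name CompoundSplitter, the
minPartLength parameter, and the greedy longest-match rule are all assumptions made for
illustration. It returns the original word followed by the parts it found; in an analyzer,
the parts would be emitted with a position increment of 0 so they share the compound's position.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, for illustration only; not part of Lucene.
public class CompoundSplitter {
    private final Set<String> dictionary; // known words, lowercase (assumed word list)
    private final int minPartLength;      // ignore candidate parts shorter than this

    public CompoundSplitter(Set<String> dictionary, int minPartLength) {
        this.dictionary = dictionary;
        this.minPartLength = minPartLength;
    }

    /** Returns the lowercased word followed by the dictionary words found inside it. */
    public List<String> split(String word) {
        String lower = word.toLowerCase(Locale.ROOT);
        List<String> tokens = new ArrayList<>();
        tokens.add(lower); // always keep the full compound
        int start = 0;
        while (start < lower.length()) {
            int matchEnd = -1;
            // Grow the candidate one character at a time, checking the
            // dictionary after each character; remember the longest hit.
            for (int end = start + minPartLength; end <= lower.length(); end++) {
                if (dictionary.contains(lower.substring(start, end))) {
                    matchEnd = end;
                }
            }
            if (matchEnd < 0) {
                start++; // no word starts here (e.g. a linking letter); skip it
            } else {
                String part = lower.substring(start, matchEnd);
                if (!part.equals(lower)) {
                    tokens.add(part);
                }
                start = matchEnd; // continue after the matched part
            }
        }
        return tokens;
    }
}
```

With a dictionary of {compound, word, like, this} and minPartLength 3, split("CompoundWordLikeThis")
returns [compoundwordlikethis, compound, word, like, this]. The scan does a quadratic number of
substring lookups per word, and greedy longest match can miss valid splits, which is part of why
this doesn't feel like a great approach.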

