lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From karl wettin <>
Subject Re: Analysis/tokenization of compound words
Date Wed, 20 Sep 2006 08:56:51 GMT
On Tue, 2006-09-19 at 09:21 -0700, Otis Gospodnetic wrote:
> How do people typically analyze/tokenize text with compounds (e.g.
> German)?  I took a look at GermanAnalyzer hoping to see how one can
> deal with that, but it turns out GermanAnalyzer doesn't treat
> compounds in any special way at all.

I've been looking close at this, but for Swedish. The major problem in
that case is how a composite word generally have totaly diffrent
sematincs compared to the parts. Here is a classic school example:

"En brun hårig sjuk sköterska" 
A brown hairy sick care taker

"En brunhårig sjuksköterska"
A brunette nurse

Thus it is not very helpful to index the composite parts by them self.
It is really a problem to be handled by a spell checker. So I wrote and
posted the Jira issue 626, an adaptive session analysing spell checker.
It makes recommendations based on how previous users changed their
queries. davinci -> da vinci. heroes iii -> heroes 3. And so on.

This strategy does however require quite some user traffic.

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message