lucene-dev mailing list archives

From Andrzej Bialecki
Subject Re: N-gram layer
Date Sat, 13 Mar 2004 12:06:25 GMT
karl wettin wrote:
> On Sun, 1 Feb 2004 13:12:32 -0800 (PST)
> Otis Gospodnetic wrote:
>>Looking forward to the contribution.
> Sorry for the delay, but I've had quite some workload lately, and then I
> moved between apartments. I'm back and I'm ready to spend some time.
> I gave up detecting the language of a query. It is very possible indeed
> and I got great results with Weka, but it takes too much time: 5-50 seconds
> on my Pentium M.
> However, I'm still working on the "autoanalytic stemmer", at least in my
> head. I've started to feed my index with documents tagged with the
> language in a field, and thought it should analyze (still the n-gram
> approach) all words of a specific language to find stemming rules for
> each and every language. The output can be used for per-language stemming,
> BUT hopefully I'll be able to use this data to create my generic
> stemmer.
> The language models and inflectional form extraction should be based on
> the index content, but I can't seem to find out how to access the terms
> of a specific set of documents. Of course, I could just query my index
> and start working on the data, building my own trie-pattern, but I'm 
> sure I don't have to.

Please take a look at, and its stemmer package -
it does exactly this, and it's based on solid research... :-) In my
experience, the stemmers built with this package work exceptionally
well, even for complex inflection-rich languages like the Slavic family.

However, you always need to know the language of the document in advance
- my belief is that it's impossible to build a "universal stemmer good
for any language".
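
As a rough illustration only (this is not how the stemmer package mentioned
above works, and every name in it is hypothetical), the idea of deriving
per-language rules from language-tagged terms and then requiring the language
at stemming time could be sketched in Python like this. It assumes the terms
have already been pulled out of the index together with their language tag,
and it uses a naive "most frequent word endings" heuristic in place of real
stemming research:

```python
from collections import Counter, defaultdict


def suffix_counts(words, max_len=3):
    """Count word-final character n-grams (suffixes) of length 1..max_len."""
    counts = Counter()
    for word in words:
        for n in range(1, max_len + 1):
            # Only count a suffix if at least 3 characters of stem remain.
            if len(word) > n + 2:
                counts[word[-n:]] += 1
    return counts


def build_suffix_tables(tagged_docs, top_k=5):
    """Build one suffix list per language.

    tagged_docs is an iterable of (language, list_of_terms) pairs; the
    language tag is assumed to be known for every document.
    """
    terms_by_lang = defaultdict(set)
    for lang, terms in tagged_docs:
        terms_by_lang[lang].update(t.lower() for t in terms)
    return {
        # Sort the terms so the result does not depend on set ordering.
        lang: [s for s, _ in suffix_counts(sorted(terms)).most_common(top_k)]
        for lang, terms in terms_by_lang.items()
    }


def stem(word, lang, tables):
    """Strip the longest known suffix for lang; the language must be given."""
    if lang not in tables:
        raise KeyError(f"no suffix table for language {lang!r}")
    word = word.lower()
    for suffix in sorted(tables[lang], key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


docs = [
    ("en", ["walking", "talking", "jumping", "walked"]),
    ("sv", ["bilarna", "husen"]),
]
tables = build_suffix_tables(docs)
print(stem("walking", "en", tables))
```

Note that stem() refuses to guess: without a table for the given language it
raises an error, which mirrors the point above that the language has to be
known in advance.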

Best regards,
Andrzej Bialecki

Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
FreeBSD developer
