lucene-dev mailing list archives

From Andrzej Bialecki ...@getopt.org>
Subject Re: N-gram layer
Date Sat, 13 Mar 2004 12:06:25 GMT
karl wettin wrote:
> On Sun, 1 Feb 2004 13:12:32 -0800 (PST)
> Otis Gospodnetic <otis_gospodnetic@yahoo.com> wrote:
> 
> 
>>Looking forward to the contribution.
> 
> 
> Sorry for the delay, but I've had quite some workload lately, and then I
> moved between apartments. I'm back and I'm ready to spend some time.
> 
> I gave up detecting the language of a query. It is very possible indeed
> and I got great results with Weka, but it takes too much time: 5-50 seconds
> on my Pentium M.
> 
> However, I'm still working on the "autoanalytic stemmer", at least in my
> head. I've started to feed my index with documents tagged with the
> language in a field, and thought it should analyze (still the n-gram
> approach) all words of a specific language to find stemming rules for
> each and every language. The output can be used for per-language stemming,
> BUT hopefully I'll be able to use this data to create my generic
> stemmer.
> 
> The language models and inflectional form extraction should be based on
> the index content, but I can't seem to find out how to access the terms
> of a specific set of documents. Of course, I could just query my index
> and start working on the data, building my own trie-pattern, but I'm 
> sure I don't have to.

Please take a look at http://www.egothor.org, and its stemmer package -
it does exactly this, and it's based on solid research... :-) In my
experience, the stemmers built with this package work exceptionally
well, even for complex inflection-rich languages like the Slavic family.

However, you always need to know the language of the document in advance
- my belief is that it's impossible to build a "universal stemmer good
for any language".
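
To make the point about knowing the language up front a bit more
concrete, here is a minimal sketch of choosing a stemmer by the language
tag stored with each document. StemmerRegistry and toyEnglish are
hypothetical names made up for this illustration; they are not part of
Lucene or Egothor, and the "strip a trailing s" rule is only a stand-in
for a real, trained per-language stemmer.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Hypothetical registry that selects a stemmer by a document's language tag. */
public class StemmerRegistry {
    private final Map<String, UnaryOperator<String>> stemmers = new HashMap<>();

    /** Associates a language code (e.g. "en") with a stemming function. */
    public void register(String language, UnaryOperator<String> stemmer) {
        stemmers.put(language.toLowerCase(Locale.ROOT), stemmer);
    }

    /** Stems a term with the stemmer for the given language; fails if none is registered. */
    public String stem(String language, String term) {
        UnaryOperator<String> stemmer = stemmers.get(language.toLowerCase(Locale.ROOT));
        if (stemmer == null) {
            // No universal fallback: an unknown language is reported, not guessed.
            throw new IllegalArgumentException("No stemmer for language: " + language);
        }
        return stemmer.apply(term);
    }

    /** Toy rule for illustration only: drops one trailing "s" from words longer than 3 characters. */
    public static String toyEnglish(String term) {
        return term.length() > 3 && term.endsWith("s")
                ? term.substring(0, term.length() - 1)
                : term;
    }

    public static void main(String[] args) {
        StemmerRegistry registry = new StemmerRegistry();
        registry.register("en", StemmerRegistry::toyEnglish);
        System.out.println(registry.stem("en", "stemmers")); // prints "stemmer"
    }
}
```

The registry deliberately throws for an unregistered language instead of
falling back to some generic rule, which mirrors the argument above: the
language has to be known before a stemmer can be picked.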

-- 
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)




