lucene-dev mailing list archives

From "Boris Okner" <b.ok...@rogers.com>
Subject Re: RussianAnalyzer
Date Mon, 26 Aug 2002 03:02:11 GMT

----- Original Message -----
From: "Mehran Mehr" <mehran@sharif.edu>
To: "Lucene Developers List" <lucene-dev@jakarta.apache.org>
Sent: Sunday, August 25, 2002 3:04 AM
Subject: Re: RussianAnalyzer


Hello Mehran,

> Dear Boris,
>
> I've followed up your efforts in Snowball Project, Russian Stemmer was one
> of the last stemmers added to Snowball, Bravo.

No, it wasn't me who added the Russian stemmer to Snowball. I just translated
the algorithm to Java for my own needs, and once it had started to work for
me, I decided to contribute it to Lucene.

> I just said "I've wrote a universal analyser using ICU.".
> It's better to say "I've wrote a universal tokenizer using ICU.".
>

Exactly, it's a tokenizer for tokenizing Unicode words. You could do the
same with StandardTokenizer (although it would have to be built with
UNICODE_INPUT = true). It doesn't do stemming, and it doesn't do stop-word
filtering. Having these two features is, IMHO, what makes up a real
language-specific analyzer, as opposed to a tokenizer.
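To illustrate the distinction, here is a minimal sketch (plain Java, not
Lucene's actual API; the stop-word list and suffix-stripping "stemmer" are
toy placeholders for a real language-specific list and a Snowball stemmer):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class AnalyzerSketch {
    // Illustrative stop words only; a real analyzer ships a full
    // language-specific list.
    private static final Set<String> STOP_WORDS = Set.of("the", "a", "of");

    // Crude suffix-stripping stemmer, standing in for a real
    // Snowball-generated stemmer.
    private static String stem(String token) {
        if (token.endsWith("ing") && token.length() > 5) {
            return token.substring(0, token.length() - 3);
        }
        if (token.endsWith("s") && token.length() > 3) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }

    // A tokenizer alone would stop after the split; the stop-word
    // filtering and stemming below are what make it an "analyzer".
    public static List<String> analyze(String text) {
        List<String> result = new ArrayList<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue; // stop-word filtering
            }
            result.add(stem(token)); // stemming
        }
        return result;
    }

    public static void main(String[] args) {
        // "the" and "of" are dropped; suffixes are stripped.
        System.out.println(analyze("Searching the indexed documents"));
    }
}
```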


> And about Unicode: proprietary encodings is beginning to decline.
Totally disagree. There will always be non-Unicode encodings, because they
are simply more space-efficient, and not every project needs i18n
capabilities. MySQL, for example, has yet to support Unicode, and those who
pay their hosting provider for disk usage won't use Unicode if they have a
choice - why double the bill?
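The doubling is easy to demonstrate: a Cyrillic letter is one byte in a
single-byte encoding like KOI8-R but two bytes in UTF-8 (or UTF-16). A small
check (the word is written with Unicode escapes so the source compiles
regardless of file encoding):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingCost {
    public static void main(String[] args) {
        // "русский" ("Russian"), seven Cyrillic letters.
        String word = "\u0440\u0443\u0441\u0441\u043a\u0438\u0439";

        // One byte per letter in the single-byte KOI8-R encoding.
        int koi8 = word.getBytes(Charset.forName("KOI8-R")).length;

        // Two bytes per Cyrillic letter in UTF-8.
        int utf8 = word.getBytes(StandardCharsets.UTF_8).length;

        System.out.println("KOI8-R: " + koi8 + " bytes, UTF-8: " + utf8 + " bytes");
        // prints "KOI8-R: 7 bytes, UTF-8: 14 bytes"
    }
}
```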

> As open source developers we should accelerate this process to avoid
> vendor lock-in, so we can embed this universal tokenizer in Lucene code
> base, and it is always possible to write tokenizers for other encodings.
As I said, "universal tokenizer" doesn't add much value to Lucene in terms
of handling language-specific problems.

> > > We can add an Snowball API to Lucene.
> > Not sure, what that means? Every stemming algorithm in Snowball is
> > described in terms of Snowball language, but there is no universal
> > stemming API for all languages.
>
> There is no need to add such an API, there is already one available.
> Filtering mechanism effectively make it possible to use other text
> processing mechanisms.

Again, you've lost me completely here. What is this already "available" stemming API?

Best regards,
Boris Okner



--
To unsubscribe, e-mail:   <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>

