lucene-dev mailing list archives

From Mehran Mehr <meh...@sharif.edu>
Subject Re: RussianAnalyzer
Date Sun, 25 Aug 2002 07:04:11 GMT
Dear Boris,

I've been following your efforts in the Snowball project. The Russian
stemmer was one of the latest stemmers added to Snowball. Bravo.

On Sat, 24 Aug 2002, Boris Okner wrote:

> While ICU is a great project,
> 1) AFAIK there are no such things as stop-words filtering and stemming. Of
> course, one might be able to write language-specific transliteration rules
> covering these features for ICU(I have no idea how hard it is), but why
> Lucene should be relying on ICU(or ICU contributors)?

I'm not an ICU developer. I'm contributing these 30 lines of code, which
I wrote myself, to Lucene :)

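import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.Token;
import java.io.Reader;
import java.io.IOException;
import com.ibm.icu.lang.UCharacter;

/** Splits input into maximal runs of Unicode letters and digits. */
public class SimpleTokenizer extends Tokenizer {
  private int pos = 0;  // number of chars consumed from input so far

  public SimpleTokenizer(Reader r) {
    input = r;
  }

  // Reads one char and keeps pos in step; returns -1 at end of input.
  private int read() throws IOException {
    int c = input.read();
    if (c >= 0) pos++;
    return c;
  }

  public Token next() throws IOException {
    StringBuffer s = new StringBuffer();
    int i = read();
    // Skip anything that is not a letter or digit.
    while ((i >= 0) && !UCharacter.isLetterOrDigit(i)) {
      i = read();
    }
    int start = pos - 1;  // offset of the first char of the token
    // Collect the run; each char is appended as its hex code plus "-".
    while ((i >= 0) && UCharacter.isLetterOrDigit(i)) {
      s.append(Integer.toString(i, 16));
      s.append("-");
      i = read();
    }
    if (s.length() > 0) {
      // The char that ended the run (if any) has already been consumed.
      int end = (i >= 0) ? pos - 1 : pos;
      return new Token(s.toString(), start, end);
    } else {
      return null;  // end of input
    }
  }
}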

> Lucene users want working analyzers now, so why make them wait for ICU
> before it could be practically usable?

I agree with you that we should not keep Lucene users waiting.
To reach this goal we should broaden the alternatives available to them.
Restricting them to what the Lucene community produces gives them
fewer choices.

> 2) ICU supports Unicode only, but in reality, vast amount of Cyrillic-based
> software still uses (and I dare to say, will use), non-Unicode encodings.
> While it's not a big problem to convert to Unicode for indexing/search,
> converting back and forth introduces significant inefficiency.

Did I say "We should migrate to ICU"?
I only said "I've written a universal analyser using ICU".
It would be more accurate to say "I've written a universal tokenizer
using ICU".

And about Unicode: proprietary encodings are beginning to decline.
As open source developers we should accelerate this process to avoid
vendor lock-in. So we can embed this universal tokenizer in the Lucene
code base, and it is always possible to write tokenizers for other
encodings.
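
A minimal sketch of the encoding point, using only the JDK: the legacy
encoding is decoded once at the Reader boundary, and the tokenizing loop
only ever sees Unicode characters. The class name LegacyEncodingDemo, the
choice of KOI8-R, and the use of Character.isLetterOrDigit in place of
ICU's UCharacter.isLetterOrDigit are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of Lucene or ICU.
class LegacyEncodingDemo {

  // Decodes bytes in the named charset, then splits on non-letter/digit chars.
  static List<String> tokenize(byte[] bytes, String charsetName) {
    Reader in = new InputStreamReader(
        new ByteArrayInputStream(bytes), Charset.forName(charsetName));
    List<String> tokens = new ArrayList<String>();
    StringBuilder s = new StringBuilder();
    try {
      int c;
      while ((c = in.read()) >= 0) {
        if (Character.isLetterOrDigit(c)) {
          s.append((char) c);          // still inside a token
        } else if (s.length() > 0) {
          tokens.add(s.toString());    // separator ends the current token
          s.setLength(0);
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    if (s.length() > 0) {
      tokens.add(s.toString());        // token that runs to end of input
    }
    return tokens;
  }

  public static void main(String[] args) {
    // Two Russian words ("privet, mir"), escaped so the source stays ASCII.
    String text = "\u043f\u0440\u0438\u0432\u0435\u0442, \u043c\u0438\u0440";
    byte[] koi8 = text.getBytes(Charset.forName("KOI8-R"));
    System.out.println(tokenize(koi8, "KOI8-R"));
  }
}
```

Only the charset name passed to the Reader changes per encoding; the
tokenizing loop stays the same.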

> > We can add an Snowball API to Lucene.
> Not sure, what that means? Every stemming algorithm in Snowball is described
> in terms of Snowball language, but there is no universal stemming API for
> all languages.

There is no need to add such an API; one is already available.
The filtering mechanism effectively makes it possible to plug in other
text processing mechanisms.
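
As a rough sketch of the shape of that filtering mechanism, the following
stand-alone Java mimics the wrap-and-rewrite pattern of Lucene's
TokenStream/TokenFilter without depending on Lucene. TermSource,
ListSource, StemmingFilter, FilterChainDemo and the toy stripPluralS rule
are hypothetical stand-ins; a real stemmer (for example one generated from
Snowball) would take the place of the toy rule.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a token stream: yields terms, then null.
interface TermSource {
  String next();
}

// Feeds a fixed list of terms, playing the role of a tokenizer.
class ListSource implements TermSource {
  private final Iterator<String> it;
  ListSource(List<String> terms) { it = terms.iterator(); }
  public String next() { return it.hasNext() ? it.next() : null; }
}

// Hypothetical stand-in for a filter: wraps a source and rewrites each term.
class StemmingFilter implements TermSource {
  private final TermSource input;
  StemmingFilter(TermSource input) { this.input = input; }

  public String next() {
    String term = input.next();
    return term == null ? null : stripPluralS(term);
  }

  // Toy rule for illustration only; not a real stemming algorithm.
  static String stripPluralS(String term) {
    return (term.length() > 3 && term.endsWith("s"))
        ? term.substring(0, term.length() - 1)
        : term;
  }
}

class FilterChainDemo {
  // Pulls every term out of a source until it returns null.
  static List<String> drain(TermSource src) {
    List<String> out = new ArrayList<String>();
    for (String t = src.next(); t != null; t = src.next()) {
      out.add(t);
    }
    return out;
  }

  public static void main(String[] args) {
    TermSource chain = new StemmingFilter(
        new ListSource(Arrays.asList("analyzers", "index", "tokens")));
    System.out.println(drain(chain));
  }
}
```

The tokenizer does not need to know which language-specific processing
follows it; each filter only depends on the source it wraps.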
 
> Boris Okner
> 

Regards
Mehran Mehr

> ----- Original Message -----
> From: "Mehran Mehr" <mehran@sharif.edu>
> To: "Lucene Developers List" <lucene-dev@jakarta.apache.org>
> Sent: Saturday, August 24, 2002 2:20 PM
> Subject: Re: RussianAnalyzer
> 
> 
> > -1
> >
> > I think, there is no need to add analyzers of all languages in the world
> > to Lucene Project, We can add an Snowball API to Lucene.
> >
> > I've wrote a universal (about 30 lines of code) analyser using IBM's ICU.
> > I suggest removing German and English analyzers from Lucene :) and replace
> > them with this universal analyser.
> >
> > On Wed, 21 Aug 2002, Doug Cutting wrote:
> >
> > > This looks great to me.
> > >
> > > Does anyone object to adding this to Lucene as the package
> > > org.apache.lucene.analysis.ru?
> > >
> > > Doug
> > >
> > >
> > > --
> > > To unsubscribe, e-mail:
> <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
> > > For additional commands, e-mail:
> <mailto:lucene-dev-help@jakarta.apache.org>
> > >
> >
> >
> 
> 


