lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Ernesto De Santis" <ernesto.desan...@colaborativa.net>
Subject Re: spanish stemmer
Date Mon, 23 Aug 2004 21:59:10 GMT
Hi Chad

> One more question to the group.  From what I have gathered, my choices for
indexing and querying Spanish content are:

> 1.  StandardAnalyzer (I read that this analyzer could be used for
"European" languages)

The StandardAnalyzer not is for European languages, is like a generic
analyzer.

> 2.  SnowballAnalyzer("Spanish", SPANISH_STOP_WORDS);  <--custom stop words
from Ernesto class below

> Can I assume that choice 2 would be the better for Spanish content?

Yes, is too better.

For example:
In StandardAnalyzer, caminar, caminantes, camino, etc, are differents words,
only return hit if the match is exactly.
In SpanishAnalyzer, are the same word. This three words are conjugations of
caminar. If in your index, one document have the word "caminante", you can
get the hit with the differents conjugations of this verb.

The operation of stemmers is strip the words according to the rules of the
language (spanish for us).
caminar, caminantes, camino are stored as camin. (Camin not exist in
spanish).

This improvement the quality of hits


>thanks,
> chad.

Bye, Ernesto.


-----Original Message-----
From: Ernesto De Santis [mailto:ernesto.desantis@colaborativa.net]
Sent: Monday, August 23, 2004 3:31 PM
To: Lucene Users List
Subject: Re: spanish stemmer


Because the SnowballAnalyzer, and SpanishStemmer don´t have a default
stopword set.

SnowballAnalyzer constructor:

  /** Builds the named analyzer with no stop words. */
  public SnowballAnalyzer(String name) {
    this.name = name;
  }

Note the comment.

Bye,
Ernesto.

----- Original Message ----- 
From: "Chad Small" <Chad.Small@definityhealth.com>
To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Sent: Monday, August 23, 2004 4:57 PM
Subject: RE: spanish stemmer


Excellent Ernesto.

Was there a reason you used your own stop word list and not just the default
constructor SnowballAnalyzer("Spanish")?

thanks,
chad.

-----Original Message-----
From: Ernesto De Santis [mailto:ernesto.desantis@colaborativa.net]
Sent: Monday, August 23, 2004 2:03 PM
To: Lucene Users List
Subject: Re: spanish stemmer


Yes, is too easy.

You need do a wrapper for spanish Snowball initilization.

analyzer = new SnowballAnalyzer("Spanish", SPANISH_STOP_WORDS);

above the complete code.

Bye, Ernesto.


--------------------------------------------------
public class SpanishAnalyzer extends Analyzer {

private static SnowballAnalyzer analyzer;


private String SPANISH_STOP_WORDS[] = {

"un", "una", "unas", "unos", "uno", "sobre", "todo", "también", "tras",
"otro", "algún", "alguno", "alguna",

"algunos", "algunas", "ser", "es", "soy", "eres", "somos", "sois", "estoy",
"esta", "estamos", "estais",

"estan", "en", "para", "atras", "porque", "por qué", "estado", "estaba",
"ante", "antes", "siendo",

"ambos", "pero", "por", "poder", "puede", "puedo", "podemos", "podeis",
"pueden", "fui", "fue", "fuimos",

"fueron", "hacer", "hago", "hace", "hacemos", "haceis", "hacen", "cada",
"fin", "incluso", "primero",

"desde", "conseguir", "consigo", "consigue", "consigues", "conseguimos",
"consiguen", "ir", "voy", "va",

"vamos", "vais", "van", "vaya", "bueno", "ha", "tener", "tengo", "tiene",
"tenemos", "teneis", "tienen",

"el", "la", "lo", "las", "los", "su", "aqui", "mio", "tuyo", "ellos",
"ellas", "nos", "nosotros", "vosotros",

"vosotras", "si", "dentro", "solo", "solamente", "saber", "sabes", "sabe",
"sabemos", "sabeis", "saben",

"ultimo", "largo", "bastante", "haces", "muchos", "aquellos", "aquellas",
"sus", "entonces", "tiempo",

"verdad", "verdadero", "verdadera", "cierto", "ciertos", "cierta",
"ciertas", "intentar", "intento",

"intenta", "intentas", "intentamos", "intentais", "intentan", "dos", "bajo",
"arriba", "encima", "usar",

"uso", "usas", "usa", "usamos", "usais", "usan", "emplear", "empleo",
"empleas", "emplean", "ampleamos",

"empleais", "valor", "muy", "era", "eras", "eramos", "eran", "modo", "bien",
"cual", "cuando", "donde",

"mientras", "quien", "con", "entre", "sin", "trabajo", "trabajar",
"trabajas", "trabaja", "trabajamos",

"trabajais", "trabajan", "podria", "podrias", "podriamos", "podrian",
"podriais", "yo", "aquel", "mi",

"de", "a", "e", "i", "o", "u"};

public SpanishAnalyzer() {

analyzer = new SnowballAnalyzer("Spanish", SPANISH_STOP_WORDS);

}

public SpanishAnalyzer(String stopWords[]) {

analyzer = new SnowballAnalyzer("Spanish", stopWords);

}

public TokenStream tokenStream(String fieldName, Reader reader) {

return analyzer.tokenStream(fieldName, reader);

}

}





---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org





---
Outgoing mail is certified Virus Free.
Checked by AVG anti-virus system (http://www.grisoft.com).
Version: 6.0.737 / Virus Database: 491 - Release Date: 11/08/2004


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message