lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Chad Small" <Chad.Sm...@definityhealth.com>
Subject RE: spanish stemmer
Date Mon, 23 Aug 2004 21:12:16 GMT
One more question to the group.  From what I have gathered, my choices for indexing and querying
Spanish content are:

1.  StandardAnalyzer (I read that this analyzer could be used for "European" languages)

2.  SnowballAnalyzer("Spanish", SPANISH_STOP_WORDS);  <--custom stop words from Ernesto
class below

Can I assume that choice 2 would be the better for Spanish content?

thanks,
chad.



-----Original Message-----
From: Ernesto De Santis [mailto:ernesto.desantis@colaborativa.net]
Sent: Monday, August 23, 2004 3:31 PM
To: Lucene Users List
Subject: Re: spanish stemmer 


Because the SnowballAnalyzer, and SpanishStemmer don´t have a default
stopword set.

SnowballAnalyzer constructor:

  /** Builds the named analyzer with no stop words. */
  public SnowballAnalyzer(String name) {
    this.name = name;
  }

Note the comment.

Bye,
Ernesto.

----- Original Message ----- 
From: "Chad Small" <Chad.Small@definityhealth.com>
To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Sent: Monday, August 23, 2004 4:57 PM
Subject: RE: spanish stemmer


Excellent Ernesto.

Was there a reason you used your own stop word list and not just the default
constructor SnowballAnalyzer("Spanish")?

thanks,
chad.

-----Original Message-----
From: Ernesto De Santis [mailto:ernesto.desantis@colaborativa.net]
Sent: Monday, August 23, 2004 2:03 PM
To: Lucene Users List
Subject: Re: spanish stemmer


Yes, is too easy.

You need do a wrapper for spanish Snowball initilization.

analyzer = new SnowballAnalyzer("Spanish", SPANISH_STOP_WORDS);

above the complete code.

Bye, Ernesto.


--------------------------------------------------
public class SpanishAnalyzer extends Analyzer {

private static SnowballAnalyzer analyzer;


private String SPANISH_STOP_WORDS[] = {

"un", "una", "unas", "unos", "uno", "sobre", "todo", "también", "tras",
"otro", "algún", "alguno", "alguna",

"algunos", "algunas", "ser", "es", "soy", "eres", "somos", "sois", "estoy",
"esta", "estamos", "estais",

"estan", "en", "para", "atras", "porque", "por qué", "estado", "estaba",
"ante", "antes", "siendo",

"ambos", "pero", "por", "poder", "puede", "puedo", "podemos", "podeis",
"pueden", "fui", "fue", "fuimos",

"fueron", "hacer", "hago", "hace", "hacemos", "haceis", "hacen", "cada",
"fin", "incluso", "primero",

"desde", "conseguir", "consigo", "consigue", "consigues", "conseguimos",
"consiguen", "ir", "voy", "va",

"vamos", "vais", "van", "vaya", "bueno", "ha", "tener", "tengo", "tiene",
"tenemos", "teneis", "tienen",

"el", "la", "lo", "las", "los", "su", "aqui", "mio", "tuyo", "ellos",
"ellas", "nos", "nosotros", "vosotros",

"vosotras", "si", "dentro", "solo", "solamente", "saber", "sabes", "sabe",
"sabemos", "sabeis", "saben",

"ultimo", "largo", "bastante", "haces", "muchos", "aquellos", "aquellas",
"sus", "entonces", "tiempo",

"verdad", "verdadero", "verdadera", "cierto", "ciertos", "cierta",
"ciertas", "intentar", "intento",

"intenta", "intentas", "intentamos", "intentais", "intentan", "dos", "bajo",
"arriba", "encima", "usar",

"uso", "usas", "usa", "usamos", "usais", "usan", "emplear", "empleo",
"empleas", "emplean", "ampleamos",

"empleais", "valor", "muy", "era", "eras", "eramos", "eran", "modo", "bien",
"cual", "cuando", "donde",

"mientras", "quien", "con", "entre", "sin", "trabajo", "trabajar",
"trabajas", "trabaja", "trabajamos",

"trabajais", "trabajan", "podria", "podrias", "podriamos", "podrian",
"podriais", "yo", "aquel", "mi",

"de", "a", "e", "i", "o", "u"};

public SpanishAnalyzer() {

analyzer = new SnowballAnalyzer("Spanish", SPANISH_STOP_WORDS);

}

public SpanishAnalyzer(String stopWords[]) {

analyzer = new SnowballAnalyzer("Spanish", stopWords);

}

public TokenStream tokenStream(String fieldName, Reader reader) {

return analyzer.tokenStream(fieldName, reader);

}

}





---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message