lucene-java-user mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: What is the proper use of stop words in Lucene?
Date Mon, 28 Apr 2014 20:36:08 GMT
Hi,

> > What you intend to do is not a "stopword" use case. You want to "ignore"
> > some words - Lucene has no support for this, because in native language
> > processing this makes no sense.
> 
> Thank you for the information. I was unaware that ignoring some words
> "makes no sense". I thought I gave a reasonable example of exactly this
> situation in the native processing of Tibetan. Perhaps I am still not
> understanding.

Elisions are a bit different from stopwords (although I don't know how they work in Tibetan).
The Tokenizer should *not* split elisions from the terms (initially the term is the full word
including the elision). In most languages those are separated by (for example) an apostrophe
(e.g. French: le + arbre → l’arbre). The Tokenizer keeps those parts together (l’arbre).
A later TokenFilter then edits the token and removes the elision (if needed): arbre. This
is how the French Analyzer in Lucene works.
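
To make the two stages concrete, here is a minimal sketch in plain Java. It deliberately does not use the Lucene API: the class ElisionSketch, its methods, and the ARTICLES list are all hypothetical names made up for illustration, and the article list is an assumption rather than the set Lucene's French analysis uses. The first stage splits on whitespace only, so "l’arbre" survives as one token; the second stage edits each token and drops the elided prefix.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustration only: mimics "tokenize first, strip elisions later" without Lucene.
public class ElisionSketch {

    // Assumed list of elided prefixes; a real analyzer would use its own set.
    static final Set<String> ARTICLES = Set.of("l", "d", "j", "m", "t", "s", "n", "c", "qu");

    // Stage 1 (the "Tokenizer"): split on whitespace only, so the
    // apostrophe does not break "l’arbre" into two tokens.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Stage 2 (the "TokenFilter"): if the part before the first apostrophe
    // is a known elided prefix, keep only the part after it.
    static String stripElision(String token) {
        int i = firstApostrophe(token);
        if (i > 0 && i < token.length() - 1
                && ARTICLES.contains(token.substring(0, i).toLowerCase(Locale.ROOT))) {
            return token.substring(i + 1);
        }
        return token; // no apostrophe, unknown prefix, or nothing after it
    }

    // Accept both the ASCII apostrophe and the typographic one (U+2019).
    static int firstApostrophe(String token) {
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (c == '\'' || c == '\u2019') {
                return i;
            }
        }
        return -1;
    }

    // Run both stages in order, the way an analyzer chains them.
    static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        for (String token : tokenize(text)) {
            out.add(stripElision(token));
        }
        return out;
    }
}
```

Calling ElisionSketch.analyze("l’arbre est grand") gives the tokens arbre, est, grand, while a word such as "aujourd'hui" is left alone because "aujourd" is not in the prefix list. The point of the split is that the decision about what to remove is made on the whole token, not by the tokenizer.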

Lucene currently does not have a Tibetan Analyzer, so you have to write your own (I think
this is what you tried to do). You should choose the Tokenizer carefully and add something
like a TibetanElisionFilter that removes the unwanted parts from the tokens.

Uwe


