lucene-general mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: special characters "ø" indexing/searching
Date Mon, 22 Nov 2010 20:20:17 GMT
On Mon, Nov 22, 2010 at 2:58 PM, Chris Hostetter
<hossman_lucene@fucit.org> wrote:
> Can you elaborate on that, because it's definitely something that i'm
> getting more and more confused by, so i'm sure other people are confused
> as well.
>
> what is an example of a situation where you "must" change stuff before the
> tokenizer?  the HTML Stripper is the one example i understand, but the
> purpose of the mapping char filter no longer makes sense to me in light of
> this thread.

I think any situation where you need to out-smart your tokenizer. For
example, maybe you have a tokenizer such as StandardTokenizer and you
are pretty happy with how it works overall, but you want to customize
how some specific characters behave.

In such a situation you could modify the rules and re-build your own
tokenizer with javacc, but perhaps it's easier to simply map some of
the characters before tokenization with a CharFilter.
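In Solr, this kind of pre-tokenization mapping is typically wired into the
fieldtype with MappingCharFilterFactory. A sketch (the fieldtype name and
mapping filename here are made up):

```xml
<fieldType name="text_mapped" class="solr.TextField">
  <analyzer>
    <!-- the charFilter runs before the tokenizer ever sees the text -->
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-chars.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>
```

where mapping-chars.txt contains one rule per line, each of the form
"source" => "target", for the characters you want remapped before
tokenization.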

One example: for the Persian language you might want your tokenizer to
split on the zero-width non-joiner
(http://en.wikipedia.org/wiki/Zero-width_non-joiner) character. But in
Unicode this is technically a "format" character... in most languages
it's only a hint to the rendering engine to inhibit the character from
forming a ligature with succeeding characters, and you *usually*
shouldn't break around it with the tokenizer [instead, typically keep
it as part of the token and normalize it away in a tokenfilter, since
it's a "default ignorable"].

But for Persian you typically want to break around it, in case someone
uses this character instead of a regular space for affixes/compounds
(see here for more information:
http://128.187.33.4/persian/persianword/zwnj.htm); usually you then
also add many of these affixes (such as the plural -ha) to your
stopword list.

So in this case you can override the tokenizer's behavior for this
language by just normalizing this character to a regular space with
MappingCharFilter; then your tokenizer will split around it.
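To make the effect concrete, here's a minimal plain-Java sketch of the idea
(not the Lucene API itself; the class, method names, and sample token are
made up for illustration): mapping ZWNJ to a space before a
whitespace-style tokenizer changes where the token boundaries fall.

```java
import java.util.Arrays;
import java.util.List;

public class ZwnjMapDemo {
    // U+200C ZERO WIDTH NON-JOINER
    static final char ZWNJ = '\u200C';

    // Stand-in for a MappingCharFilter rule: map ZWNJ to a plain space
    // *before* the text ever reaches the tokenizer.
    static String mapZwnjToSpace(String input) {
        return input.replace(ZWNJ, ' ');
    }

    // Stand-in for a whitespace tokenizer (Java's \s does not match ZWNJ,
    // so without the mapping step the ZWNJ stays inside one token).
    static List<String> tokenize(String input) {
        return Arrays.asList(input.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        // e.g. a Persian-style plural written with ZWNJ instead of a space
        String text = "ketab" + ZWNJ + "ha";
        System.out.println(tokenize(text).size());          // 1 token: ZWNJ kept inside
        System.out.println(tokenize(mapZwnjToSpace(text))); // [ketab, ha]
    }
}
```

The same before/after contrast is what you get in an analysis chain: with
the char filter in front, the tokenizer sees a space and splits; without
it, the ZWNJ rides along inside the token.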
