lucene-dev mailing list archives

From Doug Cutting <>
Subject Re: Choice of indexed Character set
Date Fri, 09 Aug 2002 18:06:33 GMT
Manish, in the future, please send questions to lucene-dev, not to me 
directly.  Thanks.

Manish Shukla wrote:
> Just wanted to ask you: what logic did we use to choose
> which characters to index when creating the
> StandardTokenizer.jj file?
> We currently use the following ranges as token
> characters, and treat the rest as separators:
>        "\u0041"-"\u005a",
>        "\u0061"-"\u007a",
>        "\u00c0"-"\u00d6",
>        "\u00d8"-"\u00f6",
>        "\u00f8"-"\u00ff",
>        "\u0100"-"\u1fff",
>        "\u3040"-"\u318f",
>        "\u3300"-"\u337f",
>        "\u3400"-"\u3d2d",
>        "\u4e00"-"\u9fff",
>        "\uf900"-"\ufaff"
> Looking at the list, it seems a little arbitrary in
> some respects. We are indexing
> Katakana, Hiragana, Bopomofo, and Hangul Compatibility
> Jamo, but we are skipping some of the characters in the
> Latin-1 Supplement and Latin Extended ranges.
> I am a little confused. I want to index only the 8859
> character set, hence I want to understand the logic. Am I
> missing something?

I don't remember where that came from.  I think it may have been copied 
from the Java 1.0 implementation of Character.isLetter().  It could 
probably stand to be updated.  Please feel free to make a proposal.
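As a side note, the apparent "gaps" in the Latin-1 Supplement ranges above may be less arbitrary than they look: the two excluded code points, U+00D7 and U+00F7, are the multiplication and division signs, which Character.isLetter() also rejects. A quick plain-JDK check (no Lucene classes needed; the class name is just for illustration):

```java
// Verifies that the holes in the ranges \u00c0-\u00d6, \u00d8-\u00f6,
// \u00f8-\u00ff correspond exactly to the two non-letter characters
// in the Latin-1 Supplement letter area.
public class Latin1Gaps {
    public static void main(String[] args) {
        System.out.println(Character.isLetter('\u00D7')); // multiplication sign: false
        System.out.println(Character.isLetter('\u00F7')); // division sign: false
        System.out.println(Character.isLetter('\u00C0')); // A with grave: true
        System.out.println(Character.isLetter('\u00FF')); // y with diaeresis: true
    }
}
```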

If you only want 8859, then you're probably best off writing your own 
tokenizer, perhaps modelling it after StandardTokenizer.
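For a rough idea of what an 8859-only tokenizer could look like, here is a minimal standalone sketch (hypothetical class and method names, plain Java rather than a JavaCC grammar or Lucene's Tokenizer API). It keeps only the Latin-1 subset of the ranges quoted above as token characters and treats everything else as a separator:

```java
import java.util.ArrayList;
import java.util.List;

public class Latin1Tokenizer {
    // True for the ISO 8859-1 letter ranges from the list above:
    // A-Z, a-z, \u00c0-\u00d6, \u00d8-\u00f6, \u00f8-\u00ff.
    static boolean isLatin1Letter(char c) {
        return (c >= 'A' && c <= 'Z')
            || (c >= 'a' && c <= 'z')
            || (c >= '\u00C0' && c <= '\u00D6')
            || (c >= '\u00D8' && c <= '\u00F6')
            || (c >= '\u00F8' && c <= '\u00FF');
    }

    // Splits text into maximal runs of Latin-1 letters;
    // all other characters act as delimiters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (isLatin1Letter(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

A real implementation would also handle digits, apostrophes, acronyms, and so on, as StandardTokenizer.jj does, but the character-class predicate is the piece that changes for an 8859-only index.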

