lucene-java-user mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: Problems Indexing/Parsing Tibetan Text
Date Fri, 30 Mar 2012 16:57:09 GMT
On Fri, Mar 30, 2012 at 12:46 PM, Denis Brodeur <denisbrodeur@gmail.com> wrote:
> Hello, I'm currently working out some problems when searching for Tibetan
> characters.  More specifically: U+0F10-U+0F19.  We are using the

Unicode doesn't consider most of these characters part of a word: most
are punctuation and symbols (except U+0F18 and U+0F19, which are
combining marks that attach to digits).

For example, U+0F14 is a text delimiter.
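
A quick way to see those classifications is to print the Unicode general
category of each code point with java.lang.Character; a minimal sketch
(plain JDK, no Lucene needed):

    public class TibetanCategories {
        public static void main(String[] args) {
            for (int cp = 0x0F10; cp <= 0x0F19; cp++) {
                // Character.getType returns the Unicode general category:
                // OTHER_PUNCTUATION (24) or OTHER_SYMBOL (28) for most of
                // these, and NON_SPACING_MARK (6) for U+0F18 and U+0F19.
                System.out.printf("U+%04X  %-45s category=%d%n",
                        cp, Character.getName(cp), Character.getType(cp));
            }
        }
    }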

In general, StandardTokenizer discards punctuation and is geared toward
word boundaries, just as you would have trouble searching on characters
like '(' in English. So I think it's totally expected.
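
To see the effect directly, here is a minimal sketch that runs a short
Tibetan string through StandardAnalyzer and prints the tokens. It assumes
a Lucene version where StandardAnalyzer has a no-argument constructor and
Analyzer.tokenStream(String, String) is available (older releases take a
Version argument and a Reader instead):

    import java.io.IOException;

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class TibetanTokenizeDemo {
        public static void main(String[] args) throws IOException {
            // Two Tibetan syllables separated by the tsheg (U+0F0B),
            // followed by the delimiter U+0F14.
            String text = "\u0F56\u0F7C\u0F51\u0F0B\u0F61\u0F72\u0F42\u0F14";

            try (StandardAnalyzer analyzer = new StandardAnalyzer();
                 TokenStream ts = analyzer.tokenStream("field", text)) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    // Only the letter/mark runs come out as tokens; the
                    // punctuation code points never appear in any token.
                    System.out.println(term.toString());
                }
                ts.end();
            }
        }
    }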

-- 
lucidimagination.com


