lucene-java-user mailing list archives

From Denis Brodeur <>
Subject Problems Indexing/Parsing Tibetan Text
Date Fri, 30 Mar 2012 16:46:48 GMT
Hello, I'm currently working through some problems when searching for Tibetan
characters, specifically \u0F10-\u0F19. We are using the StandardAnalyzer
(3.4), and I've narrowed the problem down to StandardTokenizerImpl throwing
these characters away: in getNextToken(), they fall through to
case 1: /* Not numeric, word, ideographic, hiragana, or SE Asian -- ignore it */.
So the question is: is this the expected behaviour, and if it is, what would be
the best way to support code points that the StandardAnalyzer does not
recognize, in a general way?
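
For anyone trying to reproduce this, a minimal sketch (plain JDK, no Lucene) that lists how Java classifies each code point in the range. The class name TibetanCodePoints and the method describeRange are hypothetical names made up for this illustration. The printed type is the integer constant returned by Character.getType, to be compared against the constants on java.lang.Character (for example OTHER_PUNCTUATION or OTHER_SYMBOL). This only shows the JDK's view of the characters; whether it matches the tokenizer's grammar is an assumption to check.

```java
import java.util.ArrayList;
import java.util.List;

class TibetanCodePoints {

    /**
     * Builds one descriptive line per code point in [first, last]:
     * hex value, Unicode name, general category constant, and whether
     * the JDK considers it a letter or digit.
     */
    static List<String> describeRange(int first, int last) {
        List<String> lines = new ArrayList<>();
        for (int cp = first; cp <= last; cp++) {
            lines.add(String.format("U+%04X %s type=%d letterOrDigit=%b",
                    cp,
                    Character.getName(cp),          // may be null for unassigned code points
                    Character.getType(cp),          // compare with Character.* category constants
                    Character.isLetterOrDigit(cp)));
        }
        return lines;
    }

    public static void main(String[] args) {
        // The range from the question above.
        for (String line : describeRange(0x0F10, 0x0F19)) {
            System.out.println(line);
        }
    }
}
```

If the characters turn out to be punctuation or symbols rather than letters, that would be consistent with a word-oriented tokenizer skipping them, but that is a guess until it is confirmed against the tokenizer itself.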