lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Trejkaz <>
Subject Unicode normalisation *before* tokenisation?
Date Mon, 17 Jan 2011 00:37:12 GMT
Hi all.

I discovered there is a normalise filter now, using ICU's Normalizer2
(  However, as
this is a filter, various problems can result if used with

One in particular is half-width Katakana.

Supposing you start out with the following (four Java chars):


StandardTokenizer will break this up into four separate tokens of one
char each.  (I can't show an example as the combining handakuten
doesn't actually render by itself on my system.)

Passing this through the normalising filter converts it back to normal
full-width Katakana, but unfortunately still generates four separate

Thinking that I could avoid this problem if I filtered the content
*before* the tokeniser got to it, I wrote a NormalisingReader and
passed the text through this before the tokeniser could get its hands
on it.  I figured this would also be faster, as normalisation could be
done in chunks of a few kilobytes instead of smaller chunks.
Unfortunately, it doesn't give usable results, because the text
offsets the tokeniser reports are relative to the normalising reader,
not the original text.  This quickly caused issues when trying to
highlight the hits.

There are alternative workarounds for the specific issue of half-width
Katakana, of course:
    1. Write a filter which joins tokens containing a dakuten or
handakuten with the previous token.
    2. Modify the tokeniser itself to make it output such pairs as
single tokens.

I am currently reluctant to either of these, as there are other issues
with not normalising up-front.  For instance, if "½" appeared, it
would be indexed as one token "1/2", and someone searching for it by
typing "1/2" would not find it, as "1/2" would be analysed as two
tokens ("1","2".)  Writing a general filter to join together tokens
which normalise with each other (and split apart those which decompose
into a sequence which won't recompose into anything) seems to be a
significantly difficult task.

So I guess I have two questions:
    1. Is there some way to do filtering to the text before
tokenisation without upsetting the offsets reported by the tokeniser?
    2. Is there some more general solution to this problem, such as an
existing tokeniser similar to StandardTokeniser but with better
Unicode awareness?


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message