lucene-dev mailing list archives

From Robert Muir
Subject Re: add(CharSequence) in automaton builder
Date Fri, 01 Apr 2011 12:21:05 GMT
On Fri, Apr 1, 2011 at 7:58 AM, Dawid Weiss wrote:
> Mike, can you remember what ordering is required for
> add(CharSequence)? I see it requires INPUT_TYPE.BYTE4
> assert fst.getInputType() == FST.INPUT_TYPE.BYTE4;
> but this would imply the order of full unicode codepoints on the
> input? Is this what String comparators do by default (I doubt, but
> wanted to check if you know first).

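(Sorry, not Mike, but) you are right: String.compareTo() compares in
UTF-16 code unit order by default. This is not consistent with the order
the FST builder expects (UTF-8/UTF-32 order, i.e. code point order). The
two orders differ only when a supplementary character (a surrogate pair
in UTF-16) is compared against a BMP character in U+E000..U+FFFF.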
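
To make the mismatch concrete, here is a small sketch that compares the same two strings both ways. The class name Utf16VersusCodePointOrder and the helper compareUtf8 are hypothetical names made up for this illustration; they are not part of Lucene.

```java
import java.nio.charset.StandardCharsets;

class Utf16VersusCodePointOrder {
    // Hypothetical helper: compares two strings by their UTF-8 bytes,
    // treating each byte as unsigned. This matches code point order.
    static int compareUtf8(String a, String b) {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int d = (x[i] & 0xFF) - (y[i] & 0xFF);
            if (d != 0) {
                return d;
            }
        }
        return x.length - y.length;
    }

    public static void main(String[] args) {
        String bmp = "\uFF5E";         // U+FF5E, a BMP character above the surrogate range
        String supp = "\uD835\uDD0A";  // U+1D50A, stored as a surrogate pair in UTF-16

        // UTF-16 order: 0xFF5E > 0xD835, so bmp sorts AFTER supp.
        System.out.println(Integer.signum(bmp.compareTo(supp)));       // prints 1

        // UTF-8 order: EF BD 9E < F0 9D 94 8A, so bmp sorts BEFORE supp.
        System.out.println(Integer.signum(compareUtf8(bmp, supp)));    // prints -1
    }
}
```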

So if you are going to order the terms before passing them to Builder,
you should either use a utf-16-in-utf-8-order comparator* (or simply
use codePointAt and friends and compare those ints, probably

different ways of impl'ing the comparator below:
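
The simpler option mentioned above, walking both inputs with codePointAt and comparing the resulting ints, could be sketched roughly as follows. This is only an illustration under the assumption that the builder wants terms in code point order; the class name CodePointOrder and its COMPARATOR constant are hypothetical and not Lucene API, and no claim is made about how it performs relative to other approaches.

```java
import java.util.Arrays;
import java.util.Comparator;

class CodePointOrder {
    // Hypothetical comparator that orders character sequences by Unicode code point.
    static final Comparator<CharSequence> COMPARATOR = CodePointOrder::compare;

    static int compare(CharSequence a, CharSequence b) {
        int limit = Math.min(a.length(), b.length());
        int i = 0;
        while (i < limit) {
            // codePointAt combines a valid surrogate pair into a single int.
            int cpA = Character.codePointAt(a, i);
            int cpB = Character.codePointAt(b, i);
            if (cpA != cpB) {
                return Integer.compare(cpA, cpB);
            }
            // Equal code points occupy the same number of chars in both inputs.
            i += Character.charCount(cpA);
        }
        // One input is a prefix of the other: the shorter one sorts first.
        return Integer.compare(a.length(), b.length());
    }

    public static void main(String[] args) {
        String[] terms = { "\uD835\uDD0A", "\uFF5E", "a" };
        // Sort before feeding the terms to a builder that expects code point order.
        Arrays.sort(terms, COMPARATOR);
        for (String t : terms) {
            // prints 61, ff5e, 1d50a on separate lines
            System.out.println(Integer.toHexString(t.codePointAt(0)));
        }
    }
}
```

Sorting the same array with the default String ordering would instead place U+1D50A before U+FF5E, which is the case the builder would reject.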
