lucene-dev mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: add(CharSequence) in automaton builder
Date Fri, 01 Apr 2011 12:21:05 GMT
On Fri, Apr 1, 2011 at 7:58 AM, Dawid Weiss <dawid.weiss@gmail.com> wrote:
> Mike, can you remember what ordering is required for
> add(CharSequence)? I see it requires INPUT_TYPE.BYTE4
>
> assert fst.getInputType() == FST.INPUT_TYPE.BYTE4;
>
> but this would imply the order of full unicode codepoints on the
> input? Is this what String comparators do by default (I doubt, but
> wanted to check if you know first).
>

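(sorry, not Mike, but) you are right: String.compareTo() compares in
UTF-16 code unit order by default. This is not consistent with the
order the FST builder expects (UTF-8/UTF-32 order, i.e. code point
order). The two orders disagree only when a supplementary character
(encoded as a surrogate pair, 0xD800-0xDFFF) is compared against a BMP
character in the range 0xE000-0xFFFF.

So if you are going to sort the terms before passing them to Builder,
you should either use a UTF-16-in-UTF-8-order comparator*, or simply
use codePointAt and friends and compare those ints (probably
slower...).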

Different ways of implementing the comparator are described here:
* http://icu-project.org/docs/papers/utf16_code_point_order.html
* http://www.unicode.org/versions/Unicode6.0.0/ch05.pdf
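
To make the two options above concrete, here is a minimal sketch in plain Java. The class and member names (CodePointOrder, BY_CODE_POINT, compareWithFixup) are hypothetical and are not part of Lucene's API. The first comparator decodes code points and compares them as ints. The second applies the surrogate fix-up trick described in the ICU paper linked above and assumes well-formed UTF-16 input (no unpaired surrogates).

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical helper, not a Lucene class: two ways to compare Java strings
// in code point (UTF-8/UTF-32) order rather than UTF-16 code unit order.
final class CodePointOrder {

    // Option 1: decode code points and compare them as ints.
    static final Comparator<CharSequence> BY_CODE_POINT = (a, b) -> {
        int i = 0;
        int limit = Math.min(a.length(), b.length());
        while (i < limit) {
            int cpA = Character.codePointAt(a, i);
            int cpB = Character.codePointAt(b, i);
            if (cpA != cpB) {
                return cpA < cpB ? -1 : 1;
            }
            // Equal code points occupy the same number of chars in both inputs.
            i += Character.charCount(cpA);
        }
        // One sequence is a prefix of the other: the shorter one sorts first.
        return Integer.compare(a.length(), b.length());
    };

    // Option 2: compare UTF-16 code units, but remap the first differing pair
    // so that surrogates (0xD800-0xDFFF) sort above 0xE000-0xFFFF.
    // Assumes well-formed UTF-16.
    static int compareWithFixup(CharSequence a, CharSequence b) {
        int limit = Math.min(a.length(), b.length());
        for (int i = 0; i < limit; i++) {
            char ca = a.charAt(i);
            char cb = b.charAt(i);
            if (ca != cb) {
                if (ca >= 0xD800 && cb >= 0xD800) {
                    return fix(ca) - fix(cb);
                }
                return ca - cb;
            }
        }
        return a.length() - b.length();
    }

    // Moves 0xE000-0xFFFF down to 0xD800-0xF7FF and surrogates up to 0xF800-0xFFFF.
    private static int fix(char c) {
        return c >= 0xE000 ? c - 0x800 : c + 0x2000;
    }

    public static void main(String[] args) {
        String bmp = "\uFF5E";         // U+FF5E, a single UTF-16 code unit
        String supp = "\uD835\uDD0A";  // U+1D50A, encoded as a surrogate pair

        // UTF-16 order: 0xFF5E > 0xD835, so bmp sorts after supp (positive result).
        System.out.println(bmp.compareTo(supp) > 0);
        // Code point order: 0xFF5E < 0x1D50A, so bmp sorts before supp (negative result).
        System.out.println(BY_CODE_POINT.compare(bmp, supp) < 0);
        System.out.println(compareWithFixup(bmp, supp) < 0);

        // Sorting terms before feeding them to a builder that expects code point order.
        String[] terms = {supp, bmp, "abc"};
        Arrays.sort(terms, BY_CODE_POINT);
        System.out.println(terms[0].equals("abc") && terms[1].equals(bmp));
    }
}
```

The fix-up variant avoids decoding surrogate pairs and only touches the first differing code unit, which is why it is likely the cheaper of the two for long common prefixes.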
