lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <>
Subject [jira] [Commented] (LUCENE-3206) FST package API refactoring
Date Sun, 19 Jun 2011 13:51:47 GMT


Michael McCandless commented on LUCENE-3206:

OK, these results make sense!  UTF32 (vInt labels) is more compact than UTF8 if you disable
array'd arcs.  These wiki terms are from the en export, right?  So the differences are due
to the smallish number of random terms that are not English... the gap should be larger if
we used non-English content.
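
To make the comparison above a bit more concrete, here is a rough sketch (not code from the patch) that counts arcs and label bytes for a single term under the two alphabets. The class and method names are hypothetical, and it assumes one arc per label with the usual 7-bits-per-byte vInt layout. It ignores everything else an arc stores (flags, outputs, targets), so it only hints at where the savings could come from.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only; not part of the Lucene FST API.
public class LabelSizes {

    /** Bytes needed to write a non-negative int as a vInt (7 payload bits per byte). */
    static int vIntLength(int value) {
        int length = 1;
        while ((value & ~0x7F) != 0) {
            value >>>= 7;
            length++;
        }
        return length;
    }

    /** Label bytes if every Unicode code point is one arc with a vInt label. */
    static int utf32LabelBytes(String term) {
        return term.codePoints().map(LabelSizes::vIntLength).sum();
    }

    /** Label bytes if every UTF-8 byte is one arc with a single-byte label. */
    static int utf8LabelBytes(String term) {
        return term.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Number of arcs on the term's path when labels are code points. */
    static int utf32ArcCount(String term) {
        return term.codePointCount(0, term.length());
    }

    public static void main(String[] args) {
        // ASCII, Latin-1 supplement, and Thai sample terms (written as escapes).
        String[] terms = {"lucene", "na\u00EFve", "\u0E44\u0E17\u0E22"};
        for (String term : terms) {
            System.out.printf("%s: utf8 arcs/label bytes=%d, utf32 arcs=%d, utf32 label bytes=%d%n",
                    term, utf8LabelBytes(term), utf32ArcCount(term), utf32LabelBytes(term));
        }
    }
}
```

For pure ASCII terms the two counts are identical, which fits the small difference on mostly-English terms. With code-point labels, any non-ASCII character takes one arc instead of two to four, and code points in the range 2048..16383 also need fewer label bytes as a vInt than as UTF-8.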

I wonder how lookup time would compare... I think UTF32 should be faster?

And yes, for truly binary terms (e.g. collated fields, and maybe eventually numeric fields,
though not yet because I think they still avoid the 8th bit), I think we want to keep BYTE1.

We need some good use cases of FSTs during analysis... there we are free to make the alphabet
non-byte (vs the index, where terms are a BytesRef).

> FST package API refactoring
> ---------------------------
>                 Key: LUCENE-3206
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/FSTs
>    Affects Versions: 3.2
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>            Priority: Minor
>             Fix For: 3.3, 4.0
>         Attachments: LUCENE-3206.patch
> The current API is still marked @experimental, so I think there's still time to fiddle
> with it. I've been using the current API for some time and I do have some ideas for improvement.
> This is a placeholder for these -- I'll post a patch once I have a working proof of concept.
