lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-3206) FST package API refactoring
Date Sat, 18 Jun 2011 09:57:48 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13051491#comment-13051491 ]

Michael McCandless commented on LUCENE-3206:
--------------------------------------------

{quote}
bq. this could be a non-negligible increase in FST size for the non-ascii case I think?

I don't know. If the non-ASCII is encoded as UTF8 for the BytesRef, then storing full unicode
points on transitions shouldn't really account for much more (in fact it may create fewer
states/transitions because multibyte UTF8 sequences will require multiple transitions)? This
we would need to check, of course. And I assume input sequences ARE text, which in general
may not be the case... I think I'll leave BYTE1/BYTE4 an option for now and see if I can improve
on it once I have a working test suite.
{quote}

Ahh, yes, I agree it'd be a more interesting comparison if you use
UTF32 instead of UTF8.

The case I was worried about is if you must use UTF8 (i.e. because
TermsEnum speaks only BytesRef): then writing each of those bytes as a
vInt instead of a fixed byte is a penalty for non-ASCII text, since
any byte value of 0x80 or above needs two bytes as a vInt.
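
To make the size concern concrete, here is a minimal standalone sketch (not Lucene code) that compares the two label encodings for a UTF8 term. It assumes the usual 7-bits-per-byte vInt layout with a continuation bit; the class and method names (VIntLabelCost, vIntLength, fixedByteCost, vIntByteCost) are hypothetical and exist only for this illustration.

```java
import java.nio.charset.StandardCharsets;

public class VIntLabelCost {

    /** Bytes needed to store a non-negative int as a vInt (7 payload bits per byte). */
    public static int vIntLength(int value) {
        int length = 1;
        // Each extra 7 bits of payload costs one more byte.
        while ((value & ~0x7F) != 0) {
            value >>>= 7;
            length++;
        }
        return length;
    }

    /** Label bytes if every UTF8 byte of the term is stored as one fixed byte. */
    public static int fixedByteCost(String term) {
        return term.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Label bytes if every UTF8 byte of the term is stored as a vInt. */
    public static int vIntByteCost(String term) {
        int total = 0;
        for (byte b : term.getBytes(StandardCharsets.UTF_8)) {
            // Mask to get the unsigned byte value 0..255 before measuring it.
            total += vIntLength(b & 0xFF);
        }
        return total;
    }

    public static void main(String[] args) {
        // ASCII term, Latin term with one two-byte character, CJK term.
        String[] terms = {"lucene", "\u00fcber", "\u65e5\u672c\u8a9e"};
        for (String term : terms) {
            System.out.println(term
                + ": fixed=" + fixedByteCost(term)
                + " vInt=" + vIntByteCost(term));
        }
    }
}
```

For pure ASCII the two costs are equal. Every lead and continuation byte of a multibyte UTF8 sequence is 0x80 or above, so under these assumptions each of those bytes doubles in size when written as a vInt.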

{quote}
bq. I think SimpleText codec is a good example? Also VariableGapTermsIndexReader, and MemoryCodec?
Each of these use the BytesRefFSTEnum, I believe.

I wasn't clear -- I can find the places where they're used, but I wanted to clarify the nature
of stored keys and values (are they UTF8 text, utf16, unicode, random bytes)? I can go through
the code, but you're probably a faster source of information on this one. Robert, if you're
reading this -- anything you envision could be stored as transition labels?
{quote}

Ahh... I think all uses have BytesRef (UTF8-encoded term) as the key,
and various things as the values.

I don't think we've used FSTs during analysis yet, but we should try;
then I suspect we'd use UTF16 labels?
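
As a rough illustration of the label choices discussed above (UTF8 bytes, UTF16 code units, full Unicode code points), the sketch below counts how many labels, i.e. transitions along one input path, a term would need under each choice. It uses only java.lang.String and is not tied to the FST package; the class and method names (LabelCounts, utf8Labels, utf16Labels, utf32Labels) are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class LabelCounts {

    /** One label per UTF8 byte. */
    public static int utf8Labels(String term) {
        return term.getBytes(StandardCharsets.UTF_8).length;
    }

    /** One label per UTF16 code unit (a Java char). */
    public static int utf16Labels(String term) {
        return term.length();
    }

    /** One label per Unicode code point. */
    public static int utf32Labels(String term) {
        return term.codePointCount(0, term.length());
    }

    public static void main(String[] args) {
        // ASCII, Latin-1 range, CJK, and one supplementary code point (U+1F600).
        String[] terms = {"lucene", "\u00fcber", "\u65e5\u672c\u8a9e", "\uD83D\uDE00"};
        for (String term : terms) {
            System.out.println(term
                + ": utf8=" + utf8Labels(term)
                + " utf16=" + utf16Labels(term)
                + " utf32=" + utf32Labels(term));
        }
    }
}
```

Fewer labels per term means fewer transitions per path, but each label is drawn from a larger alphabet, so the net effect on FST size would still have to be measured, as the quoted comment says.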


> FST package API refactoring
> ---------------------------
>
>                 Key: LUCENE-3206
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3206
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/FSTs
>    Affects Versions: 3.2
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>            Priority: Minor
>             Fix For: 3.3, 4.0
>
>         Attachments: LUCENE-3206.patch
>
>
> The current API is still marked @experimental, so I think there's still time to fiddle
> with it. I've been using the current API for some time and I do have some ideas for improvement.
> This is a placeholder for these -- I'll post a patch once I have a working proof of concept.


