lucene-dev mailing list archives

From "Han Jiang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
Date Tue, 13 Aug 2013 11:34:49 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738105#comment-13738105 ]

Han Jiang commented on LUCENE-3069:
-----------------------------------

Hi, we currently have a problem migrating the code to trunk:

The API refactoring of PostingsReader/WriterBase now splits term metadata into two parts:
a monotonic long[] and a generic byte[]. The former is visible to the term dictionary, which
allows better d-gap encoding.
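
To illustrate why the monotonic part benefits from being visible to the term dictionary,
here is a minimal sketch of d-gap (delta) encoding over a fixed-length long[] per term. It is
not the actual PostingsWriterBase API: the class name, the method names, and the sample
values (imagined as file pointers) are all hypothetical, and 'longsSize' is assumed to be
fixed per field as described above.

```java
import java.util.Arrays;

/** Hypothetical sketch; class and method names are illustrative, not Lucene's API. */
public class TermMetadataSketch {

    /** Encode the monotonic slots of one term as gaps against the previous term. */
    static long[] encodeDeltas(long[] previous, long[] current) {
        if (previous.length != current.length) {
            throw new IllegalArgumentException("longsSize must be fixed per field");
        }
        long[] deltas = new long[current.length];
        for (int i = 0; i < current.length; i++) {
            long delta = current[i] - previous[i];
            if (delta < 0) {
                throw new IllegalArgumentException("slot " + i + " is not monotonic");
            }
            deltas[i] = delta; // small non-negative gaps compress well as vLongs
        }
        return deltas;
    }

    /** Rebuild the absolute values from the previous term's values and the gaps. */
    static long[] decodeDeltas(long[] previous, long[] deltas) {
        long[] current = new long[deltas.length];
        for (int i = 0; i < deltas.length; i++) {
            current[i] = previous[i] + deltas[i];
        }
        return current;
    }

    public static void main(String[] args) {
        int longsSize = 2;                 // assumed: read once from the field summary
        long[] previous = new long[longsSize]; // first term is encoded against zeros
        long[][] terms = { {100, 40}, {180, 40}, {260, 95} }; // made-up sample values
        for (long[] term : terms) {
            System.out.println(Arrays.toString(encodeDeltas(previous, term)));
            previous = term;
        }
    }
}
```

The reader can only walk the gaps if it knows how many slots each term has, which is why
a 'longsSize' has to be recorded somewhere the reader can find it.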

So we need a 'longsSize' in the field summary to tell the reader the fixed length of this
monotonic long[]. However, this API change breaks backward compatibility: the old 4.x
indices didn't support this, and for some codecs such as Lucene40, whose writers are already
deprecated, the tests won't pass.

It seems we could put all the metadata in the generic byte[] and let each PBF do its own
buffering (as we did in the old API with nextTerm()). However, we would then have to add
logic for this in every PBF.

So... can we solve this problem more elegantly?
                
> Lucene should have an entirely memory resident term dictionary
> --------------------------------------------------------------
>
>                 Key: LUCENE-3069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3069
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index, core/search
>    Affects Versions: 4.0-ALPHA
>            Reporter: Simon Willnauer
>            Assignee: Han Jiang
>              Labels: gsoc2013
>             Fix For: 5.0, 4.5
>
>         Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch, LUCENE-3069.patch,
> LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch,
> LUCENE-3069.patch
>
>
> FST based TermDictionary has been a great improvement yet it still uses a delta codec
> file for scanning to terms. Some environments have enough memory available to keep the entire
> FST based term dict in memory. We should add a TermDictionary implementation that encodes
> all needed information for each term into the FST (custom fst.Output) and builds a FST from
> the entire term not just the delta.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

