lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2843) Add variable-gap terms index impl.
Date Thu, 03 Feb 2011 15:24:28 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990123#comment-12990123 ]

Michael McCandless commented on LUCENE-2843:
--------------------------------------------

bq. Thank you. I will use the FixedGap-version myself, but that only works when I'm the one
controlling the index build, right?

Right, but isn't that fair?  In Lucene 4.0 it's easy to pick the appropriate codec
per field.  So if people want to use your faceting package, and you explain that it requires
a certain Codec, that seems OK?

{quote}
As for the faceting system, the principle is really simple: instead of holding terms (BytesRefs)
in memory, I just hold their ordinals. As the terms themselves only need to be resolved when
the final faceting result is to be returned, seeking for a few hundred or thousand terms by
their ordinal has worked very well so far (no guarantees for old hardware such as spinning
disks though).
{quote}
OK, that makes sense... it's impressive that seeking up to a few thousand terms is giving you good
perf.  You could also load DocTermsIndex in FieldCache, but of course then all terms data
and ords are RAM-resident (and the point of LUCENE-2369 is to have low memory overhead).
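
To make the ordinal idea concrete, here is a minimal sketch of counting facets by
ordinal and resolving terms only for the final result. It is a toy model, not Lucene
code: the class name, the SORTED_TERMS array and lookupTerm() are all hypothetical
stand-ins (in the real system, lookupTerm would be the seek-by-ord against the terms
dictionary).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: none of these names are Lucene APIs.
final class OrdinalFacetSketch {
    // Stand-in for the terms dictionary; a term's ordinal is its position in sorted order.
    static final String[] SORTED_TERMS = {"apple", "banana", "cherry", "date"};

    // Hypothetical ord -> term lookup (assumed to be the expensive seek in practice).
    static String lookupTerm(int ord) {
        return SORTED_TERMS[ord];
    }

    // Counts occurrences per ordinal, then resolves only the top n ordinals to terms.
    static List<String> topFacets(int[][] docOrds, int n) {
        final int[] counts = new int[SORTED_TERMS.length];
        for (int[] ords : docOrds) {
            for (int ord : ords) {
                counts[ord]++; // only ints are held in memory while counting
            }
        }
        Integer[] byCount = new Integer[counts.length];
        for (int i = 0; i < byCount.length; i++) {
            byCount[i] = i;
        }
        // Highest count first; ties broken by lower ordinal.
        Arrays.sort(byCount, (a, b) -> counts[a] != counts[b]
                ? Integer.compare(counts[b], counts[a])
                : Integer.compare(a, b));
        List<String> out = new ArrayList<>();
        for (int i = 0; i < Math.min(n, byCount.length); i++) {
            int ord = byCount[i];
            if (counts[ord] == 0) {
                break; // remaining ordinals never occurred
            }
            out.add(lookupTerm(ord) + "=" + counts[ord]); // term resolved only here
        }
        return out;
    }

    public static void main(String[] args) {
        int[][] docOrds = {{0, 2}, {2}, {1, 2}, {0}}; // ordinals per document
        System.out.println(topFacets(docOrds, 2)); // [cherry=3, apple=2]
    }
}
```

The number of lookupTerm calls is bounded by n rather than by the number of distinct
terms, which is why a few hundred or thousand seeks per request can be affordable.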

> Add variable-gap terms index impl.
> ----------------------------------
>
>                 Key: LUCENE-2843
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2843
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 4.0
>
>         Attachments: LUCENE-2843.patch, LUCENE-2843.patch
>
>
> PrefixCodedTermsReader/Writer (used by all "real" core codecs) already
> supports pluggable terms index impls.
> The only impl we have now is FixedGapTermsIndexReader/Writer, which
> picks every Nth (default 32) term and holds it in efficient packed
> int/byte arrays in RAM.  This is already an enormous improvement (RAM
> reduction, init time) over 3.x.
> This patch adds another impl, VariableGapTermsIndexReader/Writer,
> which lets you specify an arbitrary IndexTermSelector to pick which
> terms are indexed, and then uses an FST to hold the indexed terms.
> This is typically even more memory efficient than packed int/byte
> arrays, though it does not support ord(), so it's not quite a fair
> comparison.
> I had to relax the terms index plugin api for
> PrefixCodedTermsReader/Writer to not assume that the terms index impl
> supports ord.
> I also did some cleanup of the FST/FSTEnum APIs and impls, and broke
> out separate seekCeil and seekFloor in FSTEnum.  E.g., we need seekFloor
> when the FST is used as a terms index, but seekCeil when it's holding
> all terms in the index (which is what SimpleText uses FSTs for).
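
The description above can be illustrated with a small toy model. This is a sketch
under stated assumptions, not the patch's implementation: a TreeMap stands in for
the packed arrays or FST, a BiPredicate stands in for IndexTermSelector, and the
class and method names are hypothetical. It shows why a terms index needs floor
semantics (find the greatest indexed term <= the target, then scan forward), while
a structure holding every term is naturally queried with ceil semantics.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.function.BiPredicate;

// Toy model only: none of these names are Lucene APIs.
final class TermsIndexSketch {
    private final List<String> allTerms; // full sorted dictionary (on disk in a real index)
    private final TreeMap<String, Integer> index = new TreeMap<>(); // indexed term -> position

    // selector.test(term, termsSinceLastIndexedTerm) decides whether a term is indexed.
    // The first term is always indexed so every lookup has a starting point.
    TermsIndexSketch(List<String> sortedTerms, BiPredicate<String, Integer> selector) {
        this.allTerms = sortedTerms;
        int sinceLast = 0;
        for (int i = 0; i < sortedTerms.size(); i++) {
            String term = sortedTerms.get(i);
            if (i == 0 || selector.test(term, sinceLast)) {
                index.put(term, i);
                sinceLast = 0;
            }
            sinceLast++;
        }
    }

    int indexedTermCount() {
        return index.size();
    }

    // Returns the position of target in the dictionary, or -1 if it is absent.
    int find(String target) {
        // "seekFloor" on the index: greatest indexed term <= target.
        Map.Entry<String, Integer> start = index.floorEntry(target);
        if (start == null) {
            return -1; // target sorts before the first term
        }
        for (int i = start.getValue(); i < allTerms.size(); i++) {
            int cmp = allTerms.get(i).compareTo(target);
            if (cmp == 0) return i;
            if (cmp > 0) break; // passed the slot where target would be
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList(
                "ant", "bat", "bee", "cat", "cow", "dog", "eel", "fox", "gnu", "hen");

        // Fixed gap: index every 4th term (positions 0, 4, 8).
        TermsIndexSketch fixed = new TermsIndexSketch(terms, (t, since) -> since >= 4);
        System.out.println(fixed.indexedTermCount()); // 3
        System.out.println(fixed.find("dog"));        // 5
        System.out.println(fixed.find("doe"));        // -1

        // Variable gap: an arbitrary policy, here "index when the first letter changes".
        char[] last = {0};
        TermsIndexSketch variable = new TermsIndexSketch(terms, (t, since) -> {
            boolean changed = t.charAt(0) != last[0];
            last[0] = t.charAt(0);
            return changed;
        });
        System.out.println(variable.find("dog"));     // 5

        // With all terms in one structure, a seek wants the smallest term >= target.
        System.out.println(new TreeSet<>(terms).ceiling("doe")); // dog
    }
}
```

Both selectors give the same lookup results; they differ only in which terms stay
resident in memory, which is the trade-off the pluggable selector exposes.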

