lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <>
Subject [jira] Commented: (LUCENE-2843) Add variable-gap terms index impl.
Date Thu, 03 Feb 2011 15:24:28 GMT


Michael McCandless commented on LUCENE-2843:

bq. Thank you. I will use the FixedGap-version myself, but that only works when I'm the one
controlling the index build, right?

Right, but that seems fair?  I mean, it's easy (in Lucene 4.0) to pick the appropriate codec
per field.  So, if people want to use your faceting package, and you explain that it requires
using a certain Codec, that seems OK?

bq. As for the faceting system, the principle is really simple: Instead of holding terms (BytesRefs)
in memory, I just hold their ordinals. As the terms themselves only need to be resolved when
the final faceting result is to be returned, seeking for a few hundred or thousand terms by
their ordinal has worked very well so far (no guarantees for old hardware such as spinning
disks though).
OK that makes sense... impressive that seeking up to a few thousand terms is giving you good
perf.  You could also load DocTermsIndex in FieldCache, but of course then all terms data
& ords are RAM resident (and the point of LUCENE-2369 is to have low memory overhead).
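
To make the ordinal-based approach concrete, here is a minimal sketch in plain Java. It is not the faceting package under discussion and uses no Lucene classes: OrdinalFacetSketch, topTerms, docOrds and ordToTerm are all hypothetical names, and the ordToTerm function merely stands in for an ord-based seek into the terms dictionary. It assumes a single-valued field with one ordinal per document.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.IntFunction;

/** Sketch only: counts facet values by term ordinal and resolves terms lazily. */
final class OrdinalFacetSketch {

    /**
     * Counts occurrences per ordinal, then resolves only the top-n ordinals to terms.
     * docOrds holds one ordinal per document (assumed single-valued field).
     * ordToTerm is a stand-in for seeking a term by its ordinal.
     */
    static List<String> topTerms(int[] docOrds, int ordCount, int n,
                                 IntFunction<String> ordToTerm) {
        // Only ints are held while counting; no term bytes are touched here.
        int[] counts = new int[ordCount];
        for (int ord : docOrds) {
            counts[ord]++;
        }

        List<Integer> ords = new ArrayList<>();
        for (int ord = 0; ord < ordCount; ord++) {
            if (counts[ord] > 0) {
                ords.add(ord);
            }
        }
        // Highest count first; ties broken by ascending ordinal.
        ords.sort(Comparator.<Integer>comparingInt(o -> -counts[o])
                .thenComparingInt(o -> o));

        // Terms are resolved only for the entries that are actually returned.
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(n, ords.size()); i++) {
            int ord = ords.get(i);
            result.add(ordToTerm.apply(ord) + "=" + counts[ord]);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] sortedTerms = {"apple", "banana", "cherry", "date"};
        int[] docOrds = {2, 0, 2, 3, 2, 0};
        System.out.println(topTerms(docOrds, sortedTerms.length, 2, ord -> sortedTerms[ord]));
    }
}
```

The point of the sketch is only that the number of term lookups is bounded by the size of the returned result, not by the number of distinct terms counted.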

> Add variable-gap terms index impl.
> ----------------------------------
>                 Key: LUCENE-2843
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 4.0
>         Attachments: LUCENE-2843.patch, LUCENE-2843.patch
>
> PrefixCodedTermsReader/Writer (used by all "real" core codecs) already
> supports pluggable terms index impls.
>
> The only impl we have now is FixedGapTermsIndexReader/Writer, which
> picks every Nth (default 32) term and holds it in efficient packed
> int/byte arrays in RAM.  This is already an enormous improvement (RAM
> reduction, init time) over 3.x.
>
> This patch adds another impl, VariableGapTermsIndexReader/Writer,
> which lets you specify an arbitrary IndexTermSelector to pick which
> terms are indexed, and then uses an FST to hold the indexed terms.
> This is typically even more memory efficient than packed int/byte
> arrays, though, it does not support ord() so it's not quite a fair
> comparison.
>
> I had to relax the terms index plugin api for
> PrefixCodedTermsReader/Writer to not assume that the terms index impl
> supports ord.
>
> I also did some cleanup of the FST/FSTEnum APIs and impls, and broke
> out separate seekCeil and seekFloor in FSTEnum.  Eg we need seekFloor
> when the FST is used as a terms index but seekCeil when it's holding
> all terms in the index (ie which SimpleText uses FSTs for).
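
The floor/ceil distinction in the description can be illustrated with a small sketch. This is not the patch's code: a java.util.TreeMap is swapped in for the FST, and TermsIndexSketch and TermSelector are hypothetical names (TermSelector is not the signature of the IndexTermSelector in the patch). A selector picks which terms of a sorted list go into the sparse index; a lookup takes the floor entry (greatest indexed term <= target) and scans forward from there. When every term is selected, a ceiling lookup returns the first term >= target directly.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch only: a sparse terms index over a sorted term list, with a TreeMap instead of an FST. */
final class TermsIndexSketch {

    /** Hypothetical selector; NOT the signature of the patch's IndexTermSelector. */
    interface TermSelector {
        boolean isIndexTerm(String term, int position);
    }

    private final List<String> sortedTerms;
    // Maps each indexed term to its position in the full sorted term list.
    private final TreeMap<String, Integer> index = new TreeMap<>();

    TermsIndexSketch(List<String> sortedTerms, TermSelector selector) {
        this.sortedTerms = sortedTerms;
        for (int i = 0; i < sortedTerms.size(); i++) {
            // Always index the first term so every later target has a floor entry.
            if (i == 0 || selector.isIndexTerm(sortedTerms.get(i), i)) {
                index.put(sortedTerms.get(i), i);
            }
        }
    }

    /** Floor seek: position of the greatest indexed term <= target, or 0 if none. */
    int scanStart(String target) {
        Map.Entry<String, Integer> floor = index.floorEntry(target);
        return floor == null ? 0 : floor.getValue();
    }

    /** Scans forward from the floor entry; returns target's position or -1 if absent. */
    int seekExact(String target) {
        for (int i = scanStart(target); i < sortedTerms.size(); i++) {
            int cmp = sortedTerms.get(i).compareTo(target);
            if (cmp == 0) {
                return i;
            }
            if (cmp > 0) {
                break; // passed the spot where target would be
            }
        }
        return -1;
    }

    /** Ceiling seek over the indexed terms: smallest indexed term >= target, or null. */
    String ceilIndexed(String target) {
        return index.ceilingKey(target);
    }

    public static void main(String[] args) {
        List<String> terms = List.of("a", "b", "c", "d", "e", "f", "g");
        // Fixed gap of 3 expressed as a selector; any other predicate gives a variable gap.
        TermsIndexSketch sparse = new TermsIndexSketch(terms, (t, pos) -> pos % 3 == 0);
        System.out.println(sparse.scanStart("e") + " " + sparse.seekExact("e"));
        // With all terms indexed, the ceiling lookup alone answers "first term >= target".
        TermsIndexSketch full = new TermsIndexSketch(terms, (t, pos) -> true);
        System.out.println(full.ceilIndexed("ee"));
    }
}
```

A fixed gap is just one possible selector here; the same structure accepts any predicate, which is the flexibility the description attributes to the variable-gap impl.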
