lucene-java-user mailing list archives

From Michael McCandless <>
Subject Re: Simplifying Lucene 4 storage formats
Date Tue, 26 Mar 2013 20:04:39 GMT
I think you can get the most bang for your buck using the high-level
controls to disable the parts of the index you don't need ... all
codecs respect these.

E.g., index your fields with omitNorms=true, so that no boost or length
normalization data is stored in the index or loaded at search time.

Index with IndexOptions.DOCS_ONLY, so that neither positions nor term
frequencies are stored in the postings lists. This means you cannot run
positional queries, and scoring will not reflect how many times a term
occurs in each doc.
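
To make the trade-off concrete, here is a toy model in plain Java (not
Lucene's actual postings format; the class and method names are made up
for illustration). It contrasts a docs-only postings list with one that
also keeps positions, from which term frequency can be derived.

```java
import java.util.*;

public class PostingsSketch {
    /** DOCS_ONLY-style postings: each term maps to the IDs of the docs containing it. */
    static Map<String, SortedSet<Integer>> docsOnly(String[][] docs) {
        Map<String, SortedSet<Integer>> postings = new TreeMap<>();
        for (int docId = 0; docId < docs.length; docId++) {
            for (String term : docs[docId]) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
        return postings;
    }

    /** Full postings: term -> doc ID -> positions; the list size is the term frequency. */
    static Map<String, SortedMap<Integer, List<Integer>>> withPositions(String[][] docs) {
        Map<String, SortedMap<Integer, List<Integer>>> postings = new TreeMap<>();
        for (int docId = 0; docId < docs.length; docId++) {
            for (int pos = 0; pos < docs[docId].length; pos++) {
                postings.computeIfAbsent(docs[docId][pos], t -> new TreeMap<>())
                        .computeIfAbsent(docId, d -> new ArrayList<>())
                        .add(pos);
            }
        }
        return postings;
    }

    public static void main(String[] args) {
        // Doc 0 contains "a" twice; the docs-only form cannot tell.
        String[][] docs = { {"a", "b", "a"}, {"b", "c"} };
        System.out.println(docsOnly(docs));
        System.out.println(withPositions(docs));
    }
}
```

In the docs-only form, the fact that "a" occurs twice in doc 0, and
where, is simply not recorded, which is why phrase queries and
frequency-based scoring are unavailable.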

Set the CONSTANT_SCORE_AUTO_REWRITE_DEFAULT rewrite method on your
MultiTermQueries (this is the default for most of them, except
FuzzyQuery): this will avoid all scoring at search time.

Don't enable stored fields or term vectors.
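
The settings above could be wired up roughly as follows. This is only a
sketch against the Lucene 4.x API (lucene-core must be on the
classpath); the class name, the helper names and the choice of an
untokenized field are assumptions for illustration, not part of the
original advice.

```java
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.FieldInfo.IndexOptions;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.MultiTermQuery;
import org.apache.lucene.search.WildcardQuery;

public class LiteIndexingSketch {
    // A field type with everything optional switched off.
    static final FieldType LITE_TYPE = new FieldType();
    static {
        LITE_TYPE.setIndexed(true);
        LITE_TYPE.setTokenized(false);                     // assumption: single-token values
        LITE_TYPE.setOmitNorms(true);                      // no boost / length normalization
        LITE_TYPE.setIndexOptions(IndexOptions.DOCS_ONLY); // no freqs, no positions
        LITE_TYPE.setStored(false);                        // no stored fields
        LITE_TYPE.setStoreTermVectors(false);              // no term vectors
        LITE_TYPE.freeze();                                // guard against later changes
    }

    /** Hypothetical helper: builds a field using the stripped-down type. */
    static Field liteField(String name, String value) {
        return new Field(name, value, LITE_TYPE);
    }

    /** Hypothetical helper: a wildcard query that rewrites to a constant-score form. */
    static WildcardQuery constantScoreWildcard(String field, String pattern) {
        WildcardQuery q = new WildcardQuery(new Term(field, pattern));
        q.setRewriteMethod(MultiTermQuery.CONSTANT_SCORE_AUTO_REWRITE_DEFAULT);
        return q;
    }
}
```

Setting the rewrite method explicitly is redundant for WildcardQuery,
where it is already the default, but it documents the intent.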

Then, if these steps are insufficient, consider making a custom codec
that specializes how things are encoded.

Mike McCandless

On Tue, Mar 26, 2013 at 3:00 PM, Vitaly Funstein <> wrote:
> This is probably a pretty general inquiry, but I'm just exploring this as
> an option at the moment.
> It seems that Lucene 4 adds some freedom to define how data is actually
> written to underlying storage by exposing the codec API. However, I find
> the learning curve for understanding what bits to change quite steep, i.e.
> one really needs to get into the guts of storage formats and how data in
> these formats is actually consumed by search queries.
> Is there some type of tutorial, possibly with code samples, that would
> guide me through what needs to be done for specific use cases? Basically,
> what I am looking for is the ability to "turn off" certain features of the
> engine, creating a "lite" version of Lucene's codec that would both cut
> down on the amount of data to persist while indexing, and on query
> execution time. To be a bit more specific, the queries in my case do not go
> beyond NumericRangeQuery, WildcardQuery and TermQuery types, so things like
> similarities, boosts and scoring are not used. So obviously I want to
> preserve the existing functionality while removing support for features I'm
> not using (yet).
> Thanks.
