lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Bruno Roustant (JIRA)" <>
Subject [jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit
Date Tue, 14 May 2019 13:44:00 GMT


Bruno Roustant commented on LUCENE-8753:

Beyond the performance aspects, we developed UniformSplit to be extensible. To give an idea
of how it can be extended, I have added a new PR#676: SharedTerms UniformSplit.

The use-case is when there are many fields. We want to take advantage of the FST property
to share the terms between all the fields, by replacing one FST per field by a single FST
containing the shared terms. In this case each term is stored only once in the block file,
and its block line contains the TermState for each different field for which the term occurs.

term A -> field1 TermState, field2 TermState, field3 TermState

term B -> field3 TermState, field5 TermState

The FST is compact and this posting format also unlocks the possibility to cache when the
same term is searched in many fields (but this is not part of this PR).

My goal here is to showcase the extensibility of this posting format. This extension is in
a separate sub-package sharedterms and is quite concise. (the only tricky part is the custom
merge to merge efficiently two segments by accessing directly the sharedterms posting format)

> New PostingFormat - UniformSplit
> --------------------------------
>                 Key: LUCENE-8753
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>    Affects Versions: 8.0
>            Reporter: Bruno Roustant
>            Assignee: David Smiley
>            Priority: Major
>         Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt
>          Time Spent: 20m
>  Remaining Estimate: 0h
> This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 objectives:
>  - Clear design and simple code.
>  - Easily extensible, for both the logic and the index format.
>  - Light memory usage with a very compact FST.
>  - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance.
> (the pdf attached explains visually the technique in more details)
>  The principle is to split the list of terms into blocks and use a FST to access the
block, but not as a prefix trie, rather with a seek-floor pattern. For the selection of the
blocks, there is a target average block size (number of terms), with an allowed delta variation
(10%) to compare the terms and select the one with the minimal distinguishing prefix.
>  There are also several optimizations inside the block to make it more compact and speed
up the loading/scanning.
> The performance obtained is interesting with the luceneutil benchmark, comparing UniformSplit
with BlockTree. Find it in the first comment and also attached for better formatting.
> Although the precise percentages vary between runs, three main points:
>  - TermQuery and PhraseQuery are improved.
>  - PrefixQuery and WildcardQuery are ok.
>  - Fuzzy queries are clearly less performant, because BlockTree is so optimized for them.
> Compared to BlockTree, FST size is reduced by 15%, and segment writing time is reduced
by 20%. So this PostingsFormat scales to lots of docs, as BlockTree.
> This initial version passes all Lucene tests. Use “ant test -Dtests.codec=UniformSplitTesting”
to test with this PostingsFormat.
> Subjectively, we think we have fulfilled our goal of code simplicity. And we have already
exercised this PostingsFormat extensibility to create a different flavor for our own use-case.
> Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley

This message was sent by Atlassian JIRA

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message