lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Toke Eskildsen>
Subject Where do I go with ordinals?
Date Sat, 20 Nov 2010 11:49:24 GMT
A while ago I started on LUCENE-2369 with the goal making fast and memory-efficient locale-aware
sorting in Lucene. During this I’ve wandered quite a bit out of course while exploring the
possibilities in ordinals. I’ve discovered that two building blocks enables sorting, (hierarchical)
faceting, index lookup and range searches in a way that supports custom ordering and has the
trade offs “slow initialization, fast execution, low memory overhead”. I would like to
share my thoughts on this and I would like to hear if and how I should move forward with this.

At the core, this is about using ordinals instead of the terms themselves. Index-wide ordinals.
For Lucene trunk, SegmentReaders supports ordinal-to-term resolving directly, albeit not super
optimized. Other IndexReaders does not support this out of the box and there’s the whole
problem of how to handle duplicate terms. As part of LUCENE-2369 I implemented

Building block 1: A map from custom ordered positions to term ordinals, without duplicates.
Example: Let’s look at the terms for a given field split over 3 segments: Segment 1: A,
B, C Segment 2: B, D Segment 3: A, E The index-wide ordered list of the terms would be A,
B, C, D, E. The map from ordered positions to ordinals would be (segment 1, ordinal 0), (s
1, o 2), (s 1, o 3), (s 2, o 1), (s 3, o 2).

This index-level map takes up #logical_positions*log2(#terms) bits (PackedInts to the rescue),
rounded up. For the example above, this is 5*log2(7) bits. Scaling up to 10 million terms,
out of which 5 million is unique, we have 5M*log2(10M) bits = 5M*24 bits = 14 MByte.

If custom order is used, we would like to cache the calculated segment oriented maps for faster
re-opening of the index. As there are no duplicates inside of a segment, the sum of the segment
maps will be a bit more than #terms*log2(#terms) bits. For 10M terms, this is 29 MByte extra,
bringing the total memory requirement up to 43 MByte.

Building the map requires the sorting of the ordinals at segment level (this can be skipped
if the index order is to be used) plus a merge-sort of the segment oriented ordinals to get
the index oriented ordered positions. This map could be used for custom order range-queries
by doing two binary searches and iterating between the points. It could also be used for index
lookup. Of course, one might just make a field with a custom order at index time (long live
flex), so maybe this way of doing it has little value. However, the building block is a stepping
stone for

Building block 2: A map from document IDs to ordered positions for a given field. Building
this map is in principle a matter of stepping through the terms from building block 1 and
requesting all the document IDs for the terms from all segments. Some memory optimization
can be done if there is only 1 term/document, but the general case is handled by keeping an
array of entry points for all document IDs into an array of ordered term positions for the
document IDs.

Memory requirements for this map is #documents*log2(#documents) + #term_references*log2(#logical
terms) bits. For 20M documents with 40M references to 5M unique terms (aka ordered positions)
this is 20M*log2(20M) + 40M*log2(5M) bits = 20M*25 + 40M*23 bits = 228 MByte. On top of that
comes the memory requirements for building block 1.

With building block 1 + 2 we can implement the rest of the functionality:

Sorting is trivial: The order of two documents is determined by comparing the first integer
from building block 2 for each document. We could also specialize building block 2 to exactly
1 term/document to save some memory and gain a little more speed.

Faceting is simple: A counting array (the obvious choice is to use int[] or long[]) of length
“#ordered_positions from building block 1” is created. The document IDs for all the hits
from a search is used to get the ordered positions from building block 2 and the corresponding
entries in the counting array is incremented. Extracting the tags by count is done by iterating
the counting array with a priority queue, finding the top-X tags ordinals, then extracting
the terms themselves with the map from building block 1. Extracting by order is done by iterating
and collecting the indexes of all entries where the count is > x (typically 1). As far
as I understand, this is more or less the approach used by Solr’s field cache faceting.

Hierarchical faceting is more complex. Let’s leave it for another post and just say that
augmenting building block 1 with about 1 byte/logical term gives us what we want. The code
as well as unit tests is in LUCENE-2369 for the curious.

There is working code that we use in-house with a test-setup. It needs cleanup and polishing,
but does not require changes to existing Lucene code. It could be part of Lucene core (for
sorting and range search at least), it could be a contrib, it might belong in Solr (SOLR-64
or maybe SOLR-792), Bobo-browse or somewhere else. My personal preference is to make a Lucene
contrib so that the maximum amount of projects can benefit, but I would very much like to
hear some thoughts on this.

Thank you for listening,
Toke Eskildsen
To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message