From: Toke Eskildsen
To: dev@lucene.apache.org
Date: Sat, 20 Nov 2010 12:49:24 +0100
Subject: Where do I go with ordinals?
A while ago I started on LUCENE-2369 with the goal of making fast and memory-efficient locale-aware sorting in Lucene. During this I've wandered quite a bit off course while exploring the possibilities in ordinals. I've discovered that two building blocks enable sorting, (hierarchical) faceting, index lookup and range searches in a way that supports custom ordering and has the trade-offs "slow initialization, fast execution, low memory overhead". I would like to share my thoughts on this, and I would like to hear if and how I should move forward with it.

At the core, this is about using ordinals instead of the terms themselves. Index-wide ordinals. For Lucene trunk, SegmentReaders support ordinal-to-term resolving directly, albeit not super optimized. Other IndexReaders do not support this out of the box, and there's the whole problem of how to handle duplicate terms. As part of LUCENE-2369 I implemented

Building block 1: A map from custom ordered positions to term ordinals, without duplicates. Example: Let's look at the terms for a given field split over 3 segments:

Segment 1: A, B, C
Segment 2: B, D
Segment 3: A, E

The index-wide ordered list of the terms would be A, B, C, D, E. The map from ordered positions to ordinals would be (segment 1, ordinal 0), (s 1, o 1), (s 1, o 2), (s 2, o 1), (s 3, o 1).

This index-level map takes up #logical_positions*log2(#terms) bits (PackedInts to the rescue), rounded up. For the example above, this is 5*log2(7) bits.
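The merge behind building block 1 can be sketched in Java. This is my own minimal illustration, not code from LUCENE-2369: the class and method names are invented, a plain long[] stands in for PackedInts, and a (segment, ordinal) pair is packed as a single global ordinal (the segment's base offset plus the in-segment ordinal), which is why each entry needs log2(total #terms) bits.

```java
import java.util.*;

/**
 * Hypothetical sketch of building block 1: merge the sorted term lists of
 * each segment into one duplicate-free, index-wide ordered list, and record
 * for each ordered position which (segment, ordinal) pair first provides
 * that term. The pair is packed as base[segment] + ordinal, so a PackedInts
 * structure of log2(total #terms) bits per entry would hold it.
 */
public class OrdinalMapSketch {

    /**
     * @param segments       sorted term lists, one per segment
     * @param orderedTermsOut filled with the index-wide ordered terms
     * @return for each ordered position, the packed global ordinal
     */
    public static long[] buildPositionToOrdinal(String[][] segments,
                                                List<String> orderedTermsOut) {
        // Base offset of each segment in the packed global ordinal space.
        long[] base = new long[segments.length];
        for (int s = 1; s < segments.length; s++) {
            base[s] = base[s - 1] + segments[s - 1].length;
        }
        // TreeMap gives the index-wide order (a custom Comparator would give
        // locale-aware order); putIfAbsent keeps the first provider of a term.
        TreeMap<String, Long> firstProvider = new TreeMap<>();
        for (int s = 0; s < segments.length; s++) {
            for (int o = 0; o < segments[s].length; o++) {
                firstProvider.putIfAbsent(segments[s][o], base[s] + o);
            }
        }
        long[] map = new long[firstProvider.size()];
        int pos = 0;
        for (Map.Entry<String, Long> e : firstProvider.entrySet()) {
            orderedTermsOut.add(e.getKey());
            map[pos++] = e.getValue();
        }
        return map;
    }
}
```

With the three example segments above, the ordered terms come out as A, B, C, D, E and the packed ordinals as 0, 1, 2, 4, 6 (segment bases 0, 3, 5).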
Scaling up to 10 million terms, out of which 5 million are unique, we have 5M*log2(10M) bits = 5M*24 bits = 14 MByte.

If custom order is used, we would like to cache the calculated segment-oriented maps for faster re-opening of the index. As there are no duplicates inside a segment, the sum of the segment maps will be a bit more than #terms*log2(#terms) bits. For 10M terms, this is 29 MByte extra, bringing the total memory requirement up to 43 MByte.

Building the map requires sorting the ordinals at segment level (this can be skipped if the index order is to be used) plus a merge-sort of the segment-oriented ordinals to get the index-oriented ordered positions. This map could be used for custom-order range queries by doing two binary searches and iterating between the points. It could also be used for index lookup. Of course, one might just make a field with a custom order at index time (long live flex), so maybe this way of doing it has little value. However, the building block is a stepping stone for

Building block 2: A map from document IDs to ordered positions for a given field. Building this map is in principle a matter of stepping through the terms from building block 1 and requesting all the document IDs for the terms from all segments. Some memory optimization can be done if there is only 1 term/document, but the general case is handled by keeping an array of entry points for all document IDs into an array of ordered term positions for the document IDs.

Memory requirements for this map are #documents*log2(#documents) + #term_references*log2(#logical_terms) bits. For 20M documents with 40M references to 5M unique terms (aka ordered positions) this is 20M*log2(20M) + 40M*log2(5M) bits = 20M*25 + 40M*23 bits = 169 MByte. On top of that comes the memory requirement for building block 1.
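The entry-point layout of building block 2 can be illustrated with a small Java sketch. Again this is my own hypothetical illustration, not LUCENE-2369 code: plain int[] arrays stand in for the packed structures, and entry i+1 delimits the run of positions belonging to document i, which is where the two terms of the memory estimate above come from.

```java
/**
 * Hypothetical sketch of building block 2: every document's ordered
 * positions are stored back to back in one positions array, and a
 * per-document entry-point array records where each document's run starts.
 * That gives #documents entry points plus #term_references position
 * entries, matching the memory formula in the text.
 */
public class DocToPositions {
    final int[] entry;      // length #documents + 1; entry[d] .. entry[d+1] is doc d's run
    final int[] positions;  // length #term_references; ordered positions, concatenated

    public DocToPositions(int[][] positionsPerDoc) {
        entry = new int[positionsPerDoc.length + 1];
        for (int d = 0; d < positionsPerDoc.length; d++) {
            entry[d + 1] = entry[d] + positionsPerDoc[d].length;
        }
        positions = new int[entry[positionsPerDoc.length]];
        for (int d = 0; d < positionsPerDoc.length; d++) {
            System.arraycopy(positionsPerDoc[d], 0, positions,
                             entry[d], positionsPerDoc[d].length);
        }
    }

    /** Ordered positions for one document: first entry sorts it, all entries facet it. */
    public int[] positionsFor(int docId) {
        return java.util.Arrays.copyOfRange(positions, entry[docId], entry[docId + 1]);
    }
}
```

The 1 term/document special case mentioned above would simply drop the entry array and store one position per document.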
With building blocks 1 + 2 we can implement the rest of the functionality:

Sorting is trivial: The order of two documents is determined by comparing the first integer from building block 2 for each document. We could also specialize building block 2 to exactly 1 term/document to save some memory and gain a little more speed.

Faceting is simple: A counting array (the obvious choice is int[] or long[]) of length #ordered_positions from building block 1 is created. The document IDs for all the hits from a search are used to get the ordered positions from building block 2, and the corresponding entries in the counting array are incremented. Extracting the tags by count is done by iterating the counting array with a priority queue, finding the top-X tag ordinals, then extracting the terms themselves with the map from building block 1. Extracting by order is done by iterating and collecting the indexes of all entries where the count is > x (typically 1). As far as I understand, this is more or less the approach used by Solr's field cache faceting.

Hierarchical faceting is more complex. Let's leave it for another post and just say that augmenting building block 1 with about 1 byte/logical term gives us what we want. The code as well as the unit tests are in LUCENE-2369 for the curious.

There is working code that we use in-house with a test setup. It needs cleanup and polishing, but it does not require changes to existing Lucene code. It could be part of Lucene core (for sorting and range search at least), it could be a contrib, or it might belong in Solr (SOLR-64 or maybe SOLR-792), Bobo-browse or somewhere else. My personal preference is a Lucene contrib, so that the maximum number of projects can benefit, but I would very much like to hear some thoughts on this.
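The "iterate the counting array with a priority queue" step for faceting can be sketched like this. The class and method names are my own invention for illustration; a real implementation would resolve the returned positions back to terms through the building-block-1 map.

```java
import java.util.PriorityQueue;

/**
 * Hypothetical sketch of the top-X extraction for faceting: a min-heap of
 * size at most topX is swept over the counting array, evicting the smallest
 * count whenever a larger one appears, so the full array is never sorted.
 */
public class FacetCountSketch {

    /** Returns the ordered positions of the topX highest counts, highest first. */
    public static int[] topPositions(int[] counts, int topX) {
        // Min-heap over {count, position}; the smallest count sits on top.
        PriorityQueue<int[]> heap =
                new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        for (int pos = 0; pos < counts.length; pos++) {
            if (heap.size() < topX) {
                heap.add(new int[]{counts[pos], pos});
            } else if (counts[pos] > heap.peek()[0]) {
                heap.poll();                        // evict the current smallest
                heap.add(new int[]{counts[pos], pos});
            }
        }
        int[] result = new int[heap.size()];
        for (int i = result.length - 1; i >= 0; i--) {
            result[i] = heap.poll()[1];             // smallest out first, fill from the back
        }
        return result;
    }
}
```

This costs O(#ordered_positions * log topX), which is why the counting-array approach stays cheap even for millions of logical terms.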
Thank you for listening,
Toke Eskildsen