From: Toke Eskildsen
To: dev@lucene.apache.org
Date: Sat, 20 Nov 2010 12:49:24 +0100
Subject: Where do I go with ordinals?
A while ago I started on LUCENE-2369 with the goal of making fast and memory-efficient locale-aware sorting in Lucene. During this I've wandered quite a bit off course while exploring the possibilities in ordinals. I've discovered that two building blocks enable sorting, (hierarchical) faceting, index lookup and range searches in a way that supports custom ordering and has the trade-offs "slow initialization, fast execution, low memory overhead". I would like to share my thoughts on this, and I would like to hear if and how I should move forward with it.

At the core, this is about using ordinals instead of the terms themselves. Index-wide ordinals. For Lucene trunk, SegmentReaders support ordinal-to-term resolving directly, albeit not super optimized. Other IndexReaders do not support this out of the box, and there's the whole problem of how to handle duplicate terms. As part of LUCENE-2369 I implemented

Building block 1: A map from custom ordered positions to term ordinals, without duplicates. Example: Let's look at the terms for a given field split over 3 segments:

Segment 1: A, B, C
Segment 2: B, D
Segment 3: A, E

The index-wide ordered list of the terms would be A, B, C, D, E. The map from ordered positions to ordinals would be (segment 1, ordinal 0), (s 1, o 1), (s 1, o 2), (s 2, o 1), (s 3, o 1).

This index-level map takes up #logical_positions*log2(#terms) bits (PackedInts to the rescue), rounded up. For the example above, this is 5*log2(7) bits.
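The merge behind building block 1 can be sketched in Java. This is my own minimal illustration, not code from LUCENE-2369: the class and method names are invented, a plain long[] stands in for PackedInts, and a (segment, ordinal) pair is packed as a single global ordinal (the segment's base offset plus the in-segment ordinal), which is why each entry needs log2(total #terms) bits.

```java
import java.util.*;

/**
 * Hypothetical sketch of building block 1: merge the sorted term lists of
 * each segment into one duplicate-free, index-wide ordered list, and record
 * for each ordered position which (segment, ordinal) pair first provides
 * that term. The pair is packed as base[segment] + ordinal, so a PackedInts
 * structure of log2(total #terms) bits per entry would hold it.
 */
public class OrdinalMapSketch {

    /**
     * @param segments       sorted term lists, one per segment
     * @param orderedTermsOut filled with the index-wide ordered terms
     * @return for each ordered position, the packed global ordinal
     */
    public static long[] buildPositionToOrdinal(String[][] segments,
                                                List<String> orderedTermsOut) {
        // Base offset of each segment in the packed global ordinal space.
        long[] base = new long[segments.length];
        for (int s = 1; s < segments.length; s++) {
            base[s] = base[s - 1] + segments[s - 1].length;
        }
        // TreeMap gives the index-wide order (a custom Comparator would give
        // locale-aware order); putIfAbsent keeps the first provider of a term.
        TreeMap<String, Long> firstProvider = new TreeMap<>();
        for (int s = 0; s < segments.length; s++) {
            for (int o = 0; o < segments[s].length; o++) {
                firstProvider.putIfAbsent(segments[s][o], base[s] + o);
            }
        }
        long[] map = new long[firstProvider.size()];
        int pos = 0;
        for (Map.Entry<String, Long> e : firstProvider.entrySet()) {
            orderedTermsOut.add(e.getKey());
            map[pos++] = e.getValue();
        }
        return map;
    }
}
```

With the three example segments above, the ordered terms come out as A, B, C, D, E and the packed ordinals as 0, 1, 2, 4, 6 (segment bases 0, 3, 5).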
Scaling up to 10 million terms, out of which 5 million are unique, we have 5M*log2(10M) bits = 5M*24 bits = 14 MByte.

If custom order is used, we would like to cache the calculated segment-oriented maps for faster re-opening of the index. As there are no duplicates inside a segment, the sum of the segment maps will be a bit more than #terms*log2(#terms) bits. For 10M terms, this is 29 MByte extra, bringing the total memory requirement up to 43 MByte.

Building the map requires sorting the ordinals at segment level (this can be skipped if the index order is to be used) plus a merge-sort of the segment-oriented ordinals to get the index-oriented ordered positions. This map could be used for custom-order range queries by doing two binary searches and iterating between the points. It could also be used for index lookup. Of course, one might just make a field with a custom order at index time (long live flex), so maybe this way of doing it has little value. However, the building block is a stepping stone for

Building block 2: A map from document IDs to ordered positions for a given field. Building this map is in principle a matter of stepping through the terms from building block 1 and requesting all the document IDs for the terms from all segments. Some memory optimization can be done if there is only 1 term/document, but the general case is handled by keeping an array of entry points for all document IDs into an array of ordered term positions for the document IDs.

Memory requirements for this map are #documents*log2(#documents) + #term_references*log2(#logical_terms) bits. For 20M documents with 40M references to 5M unique terms (aka ordered positions) this is 20M*log2(20M) + 40M*log2(5M) bits = 20M*25 + 40M*23 bits = 169 MByte. On top of that comes the memory requirement for building block 1.
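The entry-point layout of building block 2 can be illustrated with a small Java sketch. Again this is my own hypothetical illustration, not LUCENE-2369 code: plain int[] arrays stand in for the packed structures, and entry i+1 delimits the run of positions belonging to document i, which is where the two terms of the memory estimate above come from.

```java
/**
 * Hypothetical sketch of building block 2: every document's ordered
 * positions are stored back to back in one positions array, and a
 * per-document entry-point array records where each document's run starts.
 * That gives #documents entry points plus #term_references position
 * entries, matching the memory formula in the text.
 */
public class DocToPositions {
    final int[] entry;      // length #documents + 1; entry[d] .. entry[d+1] is doc d's run
    final int[] positions;  // length #term_references; ordered positions, concatenated

    public DocToPositions(int[][] positionsPerDoc) {
        entry = new int[positionsPerDoc.length + 1];
        for (int d = 0; d < positionsPerDoc.length; d++) {
            entry[d + 1] = entry[d] + positionsPerDoc[d].length;
        }
        positions = new int[entry[positionsPerDoc.length]];
        for (int d = 0; d < positionsPerDoc.length; d++) {
            System.arraycopy(positionsPerDoc[d], 0, positions,
                             entry[d], positionsPerDoc[d].length);
        }
    }

    /** Ordered positions for one document: first entry sorts it, all entries facet it. */
    public int[] positionsFor(int docId) {
        return java.util.Arrays.copyOfRange(positions, entry[docId], entry[docId + 1]);
    }
}
```

The 1 term/document special case mentioned above would simply drop the entry array and store one position per document.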
With building blocks 1 + 2 we can implement the rest of the functionality:

Sorting is trivial: The order of two documents is determined by comparing the first integer from building block 2 for each document. We could also specialize building block 2 to exactly 1 term/document to save some memory and gain a little more speed.

Faceting is simple: A counting array (the obvious choice is int[] or long[]) of length #ordered_positions from building block 1 is created. The document IDs for all the hits from a search are used to get the ordered positions from building block 2, and the corresponding entries in the counting array are incremented. Extracting the tags by count is done by iterating the counting array with a priority queue, finding the top-X tag ordinals, then extracting the terms themselves with the map from building block 1. Extracting by order is done by iterating and collecting the indexes of all entries where the count is > x (typically 1). As far as I understand, this is more or less the approach used by Solr's field cache faceting.

Hierarchical faceting is more complex. Let's leave it for another post and just say that augmenting building block 1 with about 1 byte/logical term gives us what we want. The code as well as the unit tests are in LUCENE-2369 for the curious.

There is working code that we use in-house with a test setup. It needs cleanup and polishing, but it does not require changes to existing Lucene code. It could be part of Lucene core (for sorting and range search at least), it could be a contrib, or it might belong in Solr (SOLR-64 or maybe SOLR-792), Bobo-browse or somewhere else. My personal preference is a Lucene contrib, so that the maximum number of projects can benefit, but I would very much like to hear some thoughts on this.
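The "iterate the counting array with a priority queue" step for faceting can be sketched like this. The class and method names are my own invention for illustration; a real implementation would resolve the returned positions back to terms through the building-block-1 map.

```java
import java.util.PriorityQueue;

/**
 * Hypothetical sketch of the top-X extraction for faceting: a min-heap of
 * size at most topX is swept over the counting array, evicting the smallest
 * count whenever a larger one appears, so the full array is never sorted.
 */
public class FacetCountSketch {

    /** Returns the ordered positions of the topX highest counts, highest first. */
    public static int[] topPositions(int[] counts, int topX) {
        // Min-heap over {count, position}; the smallest count sits on top.
        PriorityQueue<int[]> heap =
                new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        for (int pos = 0; pos < counts.length; pos++) {
            if (heap.size() < topX) {
                heap.add(new int[]{counts[pos], pos});
            } else if (counts[pos] > heap.peek()[0]) {
                heap.poll();                        // evict the current smallest
                heap.add(new int[]{counts[pos], pos});
            }
        }
        int[] result = new int[heap.size()];
        for (int i = result.length - 1; i >= 0; i--) {
            result[i] = heap.poll()[1];             // smallest out first, fill from the back
        }
        return result;
    }
}
```

This costs O(#ordered_positions * log topX), which is why the counting-array approach stays cheap even for millions of logical terms.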
Thank you for listening,
Toke Eskildsen