From: Nathan Kurz
To: lucy-dev@lucene.apache.org
Date: Sun, 27 Dec 2009 16:45:53 -0800
Subject: Re: SortWriter memory costs

On Sun, Dec 20, 2009 at 7:08 PM, Marvin Humphrey wrote:
> On Sat, Dec 19, 2009 at 12:36:37PM -0800, Nathan Kurz wrote:
>> > Over on the Lucene java-user list, we're discussing how to rein in
>> > SortWriter's memory costs:
>> >
>> >   http://markmail.org/message/zcgb6ak24di42rev
>>
>> I'm not sure that I understand your approach here.  Could you offer
>> some context?
>
> Sure, the easiest way to do that is to explain how SortWriter works now and
> why it consumes too much memory.  I'll use JavaScript syntax for the
> pseudocode.[1]

Thanks for the detailed description. I'm more confident in my understanding
now, although I'm sure there are still gaps. The JavaScript is a fine
approach, although I think pseudo-C would be fine too.

> Our problem is the unique_values hash.  For fields with low cardinality -- for
> instance if our "category" field has only a few possible values -- it never
> gets very big.  But for fields with high cardinality -- such as a "product_id"
> field where every value is unique -- it gets huge and consumes a ton of RAM.
>
> That's what we need to solve.

Patient -- Doctor, every time I poke myself in the eye with this stick I
get a sharp pain.
Doctor -- Well then stop poking yourself in the eye with a stick!

Patient -- Every time I try to sort by a text field on a gigantic index
for what should be a numeric field, my memory usage blows up.
Doctor -- Well, I suppose we could wrap the end of the stick with this
dirty rag to see if it blunts the pain...
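To be concrete about which stick we're talking about, my mental model of the
current scheme -- in the same pseudocode spirit, with names borrowed loosely
from your description rather than from the real SortWriter code -- is roughly
this:

  // My mental model of the current per-segment accumulation -- not the real code.
  function accumulateSortField(docs, field) {
      var unique_values = {};   // one entry per *distinct* value
      var doc_values    = [];   // one entry per document
      for (var doc_id = 0; doc_id < docs.length; doc_id++) {
          var text = docs[doc_id][field];
          if (!(text in unique_values)) {
              unique_values[text] = { text: text };   // copies the string
          }
          doc_values[doc_id] = unique_values[text];
      }
      // RAM grows with the number of unique values: trivial for "category",
      // brutal for a "product_id" field where every value is distinct.
      return { unique_values: unique_values, doc_values: doc_values };
  }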
My dominant impression is that converting text fields to ordinals to allow
easier sorting is an ugly hack that should not be encouraged. Yes, it works
if you're willing to fall back on alternate solutions for sharding and
segments, able to live with inefficient joins with text-matching, and OK
with making your indexes essentially unmodifiable. But isn't there a better
way? Let's come back to that later, though...

> When we finish off the segment, the present SortWriter writes three files for
> each sortable text field: ords, offsets, and UTF-8 character data:
>
>   seg_1/sort-2.ord  // <--- The "2" is the field number.
>   seg_1/sort-2.ix
>   seg_1/sort-2.dat

To check my understanding: 'ord' is for ordinal, and conceptually is the
same as the term-id in a sorted lexicon. The '.ord' file is an array of
ordinals indexed by doc-id (although bit-compressed). Conceptually, looking
up ord[doc-id] gives you the ordinal of the value of 'category' for that
doc-id.

'ix' is for index, and 'dat' is for data. The '.ix' file is an array of
offsets into the '.dat' file, so that to look up the text representation of
ordinal 'o', one starts reading '.dat' at offset ix[o]. Putting it together,
one looks up the text representation of 'category' for a doc-id by reading
'.dat' at offset ix[ord[doc-id]].

> First, SortWriter creates an array of unique sorted values by sorting the keys
> in the unique_values hash.

It seems like the main problem would be the overhead of the hash. Are you
trying to use host language hashes, or are you doing something custom? My
first thought would be to write a simple custom hash that stores offsets
into '.dat' rather than values. If a string isn't in the hash, append it to
the '.dat' file (lexicon) and save the offset. Common strings will thus tend
to appear at the start of the file, rare ones at the end. Given the fixed
width of the offsets, you should be able to get the overhead of the hash
down to something very small.

> Right now, SortWriter is the only DataWriter component that has a problem with
> RAM footprint -- all the other components (LexWriter, DocWriter,
> PostingsWriter, etc) do a good job of keeping RAM usage under control and
> scale up beyond the practical size for an index on one machine.

This is all per-segment, right? How many docs are you planning to target per
segment? How many do you have to get to before the RAM usage is out of
control?

100 million seems like a lot, to me. 100 million docs with a plain text
Unique_ID field and each entry in the hash taking up 100 bytes (worst case
current?) would be 10GB, and at the boundary of in-control. If you can get
that overhead down to 10 bytes (1 GB), it seems like you'd be firmly in
control again, even if you have multiple such fields. And if not, you could
always do 1 million docs per segment and merge them later.

> Yes.  Instead of using a hash to perform the uniquing, we'd use a file.
>
> Think of it this way: we're dispensing with the unique_values hash and using
> only the doc_values array.  At segment finish time, we use that doc_values
> array to create a sorted list of doc_ids.

OK, so you don't worry about duplicating strings in the data file --- you
just write out everything and save the offset. Then when you are done, you
sort the index file by the values of the stored strings.
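If that's right, segment finish time would look something like this -- again
pseudocode; the datRead() accessor and the ordinal-assignment step are my
guesses at the details, not your actual plan:

  // My reading of the file-based uniquing proposal -- a sketch, not the real plan.
  // doc_values[doc_id] holds the offset of that doc's value in the .dat file;
  // datRead(offset) stands in for however the string gets read back.
  function finishSortField(doc_values, datRead) {
      var doc_ids = [];
      for (var doc_id = 0; doc_id < doc_values.length; doc_id++) {
          doc_ids.push(doc_id);
      }

      // Indirect sort: order doc_ids by the strings their offsets point at.
      doc_ids.sort(function (a, b) {
          var va = datRead(doc_values[a]);
          var vb = datRead(doc_values[b]);
          return va < vb ? -1 : (va > vb ? 1 : 0);
      });

      // Walk the sorted list, bumping the ordinal whenever the value changes.
      var ords = [];
      var ord  = -1;
      var prev = null;
      for (var i = 0; i < doc_ids.length; i++) {
          var value = datRead(doc_values[doc_ids[i]]);
          if (value !== prev) { ord++; prev = value; }
          ords[doc_ids[i]] = ord;
      }
      return ords;
  }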
This seems like a small win in the case of 'Product ID', and a major loss
for a field like 'Gender' that has only a limited number of values. Do you
really want to indirectly sort 100,000,000 strings that have only two (OK,
7 for the Bay Area) distinct values? The windowing function Toke alludes to
would solve most of this, but once you're doing that it doesn't seem much
harder to go all the way and do a low-overhead hash.

My overall impression is that this approach would work, but that there are
some tradeoffs and it doesn't really buy you that much overall efficiency.
Either you fit in RAM or you don't --- and once you don't, even an SSD isn't
going to keep performance from diving off the cliff.

>> I'd wonder how you'd merge results between shards or sort the results
>> of a text query.
>
> At search time, we use the mmap'd ords array to perform all sort comparisons
> within a segment -- which is lightning fast.  :)  For comparison across
> segments, we use actual values, which is slower but not that expensive in the
> grand scheme of things, since it scales with the number of segments rather
> than the number of documents.

I'm still not quite understanding your approach. I just found this email
with GMail by searching for 'marvin', so let's use that as a base case for
ordering. Making up syntax, I enter 'marvin' in the box and Google
translates it to something like this: "MATCH marvin IN all-fields ORDER BY
date LIMIT 10".

There are several ways to handle this query. Which one are you aiming for?

1) Incrementally find hits for 'marvin', look up ord[doc-id] at the time of
   each hit, score as 1/ord, and keep the top 10.

2) Find all hits for 'marvin', sort them all using the mmapped ord array,
   and return the top 10.

3) Find all hits for 'marvin', then use this to filter an ord-sorted list
   of doc-ids, stopping once one has found 10 matches.

It sounds like you are aiming for 2), but this seems like it would be
unwieldy when searching for a common term: it seems expensive to sort a
list that big. I'm also not sure how to break ties by using the score for
the text term. I'm most comfortable with 3) because of its flexibility, but
can see the advantage of 1) for efficiency on rare terms. If you were to do
something closer to 2), I'd be tempted to just look up offset[doc-id] and
use the string directly to do the sorting.

>> What's the main use case you are targeting?
>
> All queries with a SortSpec that sorts on a field and all RangeQueries use our
> mmap'd sort caches.

The explanation up to here is wonderfully clear, but this seems a little
circular as a use case. Is the GMail example a reasonable one? Is there a
better one?

For me, a useful case to think about would involve both full-text search
and a range query, with results ordered by the value used in the range
query and ties broken by strength of text match. Something like "MATCH word
IN title WHERE rating > 3 ORDER BY rating DESCENDING". In almost all the
non-contrived cases I can come up with, the ranges are dates or numerics,
not text fields. I can come up with some cases where I'd want to order
alphabetically, but not that many.

Which brings me back to my opening rant... Ordinals serve as a minimal
monotonic perfect hash for the input strings. The actual value of the
ordinal is not that useful --- more commonly, you just want a fast way of
comparing two strings, and comparing two ordinals turns out to be faster.
If the lexicon itself is ordered, the offsets into the lexicon (data file),
while not minimal, will compare the same.
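In pseudocode (ords, offsets, and a datRead() accessor assumed to be in
scope; the names are mine), the point is just that these all produce the
same ordering for a segment's sort cache:

  // Three ways to compare two docs on one sortable field -- same ordering,
  // different costs and prerequisites.
  function cmpByValue(a, b) {     // no precomputation, but a string compare per call
      var va = datRead(offsets[a]);
      var vb = datRead(offsets[b]);
      return va < vb ? -1 : (va > vb ? 1 : 0);
  }

  function cmpByOrd(a, b) {       // cheapest, but requires the precomputed .ord array
      return ords[a] - ords[b];
  }

  function cmpByOffset(a, b) {    // only valid if .dat itself was written in sorted order
      return offsets[a] - offsets[b];
  }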
Since the cases that interest me most are those where the index can be
modified without being rebuilt, I'd rather not require that the lexicon
itself be ordered, since it's awfully convenient just to add a new string by
appending. I'm not even sure you need to have the ordinals precalculated at
all --- an ordered list of (doc-id, offset) pairs seems like it would be
sufficient. To be editable, you'd probably need to keep this in a BTree
variant, or some other page-oriented ordered data structure. The nice part
of this is that it would be entirely pay-as-you-go: if you add to this file
directly, you'd have minimal memory usage at all steps of the process, and
once you've added the last document, there's no giant sorting step that
needs to happen.

This would be hard, though, so I'd suggest just sticking with your current
approach and making your hash slightly more efficient. Then think about how
you can handle non-text fields natively --- instead of ords and offsets,
handle dates, times, ints, floats, ratings, etc. directly. You can keep
using an ord file for text, but treat this as an optimization rather than a
requirement. And think about how to make the intersegment merges efficient.

One thing I'm becoming more confident of is that the handling of segments
and shards should be identical. Multi-segment merges should not require
direct access to in-memory data structures; rather, all data necessary for
aggregation should be presented along with the results. The results
returned from a single segment, from a shard, or from the whole shebang
should be interchangeable. If you can do this efficiently, I think you are
pretty much guaranteed to scale to many cores as well as many machines.

Sorry if the ranting was too incoherent,

Nathan Kurz
nate@verse.com