From: Nathan Kurz
To: lucy-dev@lucene.apache.org
Date: Sun, 27 Dec 2009 16:45:53 -0800
Subject: Re: SortWriter memory costs

On Sun, Dec 20, 2009 at 7:08 PM, Marvin Humphrey wrote:
> On Sat, Dec 19, 2009 at 12:36:37PM -0800, Nathan Kurz wrote:
>> > Over on the Lucene java-user list, we're discussing how to rein in
>> > SortWriter's memory costs:
>> >
>> >   http://markmail.org/message/zcgb6ak24di42rev
>>
>> I'm not sure that I understand your approach here.  Could you offer
>> some context?
>
> Sure, the easiest way to do that is to explain how SortWriter works now and
> why it consumes too much memory.  I'll use JavaScript syntax for the
> pseudocode.[1]

Thanks for the detailed description. I'm more confident in my understanding
now, although I'm sure there are still gaps. The JavaScript is a fine
approach, although I think pseudo-C would be fine too.

> Our problem is the unique_values hash.  For fields with low cardinality -- for
> instance if our "category" field has only a few possible values -- it never
> gets very big.  But for fields with high cardinality -- such as a "product_id"
> field where every value is unique -- it gets huge and consumes a ton of RAM.
>
> That's what we need to solve.

Patient -- Doctor, every time I poke myself in the eye with this stick I
get a sharp pain.
Doctor -- Well then stop poking yourself in the eye with a stick!

Patient -- Every time I try to sort by a text field on a gigantic index
for what should be a numeric field, my memory usage blows up.
Doctor -- Well, I suppose we could wrap the end of the stick with this
dirty rag to see if it blunts the pain...
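To be concrete about which stick we're talking about, my mental model of the
current scheme -- in the same pseudocode spirit, with names borrowed loosely
from your description rather than from the real SortWriter code -- is roughly
this:

  // My mental model of the current per-segment accumulation -- not the real code.
  function accumulateSortField(docs, field) {
      var unique_values = {};   // one entry per *distinct* value
      var doc_values    = [];   // one entry per document
      for (var doc_id = 0; doc_id < docs.length; doc_id++) {
          var text = docs[doc_id][field];
          if (!(text in unique_values)) {
              unique_values[text] = { text: text };   // copies the string
          }
          doc_values[doc_id] = unique_values[text];
      }
      // RAM grows with the number of unique values: trivial for "category",
      // brutal for a "product_id" field where every value is distinct.
      return { unique_values: unique_values, doc_values: doc_values };
  }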
My dominant impression is that converting text fields to ordinals to allow
easier sorting is an ugly hack that should not be encouraged. Yes, it works
if you're willing to fall back on alternate solutions for sharding and
segments, able to live with inefficient joins with text-matching, and OK
with making your indexes essentially unmodifiable. But isn't there a better
way? Let's come back to that later, though...

> When we finish off the segment, the present SortWriter writes three files for
> each sortable text field: ords, offsets, and UTF-8 character data:
>
>   seg_1/sort-2.ord  // <--- The "2" is the field number.
>   seg_1/sort-2.ix
>   seg_1/sort-2.dat

To check my understanding: 'ord' is for ordinal, and conceptually is the
same as the term-id in a sorted lexicon. The '.ord' file is an array of
ordinals indexed by doc-id (although bit-compressed). Conceptually, looking
up ord[doc-id] gives you the ordinal of the value of 'category' for that
doc-id.

'ix' is for index, and 'dat' is for data. The '.ix' file is an array of
offsets into the '.dat' file, so that to look up the text representation of
ordinal 'o', one starts reading '.dat' at offset ix[o]. Putting it together,
one looks up the text representation of 'category' for a doc-id by reading
'.dat' at offset ix[ord[doc-id]].

> First, SortWriter creates an array of unique sorted values by sorting the keys
> in the unique_values hash.

It seems like the main problem would be the overhead of the hash. Are you
trying to use host language hashes, or are you doing something custom? My
first thought would be to write a simple custom hash that stores offsets
into '.dat' rather than values. If a string isn't in the hash, append it to
the '.dat' file (lexicon) and save the offset. Common strings will thus tend
to appear at the start of the file, rare ones at the end. Given the fixed
width of the offsets, you should be able to get the overhead of the hash
down to something very small.

> Right now, SortWriter is the only DataWriter component that has a problem with
> RAM footprint -- all the other components (LexWriter, DocWriter,
> PostingsWriter, etc) do a good job of keeping RAM usage under control and
> scale up beyond the practical size for an index on one machine.

This is all per-segment, right? How many docs are you planning to target per
segment? How many do you have to get to before the RAM usage is out of
control?

100 million seems like a lot, to me. 100 million docs with a plain text
Unique_ID field and each entry in the hash taking up 100 bytes (worst case
current?) would be 10GB, and at the boundary of in-control. If you can get
that overhead down to 10 bytes (1 GB), it seems like you'd be firmly in
control again, even if you have multiple such fields. And if not, you could
always do 1 million docs per segment and merge them later.

> Yes.  Instead of using a hash to perform the uniquing, we'd use a file.
>
> Think of it this way: we're dispensing with the unique_values hash and using
> only the doc_values array.  At segment finish time, we use that doc_values
> array to create a sorted list of doc_ids.

OK, so you don't worry about duplicating strings in the data file --- you
just write out everything and save the offset. Then when you are done, you
sort the index file by the values of the stored strings.
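If that's right, segment finish time would look something like this -- again
pseudocode; the datRead() accessor and the ordinal-assignment step are my
guesses at the details, not your actual plan:

  // My reading of the file-based uniquing proposal -- a sketch, not the real plan.
  // doc_values[doc_id] holds the offset of that doc's value in the .dat file;
  // datRead(offset) stands in for however the string gets read back.
  function finishSortField(doc_values, datRead) {
      var doc_ids = [];
      for (var doc_id = 0; doc_id < doc_values.length; doc_id++) {
          doc_ids.push(doc_id);
      }

      // Indirect sort: order doc_ids by the strings their offsets point at.
      doc_ids.sort(function (a, b) {
          var va = datRead(doc_values[a]);
          var vb = datRead(doc_values[b]);
          return va < vb ? -1 : (va > vb ? 1 : 0);
      });

      // Walk the sorted list, bumping the ordinal whenever the value changes.
      var ords = [];
      var ord  = -1;
      var prev = null;
      for (var i = 0; i < doc_ids.length; i++) {
          var value = datRead(doc_values[doc_ids[i]]);
          if (value !== prev) { ord++; prev = value; }
          ords[doc_ids[i]] = ord;
      }
      return ords;
  }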
This seems like a small win in the case of 'Product ID', and a major loss
for a field like 'Gender' that has only a limited number of values. Do you
really want to indirectly sort 100,000,000 strings that have only two (OK,
7 for the Bay Area) distinct values? The windowing function Toke alludes to
would solve most of this, but once you're doing that it doesn't seem much
harder to go all the way and do a low-overhead hash.

My overall impression is that this approach would work, but that there are
some tradeoffs and it doesn't really buy you that much overall efficiency.
Either you fit in RAM or you don't --- and once you don't, even an SSD isn't
going to keep performance from diving off the cliff.

>> I'd wonder how you'd merge results between shards or sort the results
>> of a text query.
>
> At search time, we use the mmap'd ords array to perform all sort comparisons
> within a segment -- which is lightning fast.  :)  For comparison across
> segments, we use actual values, which is slower but not that expensive in the
> grand scheme of things, since it scales with the number of segments rather
> than the number of documents.

I'm still not quite understanding your approach. I just found this email
with GMail by searching for 'marvin', so let's use that as a base case for
ordering. Making up syntax, I enter 'marvin' in the box and Google
translates it to something like this: "MATCH marvin IN all-fields ORDER BY
date LIMIT 10".

There are several ways to handle this query. Which one are you aiming for?

1) Incrementally find hits for 'marvin', look up ord[doc-id] at the time of
   each hit, score as 1/ord, and keep the top 10.

2) Find all hits for 'marvin', sort them all using the mmapped ord array,
   and return the top 10.

3) Find all hits for 'marvin', then use this to filter an ord-sorted list
   of doc-ids, stopping once one has found 10 matches.

It sounds like you are aiming for 2), but this seems like it would be
unwieldy when searching for a common term: it seems expensive to sort a
list that big. I'm also not sure how to break ties by using the score for
the text term. I'm most comfortable with 3) because of its flexibility, but
can see the advantage of 1) for efficiency on rare terms. If you were to do
something closer to 2), I'd be tempted to just look up offset[doc-id] and
use the string directly to do the sorting.

>> What's the main use case you are targeting?
>
> All queries with a SortSpec that sorts on a field and all RangeQueries use our
> mmap'd sort caches.

The explanation up to here is wonderfully clear, but this seems a little
circular as a use case. Is the GMail example a reasonable one? Is there a
better one?

For me, a useful case to think about would involve both full-text search
and a range query, with results ordered by the value used in the range
query and ties broken by strength of text match. Something like "MATCH word
IN title WHERE rating > 3 ORDER BY rating DESCENDING". In almost all the
non-contrived cases I can come up with, the ranges are dates or numerics,
not text fields. I can come up with some cases where I'd want to order
alphabetically, but not that many.

Which brings me back to my opening rant... Ordinals serve as a minimal
monotonic perfect hash for the input strings. The actual value of the
ordinal is not that useful --- more commonly, you just want a fast way of
comparing two strings, and comparing two ordinals turns out to be faster.
If the lexicon itself is ordered, the offsets into the lexicon (data file),
while not minimal, will compare the same.
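In pseudocode (ords, offsets, and a datRead() accessor assumed to be in
scope; the names are mine), the point is just that these all produce the
same ordering for a segment's sort cache:

  // Three ways to compare two docs on one sortable field -- same ordering,
  // different costs and prerequisites.
  function cmpByValue(a, b) {     // no precomputation, but a string compare per call
      var va = datRead(offsets[a]);
      var vb = datRead(offsets[b]);
      return va < vb ? -1 : (va > vb ? 1 : 0);
  }

  function cmpByOrd(a, b) {       // cheapest, but requires the precomputed .ord array
      return ords[a] - ords[b];
  }

  function cmpByOffset(a, b) {    // only valid if .dat itself was written in sorted order
      return offsets[a] - offsets[b];
  }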
Since the cases that interest me most are those where the index can be
modified without being rebuilt, I'd rather not require that the lexicon
itself be ordered, since it's awfully convenient just to add a new string by
appending. I'm not even sure you need to have the ordinals precalculated at
all --- an ordered list of (doc-id, offset) pairs seems like it would be
sufficient. To be editable, you'd probably need to keep this in a BTree
variant, or some other page-oriented ordered data structure. The nice part
of this is that it would be entirely pay-as-you-go: if you add to this file
directly, you'd have minimal memory usage at all steps of the process, and
once you've added the last document, there's no giant sorting step that
needs to happen.

This would be hard, though, so I'd suggest just sticking with your current
approach and making your hash slightly more efficient. Then think about how
you can handle non-text fields natively --- instead of ords and offsets,
handle dates, times, ints, floats, ratings, etc. directly. You can keep
using an ord file for text, but treat this as an optimization rather than a
requirement. And think about how to make the intersegment merges efficient.

One thing I'm becoming more confident of is that the handling of segments
and shards should be identical. Multi-segment merges should not require
direct access to in-memory data structures; rather, all data necessary for
aggregation should be presented along with the results. The results
returned from a single segment, from a shard, or from the whole shebang
should be interchangeable. If you can do this efficiently, I think you are
pretty much guaranteed to scale to many cores as well as many machines.

Sorry if the ranting was too incoherent,

Nathan Kurz
nate@verse.com