incubator-lucy-dev mailing list archives

From Marvin Humphrey <>
Subject SortWriter memory costs
Date Mon, 21 Dec 2009 03:08:51 GMT
On Sat, Dec 19, 2009 at 12:36:37PM -0800, Nathan Kurz wrote:
> > Over on the Lucene java-user list, we're discussing how to rein in
> > SortWriter's memory costs:
> >
> >
> I'm not sure that I understand your approach here.  Could you offer
> some context?  

Sure, the easiest way to do that is to explain how SortWriter works now and
why it consumes too much memory.  I'll use JavaScript syntax for the code
samples.[1]

Assume each document in your index has a field, "category", which has been
marked as "sortable".  Assume that "category" has been assigned the field
number 2 for this segment.

As docs get added, SortWriter keeps track of each doc's value for the field
"category" in both a hash and an array.  

The hash stores unique values:

  var category = doc.fetch('category');
  sort_writer.unique_values['category'][category] = -1;

The array records the value for each doc_id:

  var unique_val = sort_writer.unique_values['category'].find_key(category);
  sort_writer.doc_values['category'][doc_id] = unique_val;

Since the "doc_values" array only stores pointers to keys in the
"unique_values" hash, it doesn't grow very fast -- just one pointer per doc.

Our problem is the unique_values hash.  For fields with low cardinality -- for
instance if our "category" field has only a few possible values -- it never
gets very big.  But for fields with high cardinality -- such as a "product_id"
field where every value is unique -- it gets huge and consumes a ton of RAM.
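
For a rough sense of scale (hypothetical numbers): ten million docs with
unique 20-byte product ids means ten million hash keys, on the order of 200 MB
of character data before we even count per-entry overhead.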

That's what we need to solve.

When we finish off the segment, the present SortWriter writes three files for
each sortable text field: ords, offsets, and UTF-8 character data:

   seg_1/sort-2.ord  // <--- The "2" is the field number.
   seg_1/sort-2.ix
   seg_1/sort-2.dat

First, SortWriter creates an array of unique sorted values by sorting the keys
in the unique_values hash.  

  var sorted = sort_writer.unique_values['category'].keys().sort();

The sort positions are then saved as values in the unique_values hash.

  for (var ord = 0; ord < sorted.length; ord++) {
    var key = sorted[ord];
    sort_writer.unique_values['category'][key] = ord;
  }

The sorted values are written out to disk as two files: offsets and character
data:
  var ix_out  = sort_writer.folder.open_out('seg_1/sort-2.ix');
  var dat_out = sort_writer.folder.open_out('seg_1/sort-2.dat');
  for (var i = 0; i < sorted.length; i++) {
    var unique_val = sorted[i];
    ix_out.write_i64(dat_out.tell());   // record where this value starts in .dat
    dat_out.write_bytes(unique_val.get_ptr8(), unique_val.get_size());
  }
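
For what it's worth, the offsets file exists because the values in the
character data file are variable-width.  Here's a sketch of how a reader might
recover the value for a given ord (hypothetical helper names, not Lucy's
actual InStream API):

  function value_for_ord(ix_in, dat_in, ord, num_uniques) {;                   // fixed-width 8-byte entries
    var start =;
    var end   = (ord + 1 < num_uniques) ?
                                        : dat_in.length();
    var value = new CharBuf();;
    read_bytes_into_charbuf(dat_in, value, end - start);   // hypothetical helper
    return value;
  }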

Next, we create the ordinals array by iterating over the doc_values array and
looking up each ord via the unique_values hash.

  var ords = new Array(doc_max + 1);
  for (var doc_id = 1; doc_id <= doc_max; doc_id++) {
    var key = sort_writer.doc_values['category'][doc_id];
    ords[doc_id] = sort_writer.unique_values['category'][key];
  }

The ords file is simply this ordinals array blasted out to disk after
compressing it to a smaller bit width appropriate to its cardinality. (2
unique values = one bit per doc, 13 unique values = four bits per doc, etc).

  var outstream = sort_writer.folder.open_out('seg_1/sort-2.ord');
  outstream.write_bytes(compressed_ords, compressed_ords_len);
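
As a rough sketch of that bit-width arithmetic -- ceil(log2(cardinality)) bits
per doc, packed however the real format chooses to lay the bits out (the
packer below is for illustration only, not Lucy's on-disk layout):

  // 2 uniques -> 1 bit per doc, 13 uniques -> 4 bits per doc, and so on.
  function bits_per_doc(num_unique) {
    var bits = 1;
    while ((1 << bits) < num_unique) { bits++; }
    return bits;
  }

  // Naive bit-packer, least significant bit first.
  function pack_ords(ords, bits) {
    var packed = new Uint8Array(Math.ceil(ords.length * bits / 8));
    for (var i = 0; i < ords.length; i++) {
      for (var b = 0; b < bits; b++) {
        if (ords[i] & (1 << b)) {
          var bit_pos = i * bits + b;
          packed[bit_pos >> 3] |= 1 << (bit_pos & 7);
        }
      }
    }
    return packed;
  }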

Right now, SortWriter is the only DataWriter component that has a problem with
RAM footprint -- all the other components (LexWriter, DocWriter,
PostingsWriter, etc) do a good job of keeping RAM usage under control and
scale up beyond the practical size for an index on one machine.

> The plan is to store one giant file of all the field values, 

Yes.  Instead of using a hash to perform the uniquing, we'd use a file.

Think of it this way: we're dispensing with the unique_values hash and using
only the doc_values array.  At segment finish time, we use that doc_values
array to create a sorted list of doc_ids.

  var sorted_doc_ids = new Array(doc_max + 1);
  for (var i = 0; i <= doc_max; i++) { sorted_doc_ids[i] = i; }

However, instead of actually storing CharBuf objects in the doc_values array,
we're going to save memory by storing file offsets pointing at a temporary
file where we've *serialized* those values.  During indexing, that costs us
one 64-bit integer per doc per sortable field, which will scale up OK.
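
So the per-doc write path during indexing would look something like the
add_doc sketch earlier, except that the value goes straight to disk
(hypothetical helper names again):

  var temp_out = sort_writer.folder.open_out('seg_1/sort_cache_writer_temp');

  function add_doc(doc_id, category_value) {
    // Keep only an 8-byte file offset in RAM...
    sort_writer.doc_values['category'][doc_id] = temp_out.tell();
    // ...and serialize the value itself into the temp file.
    write_value_from_charbuf(temp_out, category_value);
  }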

To perform comparisons during sorting, we deserialize values from the
temporary file two at a time.

  var instream = sort_writer.folder.open_in('seg_1/sort_cache_writer_temp');
  var value_a  = new CharBuf();
  var value_b  = new CharBuf();

  function compare_doc_ids_by_doc_values(doc_a, doc_b) {
    var position_a = sort_writer.doc_values['category'][doc_a];
    var position_b = sort_writer.doc_values['category'][doc_b];;
    read_value_into_charbuf(instream, value_a);;
    read_value_into_charbuf(instream, value_b);
    return value_a.compare_to(value_b);
  }

  sorted_doc_ids.sort(compare_doc_ids_by_doc_values);

We can derive the ords array using the sorted_doc_ids array and the temp file:

  var ord        = -1;               // bumps to 0 at the first unique value
  var ords       = new Array(doc_max + 1);
  var last_value = new CharBuf();
  var this_value = new CharBuf();
  for (var i = 0; i <= doc_max; i++) {
    var doc_id = sorted_doc_ids[i];
    var file_position = sort_writer.doc_values['category'][doc_id];;
    read_value_into_charbuf(instream, this_value);
    if (!this_value.equals(last_value)) {
      ord++;
      var temp   = last_value;       // swap buffers so last_value keeps this
      last_value = this_value;       // doc's text for the next comparison
      this_value = temp;
    }
    ords[doc_id] = ord;
  }

This proposed algorithm is similar to one that Toke Eskildsen described in the
Lucene java-user thread mentioned above.

However, it has been adapted for some of Lucy's constraints, e.g. no
read/write files.

> I'd wonder how you'd merge results between shards or sort the results
> of a text query.

At search time, we use the mmap'd ords array to perform all sort comparisons
within a segment -- which is lightning fast.  :)  For comparison across
segments, we use actual values, which is slower but not that expensive in the
grand scheme of things, since it scales with the number of segments rather
than the number of documents.
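
In sketch form, with hypothetical names standing in for the real search-time
classes:

  function compare_hits(hit_a, hit_b) {
    if (hit_a.segment == hit_b.segment) {
      // Same segment: compare positions in the mmap'd ords array.  No string
      // comparisons, no object churn.
      var ords = sort_cache_for(hit_a.segment).ords;
      return ords[hit_a.doc_id] - ords[hit_b.doc_id];
    }
    else {
      // Different segments: ords aren't comparable across segments, so fall
      // back to the actual field values.
      return value_for(hit_a).compare_to(value_for(hit_b));
    }
  }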

> What's the main use case you are targeting?

All queries with a SortSpec that sorts on a field and all RangeQueries use our
mmap'd sort caches.

Marvin Humphrey

[1] C isn't expressive enough for code samples, especially since it lacks
    syntax for hash tables and arrays.  Perl, Ruby, and Python all suffice,
    but I don't want to favor one binding language over another.  (You know
    Perl, Mike knows Python, who gets excluded from the conversation blah blah
    blah.)  JavaScript, which we *can't* write bindings for because it lacks
    IO, is sufficiently expressive, spawned the JSON we use in our metadata
    files, and is popular enough that most people should grok it.
