lucy-dev mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject SortCache on a 32-bit OS
Date Sat, 30 Jan 2010 18:11:22 GMT
Greets,

As discussed previously and prototyped in KinoSearch, sort caches consist of
either 2 or 3 "files" (actually, virtual files within the compound file).  

Variable width types:

    .ord
    .ix
    .dat

Fixed width types:

    .ord
    .dat

In the prototype implementation, all of these files get memory-mapped at
SortCache object construction time.  However, for very large indexes, this
poses a problem on 32-bit operating systems.

Multi-gigabyte indexes work fine on a 32-bit OS, provided that you have
enough RAM to keep the whole index resident in the system IO cache.
However, with enough sort fields and enough unique values, the mapped sort
cache files alone can exceed the 32-bit address space under the current design.
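A rough back-of-the-envelope sketch of the address space arithmetic follows. All figures are made up for illustration, and the per-file sizes assume 4-byte ords and 8-byte offsets, which is an assumption rather than the actual file format. `mapped_bytes` is a hypothetical helper.

```python
ADDRESS_SPACE = 2**32  # upper bound; usable user space is smaller in practice

def mapped_bytes(num_docs, num_unique, avg_value_len,
                 ord_width=4, map_values=True):
    """Estimate bytes of address space one sort field keeps mapped."""
    ords = num_docs * ord_width            # .ord: one ordinal per document
    if not map_values:
        return ords                        # only the ords file stays mapped
    ix = (num_unique + 1) * 8              # .ix: 64-bit offsets (assumed)
    dat = num_unique * avg_value_len       # .dat: concatenated values
    return ords + ix + dat

# Hypothetical index: 4 sort fields, 50M docs, 40M unique 20-byte values each.
fields = 4
all_mapped = fields * mapped_bytes(50_000_000, 40_000_000, 20)
ords_only = fields * mapped_bytes(50_000_000, 40_000_000, 20, map_values=False)

print(all_mapped > ADDRESS_SPACE)   # everything mapped: does it overflow?
print(ords_only > ADDRESS_SPACE)    # ords only: does it overflow?
```

With these invented numbers, mapping every file needs about 5.3 GB of address space, while mapping only the ords files needs 0.8 GB.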

To solve this problem, I think we ought to mmap the ords file only and use
sequential reads to recover values -- the same way we do with our lexicons.
The price will be slightly increased CPU costs under a couple of
circumstances:

  * Looking up sortable values at the close of each segment when matching.
  * Finding the ord of a term when preparing a range filter for each segment.

The increased CPU costs come from extra seeks, memory maps, and memory copies.
Right now for TextSortCache objects, we use ViewCharBufs and just assign
pointers within the memory mapped region.  We would have to change to real
CharBufs and perform memory copies, since the mapped region would no longer be
mapped for the life of the parent SortCache -- but that's more robust anyway.
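Here is a minimal sketch of the proposed arrangement, written in Python rather than Lucy's C. Only the ords file is memory-mapped; values are recovered with seek and read, so each one is a copy that outlives the cache. The class name, the file layout (32-bit ords, 64-bit offsets with a trailing end offset), and the binary search in `find_ord` are all assumptions made for illustration, not the prototype's actual code or lookup strategy.

```python
import mmap
import os
import struct

class OrdsOnlySortCache:
    """Toy sort cache: .ord is memory-mapped, .ix/.dat are read on demand."""

    def __init__(self, ord_path, ix_path, dat_path):
        self._ord_file = open(ord_path, "rb")
        # Only the ords file occupies address space for the cache's lifetime.
        self._ords = mmap.mmap(self._ord_file.fileno(), 0,
                               access=mmap.ACCESS_READ)
        self._ix = open(ix_path, "rb")
        self._dat = open(dat_path, "rb")
        # Assumed layout: one 8-byte offset per value plus a final end offset.
        self.num_values = os.path.getsize(ix_path) // 8 - 1

    def ord_for_doc(self, doc_id):
        """Cheap lookup straight out of the mapped region."""
        return struct.unpack_from("<I", self._ords, doc_id * 4)[0]

    def value(self, ord_):
        """Recover a value with extra seeks and reads; returns a copy."""
        self._ix.seek(ord_ * 8)
        start, end = struct.unpack("<qq", self._ix.read(16))
        self._dat.seek(start)
        return self._dat.read(end - start)

    def find_ord(self, term):
        """Return the ord of the first value >= term, e.g. for a range filter."""
        lo, hi = 0, self.num_values
        while lo < hi:
            mid = (lo + hi) // 2
            if self.value(mid) < term:
                lo = mid + 1
            else:
                hi = mid
        return lo

    def close(self):
        self._ords.close()
        for f in (self._ord_file, self._ix, self._dat):
            f.close()
```

Because `value` returns an independent bytes object, a caller can keep it after `close`, which is the robustness gain from copying instead of pointing into a mapped region. The cost shows up in `find_ord`, where every probe now pays for two seeks and two reads.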

I believe that with this plan we can push the index size at which address
space runs out beyond the practical size for a single machine -- even when
you're doing something silly like running a 32-bit OS on a box with 16 gigs of
RAM.

Marvin Humphrey

