lucy-dev mailing list archives

From Nathan Kurz
Subject Re: SortCache on a 32-bit OS
Date Sat, 30 Jan 2010 23:22:51 GMT
On Sat, Jan 30, 2010 at 1:15 PM, Marvin Humphrey wrote:
> On Sat, Jan 30, 2010 at 12:11:41PM -0800, Nathan Kurz wrote:
>> The window where this choice is beneficial is small:  something like
>> 32-bit systems using 2-4 Gig indexes with multiple sortable fields
>> with unique values.   Unless this is the use case that Eventful needs,
> Well, actually... yes, it is.

Then you should do it!  As long as you are designing it around a real
need, it will probably be a good design choice.

> Indexes can actually grow larger than 2-4 GB on such systems and still
> maintain top performance.  Because 32-bit operating systems can exploit the
> full RAM on a machine and use it for system IO cache, you can have indexes
> over 4 GB that stay fully RAM-resident.

Definitely right, but I'm most interested in cases that allow
searching for full quotes, hence no stop words.  In my mind, once you
can't map in positions for the word 'the', you're done.   The obvious
answer to this is that it's segment size, rather than index size, that
matters here.  But isn't this true of sort caches as well?  They don't
cross segments, do they?

> The problem with running out of address space is that there's no warning
> before catastrophic failure, and then no possibility of recovery short of
> rearchitecting your search infrastructure or installing a new operating
> system.  It's a really serious glitch to hit.  It would suck if Eventful hit
> it, but I really don't want anybody else to hit it either.

OK, but you can pretty well catch this at index creation time, can't
you?  And even failing at run time with a clear error (mmap failed:
too large to map) might be preferable to the sticky morass of a
steeply declining performance curve once you start to swap.
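
A minimal sketch of the "fail with a clear error" idea, assuming POSIX open(), fstat() and mmap(). The helper name `map_whole_file` and its message text are hypothetical, made up for illustration; this is not Lucy's API.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: map a whole file read-only, or return NULL and
 * leave a human-readable reason in err. */
static void *map_whole_file(const char *path, size_t *len_out,
                            char *err, size_t err_size) {
    struct stat st;
    void *addr;
    int fd, saved_errno;

    assert(path != NULL && len_out != NULL && err != NULL);

    fd = open(path, O_RDONLY);
    if (fd < 0) {
        snprintf(err, err_size, "open failed for '%s': %s",
                 path, strerror(errno));
        return NULL;
    }
    if (fstat(fd, &st) != 0) {
        snprintf(err, err_size, "fstat failed for '%s': %s",
                 path, strerror(errno));
        close(fd);
        return NULL;
    }
    if (st.st_size <= 0) {
        /* mmap() rejects a zero-length mapping. */
        snprintf(err, err_size, "cannot map '%s': file is empty", path);
        close(fd);
        return NULL;
    }
    /* On a 32-bit build size_t may be narrower than off_t, so a file can
     * be too large to describe as a single mapping at all. */
    if ((uintmax_t)st.st_size > (uintmax_t)SIZE_MAX) {
        snprintf(err, err_size,
                 "mmap failed for '%s': too large to map (%ju bytes)",
                 path, (uintmax_t)st.st_size);
        close(fd);
        return NULL;
    }
    addr = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    saved_errno = errno;
    close(fd); /* an established mapping stays valid after close() */
    if (addr == MAP_FAILED) {
        /* ENOMEM here usually means the address space is exhausted. */
        snprintf(err, err_size, "mmap failed for '%s' (%ju bytes): %s",
                 path, (uintmax_t)st.st_size, strerror(saved_errno));
        return NULL;
    }
    *len_out = (size_t)st.st_size;
    return addr;
}
```

The same size check could presumably run at index creation time, so an oversized segment is reported before any searcher tries to open it.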

> I should specify that the extra calls to mmap() and munmap() occur on 32-bit
> systems only.  For 64-bit systems, we mmap() the whole compound file the
> instant it gets opened, and InStream_Buf() is just a thin wrapper around some
> pointer math.
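
To make the contrast concrete, here is a rough sketch of the arithmetic involved, not Lucy's actual InStream code. The names `MapWindow`, `window_for` and `buf_from_whole_map` are hypothetical. The one hard fact relied on is that the file offset passed to mmap() must be a multiple of the page size, so a windowed read has to round down and remember the difference.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one mmap() window over a large file. */
typedef struct {
    uint64_t map_offset; /* page-aligned file offset to hand to mmap() */
    size_t   map_len;    /* number of bytes to map */
    size_t   delta;      /* position of the requested byte in the window */
} MapWindow;

/* 32-bit style: compute a window that covers `want` bytes starting at
 * `offset`, clamped so it never extends past the end of the file. */
static MapWindow window_for(uint64_t offset, size_t want,
                            uint64_t file_len, size_t page_size) {
    MapWindow w;
    uint64_t avail, need;

    assert(page_size > 0 && offset <= file_len);

    w.map_offset = offset - (offset % page_size); /* round down to a page */
    w.delta      = (size_t)(offset - w.map_offset);
    avail        = file_len - w.map_offset;
    need         = (uint64_t)w.delta + want;
    w.map_len    = (size_t)(need < avail ? need : avail);
    return w;
}

/* 64-bit style: the whole file is already mapped at `base`, so a read
 * at any offset is plain pointer arithmetic. */
static const char *buf_from_whole_map(const char *base, uint64_t offset) {
    return base + offset;
}
```

With a 4096-byte page, a request for 100 bytes at offset 10000 maps from offset 8192 and finds its data 1808 bytes into the window.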

I had not realized that.  This softens my position considerably.  I'm
all for increasing legacy performance so long as it doesn't
complicate the mainline architecture.

>> Sure, these systems will exist, but solve the problem in a way that benefits
>> everyone:  shard it!
> Well, that sort of sharding is not within the scope of Lucy itself.  It's a
> Solr-level solution.

Remind me again:  what's the difference between multiple segments and
sequential sharding?  And if you take that world-view, what stops you
from processing segments in parallel rather than sequentially? :)
Yes, you probably don't want to do all the cross-machine process
management, but designing the architecture so that it's possible to
aggregate and sort results from multiple queries seems well within
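
For what it's worth, the aggregation step is small. A sketch, assuming each segment or shard hands back its own hits already sorted by descending score; `Hit`, `HitList` and `merge_top_k` are hypothetical names, not anything in Lucy, and the fixed `MAX_LISTS` limit is only there to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LISTS 64 /* arbitrary cap for this sketch */

typedef struct { int doc_id; float score; } Hit;

/* One segment's (or shard's) results, sorted by descending score. */
typedef struct { const Hit *hits; size_t count; } HitList;

/* Merge the per-list results into the global top k, best first.
 * Returns the number of hits written to out (at most k). */
static size_t merge_top_k(const HitList *lists, size_t num_lists,
                          Hit *out, size_t k) {
    size_t cursor[MAX_LISTS] = {0}; /* next unread hit in each list */
    size_t n = 0;

    assert(num_lists <= MAX_LISTS);

    while (n < k) {
        size_t best = num_lists; /* sentinel: no candidate found yet */
        size_t i;
        for (i = 0; i < num_lists; i++) {
            if (cursor[i] >= lists[i].count) {
                continue; /* this list is exhausted */
            }
            if (best == num_lists
                || lists[i].hits[cursor[i]].score
                   > lists[best].hits[cursor[best]].score) {
                best = i;
            }
        }
        if (best == num_lists) {
            break; /* every list is exhausted */
        }
        out[n++] = lists[best].hits[cursor[best]++];
    }
    return n;
}
```

Nothing in the merge cares whether the lists were produced one after another or in parallel, or on one machine or several. A real version would also have to make doc ids unique across lists, which this sketch ignores.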

Nathan Kurz
