incubator-lucy-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Nathan Kurz" <>
Subject Re: Posting codecs
Date Mon, 29 Sep 2008 20:47:05 GMT
On Sat, Sep 27, 2008 at 3:14 PM, Marvin Humphrey <> wrote:
> What if the index is large enough that it we exceed the system's addressable
> space?

Sadly, I haven't considered this problem at all.  I just naively
thought it wasn't going to be a problem.  And for my own uses, I think
this is true: I've been looking at systems that fit in RAM running on
a 64-bit OS.  Either of these conditions alone should make it that
addressable space is the first limit one hits. But yes, if you have a
32-bit OS with a <4GB address space and a large index, you could run
up against this limit pretty quickly.

Perhaps not as quickly as you'd presume, though.  The way I was
looking at it, it's not the total size of the index that matters, only
the total size of the posting lists involved in the query.   I was
thinking to mmap() each posting list independently.  Even so, with no
stop-words and a sufficiently complex query, you could probably run
out of address space with a <10GB index.

My first instinct would be to ignore the problem, and declare 32-bit
systems as suitable for development, testing,  and small-scale home
use only.  My alternative instinct (perhaps no less unreasonable)
would be to limit index segments to 4GB on 32-bit systems, and
map/unmap on a per-segment basis.

In my mind, once we need to load from disk a subset of index larger
than addressable space, we've already given up on performance and are
just trying to slog through without undue embarrassment.  I'm sure
there are use-cases where this is not the case, however.

I would try very hard to avoid doing buffer bounds checking.  The code
will be so much simpler and faster if it is not done.   If it truly
needs to be done, I would figure out a way to encapsulate it within
the index reader, and just guarantee to the higher levels that they
can read through to end of the current posting without overstepping

> Also, many posting formats won't have a theoretical maximum length.  The
> current implementation of ScorePosting doesn't for instance, because there's no
> ceiling on the number of positions.

True, but I'd hope it's reasonable to require that the collectiive
postings being used to score an individual document can fit into
addressable memory, if not physical memory?  While there is no
theoretical limit on the length of a posting, imposing a practical
limit here seems reasonable.

> We should ape Lucene and give PostingList a Bulk_Read() method, which gets
> called by the TermScorer subclass.

I still don't think this would be a win, at least not for any postings
that include position information.  If you read in a large enough list
to have significant avoidance of function overhead, you'll just blow
your L1 or L2 cache.   For match-only postings there might be a sweet
spot, but I think the real win will still be from making it very fast
to fill a single posting from the raw index.

Nathan Kurz

View raw message