Subject: Re: How are key values stored in B+ tree nodes on-disk?
From: Randall Leeds
To: dev@couchdb.apache.org
Date: Mon, 23 Jan 2012 20:20:05 -0800

On Mon, Jan 23, 2012 at 19:49, Paul Davis wrote:
> On Mon, Jan 23, 2012 at 9:28 PM, Riyad Kalla wrote:
>>
>> Randall and Paul, I appreciate the explanation and links to the exact
>> spots in code. This helped a lot.
>>
>> I've been reading up on alternative database impls; InnoDB, for example,
>> caps the key length at some upper bound, truncates the key, and puts the
>> remainder in an overflow table; naturally this requires one more random
>> I/O every time you match a long key. In the extreme case (where no key
>> fits and part of every key lives in the overflow table) I think you end
>> up with the key-table idea you were toying with, Paul, or something
>> related.
>>
>> Paul, I feel your pain re: alternative approaches to this. If you can do
>> away with RANGE queries, moving to MD5 or SHA hashes (or something even
>> faster like MurmurHash) is a nice way to get consistently sized keys,
>> which gives you the advantage of paging in a block as raw bytes and only
>> deserializing the 16- or 20-byte chunks at the positions a standard
>> binary search probes, instead of deserializing the whole node.
>>
>> I *really* want to find a way to make that work, but people seem to like
>> RANGE queries.
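
Purely to illustrate that fixed-width-key idea (a made-up Python sketch, not
the Erlang in CouchDB itself; the 16-byte MD5 choice and every name here are
assumptions): the node becomes N digests packed back to back, and a lookup
binary-searches the raw block, slicing out only the slots it actually probes.

import hashlib

KEY_WIDTH = 16  # e.g. an MD5 digest; SHA-1 would make this 20

def find_slot(block, target_key):
    """Binary-search a block of KEY_WIDTH-byte digests for target_key.

    Returns the slot index, or None if the digest is absent. Only the
    slots the search touches are ever sliced out of the raw bytes.
    """
    digest = hashlib.md5(target_key).digest()
    lo, hi = 0, len(block) // KEY_WIDTH
    while lo < hi:
        mid = (lo + hi) // 2
        slot = block[mid * KEY_WIDTH:(mid + 1) * KEY_WIDTH]
        if slot < digest:
            lo = mid + 1
        elif slot > digest:
            hi = mid
        else:
            return mid
    return None

# Tiny usage example: build a "node" from three hashed keys, then look up.
digests = sorted(hashlib.md5(k).digest() for k in [b"apple", b"banana", b"cherry"])
node = b"".join(digests)
print(find_slot(node, b"banana"))   # some slot index in 0..2
print(find_slot(node, b"durian"))   # None

The obvious cost, as noted, is that hashing throws away key ordering, so
range queries are gone.
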
>> I asked this question on SO and a gentleman suggested a header at the
>> beginning of the block that actually lists the offsets of all the keys in
>> the node. For node/block sizes of 32KB or smaller (which seem reasonable),
>> you could safely get away with a 16-bit int for storing the offsets;
>> assuming a typical key size, I imagine you wouldn't have more than 100 or
>> 200 keys in a node; assume 150, and that is 300 bytes' worth of offsets at
>> the beginning of the block. So we spend 7% (assuming a 4KB block size) of
>> the block providing offsets to help make key searching faster. If you were
>> willing to cap blocks at no more than 255 keys you could use 8-bit offset
>> values (meaning any key smaller than 8 bytes would get you in trouble).
>>
>> A limitation to this idea, though, is that when you need to overflow a
>> single block because you are indexing so many or such large keys, you end
>> up with two blocks next to each other, one of them having no header.
>>
>> You can fix this by having a hard-set block size for the DB (say 4KB) that
>> never changes and can never be configured. That way every block,
>> everywhere, has a very tiny "type" header, maybe 8 bits. If the "normal"
>> bit is set, then that block has an offset list at the beginning of it
>> before the key data. If the "overflow" bit is set, then that block is a
>> continuation of the block before it, and right after the 8-bit block
>> header is raw key data that needs to be considered.
>>
>> Just an idea; I am sure there are more elegant ways to solve this. I am
>> more curious whether there is something here... something interesting and
>> worth considering, or if I am just enjoying the smell of my own farts over
>> here ;)
>>
>> -R
>>
>
> I would say that the important constraint in changing this is that
> you can't change the existing behavior without a *very* good reason.
> For instance, there's currently no limit on the key length in the
> btree. Changing this (even to something "long" like 65535 bytes) is
> still a behavior change. And removing range queries is pretty much out
> without changing a major API endpoint.
>
> That said, I invite you to play with this code. I've spent a good deal
> of time trying to make it better but kept coming up short when
> measuring performance (and it's possible that my measurements weren't
> completely representative either). If you come up with anything, give
> me a holler and I'd love to go through and see what you come up with.

Hear, hear! Play with the code!
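
For anyone who wants a concrete toy to poke at, here's a rough sketch of the
offset-table layout Riyad describes (Python again, purely hypothetical, and
not CouchDB's actual on-disk format): a one-byte block type, a key count, a
table of 16-bit offsets, then the raw key data, so a lookup can binary-search
the offsets and decode only the keys it actually compares.

import struct

BLOCK_NORMAL = 0x01  # made-up flag: this block carries its own offset table

def pack_block(keys):
    """Pack sorted byte-string keys as: type byte, count, 16-bit offsets, data."""
    offsets, body = [], b""
    data_start = 1 + 2 + 2 * len(keys)    # type byte + count + offset table
    for k in keys:
        offsets.append(data_start + len(body))
        body += k
    header = struct.pack("<BH", BLOCK_NORMAL, len(keys))
    table = struct.pack("<%dH" % len(keys), *offsets)
    return header + table + body

def lookup(block, target):
    """Binary-search the offset table; only the probed keys are ever decoded."""
    kind, count = struct.unpack_from("<BH", block, 0)
    assert kind == BLOCK_NORMAL, "overflow continuation blocks not handled here"
    offsets = list(struct.unpack_from("<%dH" % count, block, 3))
    offsets.append(len(block))            # sentinel: end of the last key
    lo, hi = 0, count
    while lo < hi:
        mid = (lo + hi) // 2
        key = block[offsets[mid]:offsets[mid + 1]]
        if key < target:
            lo = mid + 1
        elif key > target:
            hi = mid
        else:
            return mid
    return None

blk = pack_block([b"apple", b"banana", b"cherry"])
print(lookup(blk, b"banana"))   # 1
print(lookup(blk, b"durian"))   # None

The fixed-block-size "overflow" bit Riyad mentions would just be a second
block type that lookup() here refuses to touch; chasing continuation blocks
is left out to keep the sketch small.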