Subject: Re: How are key values stored in B+ tree nodes on-disk?
From: Randall Leeds
To: dev@couchdb.apache.org
Date: Mon, 23 Jan 2012 20:20:05 -0800

On Mon, Jan 23, 2012 at 19:49, Paul Davis wrote:
> On Mon, Jan 23, 2012 at 9:28 PM, Riyad Kalla wrote:
>>
>> Randall and Paul, I appreciate the explanation and links to the exact
>> spots in code. This helped a lot.
>>
>> I've been reading up on alternative database impls; InnoDB, for example,
>> caps the key length at some upper bound, truncates the key, and puts the
>> remainder in an overflow table; naturally this requires one more random
>> I/O every time you match a long key. In the extreme case (where no key
>> fits and part of every key lives in the overflow table) I think you end
>> up with the key-table idea you were toying with, Paul, or something
>> related.
>>
>> Paul, I feel your pain re: alternative approaches to this. If you can do
>> away with RANGE queries, moving to MD5 or SHA hashes (or something even
>> faster like MurmurHash) is a nice way to get consistently sized keys,
>> which gives you the advantage of paging in a block as raw bytes and only
>> deserializing the 16- or 20-byte chunks at the positions a standard
>> binary search probes, instead of deserializing the whole node.
>>
>> I *really* want to find a way to make that work, but people seem to like
>> RANGE queries.
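
Purely to illustrate that fixed-width-key idea (a made-up Python sketch, not
the Erlang in CouchDB itself; the 16-byte MD5 choice and every name here are
assumptions): the node becomes N digests packed back to back, and a lookup
binary-searches the raw block, slicing out only the slots it actually probes.

import hashlib

KEY_WIDTH = 16  # e.g. an MD5 digest; SHA-1 would make this 20

def find_slot(block, target_key):
    """Binary-search a block of KEY_WIDTH-byte digests for target_key.

    Returns the slot index, or None if the digest is absent. Only the
    slots the search touches are ever sliced out of the raw bytes.
    """
    digest = hashlib.md5(target_key).digest()
    lo, hi = 0, len(block) // KEY_WIDTH
    while lo < hi:
        mid = (lo + hi) // 2
        slot = block[mid * KEY_WIDTH:(mid + 1) * KEY_WIDTH]
        if slot < digest:
            lo = mid + 1
        elif slot > digest:
            hi = mid
        else:
            return mid
    return None

# Tiny usage example: build a "node" from three hashed keys, then look up.
digests = sorted(hashlib.md5(k).digest() for k in [b"apple", b"banana", b"cherry"])
node = b"".join(digests)
print(find_slot(node, b"banana"))   # some slot index in 0..2
print(find_slot(node, b"durian"))   # None

The obvious cost, as noted, is that hashing throws away key ordering, so
range queries are gone.
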
>> I asked this question on SO and a gentleman suggested a header at the
>> beginning of the block that actually lists the offsets of all the keys in
>> the node. For node/block sizes of 32KB or smaller (which seem reasonable),
>> you could safely get away with a 16-bit int for storing the offsets;
>> assuming a typical key size, I imagine you wouldn't have more than 100 or
>> 200 keys in a node; assume 150, and that is 300 bytes' worth of offsets at
>> the beginning of the block. So we spend 7% (assuming a 4KB block size) of
>> the block providing offsets to help make key searching faster. If you were
>> willing to cap blocks at no more than 255 keys you could use 8-bit offset
>> values (meaning any key smaller than 8 bytes would get you in trouble).
>>
>> A limitation to this idea, though, is that when you need to overflow a
>> single block because you are indexing so many or such large keys, you end
>> up with two blocks next to each other, one of them having no header.
>>
>> You can fix this by having a hard-set block size for the DB (say 4KB) that
>> never changes and can never be configured. That way every block,
>> everywhere, has a very tiny "type" header, maybe 8 bits. If the "normal"
>> bit is set, then that block has an offset list at the beginning of it
>> before the key data. If the "overflow" bit is set, then that block is a
>> continuation of the block before it, and right after the 8-bit block
>> header is raw key data that needs to be considered.
>>
>> Just an idea; I am sure there are more elegant ways to solve this. I am
>> more curious whether there is something here... something interesting and
>> worth considering, or if I am just enjoying the smell of my own farts over
>> here ;)
>>
>> -R
>>
>
> I would say that the important constraint in changing this is that
> you can't change the existing behavior without a *very* good reason.
> For instance, there's currently no limit on the key length in the
> btree. Changing this (even to something "long" like 65535 bytes) is
> still a behavior change. And removing range queries is pretty much out
> without changing a major API endpoint.
>
> That said, I invite you to play with this code. I've spent a good deal
> of time trying to make it better but kept coming up short when
> measuring performance (and it's possible that my measurements weren't
> completely representative either). If you come up with anything, give
> me a holler and I'd love to go through and see what you come up with.

Hear, hear! Play with the code!
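
For anyone who wants a concrete toy to poke at, here's a rough sketch of the
offset-table layout Riyad describes (Python again, purely hypothetical, and
not CouchDB's actual on-disk format): a one-byte block type, a key count, a
table of 16-bit offsets, then the raw key data, so a lookup can binary-search
the offsets and decode only the keys it actually compares.

import struct

BLOCK_NORMAL = 0x01  # made-up flag: this block carries its own offset table

def pack_block(keys):
    """Pack sorted byte-string keys as: type byte, count, 16-bit offsets, data."""
    offsets, body = [], b""
    data_start = 1 + 2 + 2 * len(keys)    # type byte + count + offset table
    for k in keys:
        offsets.append(data_start + len(body))
        body += k
    header = struct.pack("<BH", BLOCK_NORMAL, len(keys))
    table = struct.pack("<%dH" % len(keys), *offsets)
    return header + table + body

def lookup(block, target):
    """Binary-search the offset table; only the probed keys are ever decoded."""
    kind, count = struct.unpack_from("<BH", block, 0)
    assert kind == BLOCK_NORMAL, "overflow continuation blocks not handled here"
    offsets = list(struct.unpack_from("<%dH" % count, block, 3))
    offsets.append(len(block))            # sentinel: end of the last key
    lo, hi = 0, count
    while lo < hi:
        mid = (lo + hi) // 2
        key = block[offsets[mid]:offsets[mid + 1]]
        if key < target:
            lo = mid + 1
        elif key > target:
            hi = mid
        else:
            return mid
    return None

blk = pack_block([b"apple", b"banana", b"cherry"])
print(lookup(blk, b"banana"))   # 1
print(lookup(blk, b"durian"))   # None

The fixed-block-size "overflow" bit Riyad mentions would just be a second
block type that lookup() here refuses to touch; chasing continuation blocks
is left out to keep the sketch small.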