From: Matt Corgan <mcorgan@hotpads.com>
To: dev@hbase.apache.org
Date: Tue, 13 Sep 2011 00:44:15 -0700
Subject: prefix compression implementation

Hi devs,

I put a "developer preview" of a prefix compression algorithm on github.  It still needs some details worked out, a full set of iterators, about 200 optimizations, and a bunch of other stuff... but it successfully passes some preliminary tests, so I thought I'd get it in front of more eyeballs sooner rather than later.

https://github.com/hotpads/hbase-prefix-trie

It depends on HBase's Bytes.java and KeyValue.java, which in turn depend on hadoop.  Those jars are in there, along with some others for full HFile testing in the near future.

A good place to start looking at the code is org.apache.hadoop.hbase.keyvalue.trie.builder.KeyValuePtBuilder.  It's the main driver of the compaction side of things: it takes KeyValues (in sorted order) and generates a byte[] to be saved as a disk block.  For reading it back in, there is trie.compact.read.PtIterator, which takes the byte[] and spits out KeyValues.  The main test class is trie.test.row.RowBuilderTests, which round-trips a bunch of KeyValues to make sure the outputs are the same as the inputs.  trie.compact.row.node.PtRowNode is the format of a single trie node in the underlying byte[].

The current version generates a lot of little objects (garbage) while writing and reading.  I plan to optimize it to the point where most variables are primitives on the stack (at least when reading), but I think these intermediate classes are important for debugging.  I'll probably keep them around going forward and develop a more compact version in parallel.

It uses trie-style compression for the row keys and column qualifiers, where pointers between nodes are compacted ints (rough sketch of the prefix-sharing idea below).
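To make the prefix-sharing idea concrete, here's a tiny self-contained sketch.  It is illustration only -- it front-codes each sorted row key against the previous one and uses variable-length "compacted" ints, whereas the repo builds real trie nodes that many keys can share -- and the class and method names below are made up for the example, not taken from the repo:

import java.util.ArrayList;
import java.util.List;

// Illustration only -- NOT the PtRowNode/KeyValuePtBuilder code.
// Front-codes sorted row keys and stores lengths as variable-length
// ("compacted") ints; a real trie factors shared prefixes into nodes
// that many keys point at, instead of coding against the previous key.
public class PrefixSharingSketch {

    // length of the common prefix of two keys
    static int commonPrefixLength(byte[] a, byte[] b) {
        int max = Math.min(a.length, b.length);
        int i = 0;
        while (i < max && a[i] == b[i]) {
            i++;
        }
        return i;
    }

    // encode a non-negative int in as few bytes as possible, 7 bits per byte
    static void writeCompactedInt(List<Byte> out, int value) {
        while ((value & ~0x7F) != 0) {
            out.add((byte) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.add((byte) value);
    }

    public static void main(String[] args) {
        byte[][] sortedRowKeys = {
            "user1234/profile".getBytes(),
            "user1234/settings".getBytes(),
            "user1235/profile".getBytes(),
        };
        List<Byte> block = new ArrayList<Byte>();
        int rawSize = 0;
        byte[] previous = new byte[0];
        for (byte[] key : sortedRowKeys) {
            rawSize += key.length;
            int shared = commonPrefixLength(previous, key);
            writeCompactedInt(block, shared);               // bytes reused from the previous key
            writeCompactedInt(block, key.length - shared);  // length of the new suffix
            for (int i = shared; i < key.length; i++) {
                block.add(key[i]);                          // only the suffix bytes are stored
            }
            previous = key;
        }
        System.out.println("raw key bytes: " + rawSize + ", encoded block: " + block.size());
    }
}

The real trie goes further: many keys can point at one shared prefix node, which is what lets Gets and Scanners walk the block without re-reading rowKey bytes.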
The format also keeps a list of compacted, de-duped deltas for the timestamps, and if they're all the same it stores only one long per block.  If all of the KeyValue type operations are the same, it stores only one byte per block.  It's designed around efficient cpu cache usage and the elimination of 8-byte pointers, so it should be fast.  Get calls can traverse the trie nodes to dig up a row key while barely loading anything from memory into cache, as opposed to current hbase, which may load the better part of a block into cache.  Scanners/Filters/Comparators can all be designed to be trie-aware so they can iterate through 20 columnQualifiers in the same row without constantly re-scanning the rowKey bytes... etc...

Here are a few questions we'll need to answer at some point:

* Where should this code go?
  - I'd vote for keeping it isolated and limiting references back to the main hbase project, sort of like the gzip/lzo algorithms.
  - If it's strictly isolated, it'll be easier to keep it well tested for correctness/performance and let more people tinker with it to make it faster.  It'll also force us to come up with the right interfaces to allow other compression implementations.

* The current HFileWriter sends KeyValues to the OutputStream as soon as they're processed, but would it be ok to queue up a whole block in memory and write it all at once?
  - I'd vote for yes.  It makes it easier to arrange the data to be more read-friendly.  Also, we're only talking about one block of data, which is presumably a fraction of the block cache's size.

* Should the block bytes be accessed as a byte[] or a ByteBuffer?  I know there's been work on an off-heap cache, but I've read that blocks are pulled on-heap before they're parsed (??)
  - See org.apache.hadoop.hbase.keyvalue.trie.profile.MemoryAccessProfiler for a comparison of byte[] vs ByteBuffer speed tests.  ByteBuffer looks ~2-10x slower, but some people need the off-heap cache.
  - I'll propose maintaining separate reading algorithms for each, given that the underlying bytes are in the exact same format.  It should be possible to copy/paste the code and replace bytes[i] with ByteBuffer.get(i), and create parallel versions of the static util methods (rough illustration at the end of this message).

* Each block has some metadata wrapped in a class called PtBlockMeta.  Does HBase currently have a way to store those values as java primitives on the heap rather than parsing them out of the byte[]/ByteBuffer on every access?  If they have to be parsed out on every block access, that could take more time than the Get query it's trying to perform.
  - I know there's a new Cacheable interface or something like that.  Maybe that already supports it or could be enhanced.

What jiras do you think I should make?

Look forward to hearing people's feedback,
Matt
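P.S.  Regarding the byte[] vs ByteBuffer question above, this is roughly what I mean by parallel read methods over the exact same byte format.  It's illustration only -- the class and method names are made up for the example and it isn't code from the repo:

import java.nio.ByteBuffer;

// Illustration only: the same fixed-width int decode, once over byte[]
// (on-heap blocks) and once over ByteBuffer (e.g. an off-heap cache).
// The logic is identical -- bytes[i] is simply replaced by buffer.get(i).
public class ParallelReadSketch {

    static int readInt(byte[] block, int offset) {
        return ((block[offset]     & 0xFF) << 24)
             | ((block[offset + 1] & 0xFF) << 16)
             | ((block[offset + 2] & 0xFF) << 8)
             |  (block[offset + 3] & 0xFF);
    }

    static int readInt(ByteBuffer block, int offset) {
        // absolute gets only, so the buffer's position is never touched
        return ((block.get(offset)     & 0xFF) << 24)
             | ((block.get(offset + 1) & 0xFF) << 16)
             | ((block.get(offset + 2) & 0xFF) << 8)
             |  (block.get(offset + 3) & 0xFF);
    }

    public static void main(String[] args) {
        byte[] onHeap = {0, 0, 1, 42};
        ByteBuffer offHeap = ByteBuffer.allocateDirect(4);
        offHeap.put(onHeap);
        System.out.println(readInt(onHeap, 0));   // 298
        System.out.println(readInt(offHeap, 0));  // 298 -- same bytes, same answer
    }
}

The static util methods could get the same copy/paste treatment, so the byte[] path never pays for an abstraction over the two access styles.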