From: Matt Corgan
To: dev@hbase.apache.org, lars hofhansl
Date: Mon, 3 Sep 2012 11:24:44 -0700
Subject: Re: RPC KeyValue encoding

> For CellAppender, is compile() equivalent to flushing ?

Yes.  I'll rename CellAppender to CellOutputStream.  The concept is very
similar to a GZIPOutputStream where you write bytes to it and periodically
call flush(), which spits out a compressed byte[] behind the scenes.  The
server would write Cells to a CellOutputStream, flush them to a byte[], and
send the byte[] to the client.  There could be a default encoding, and the
client could send a flag to override the default.

Greg, you mention omitting fields that are repeated from one KeyValue to
the next.  I think this is basically what the existing DataBlockEncoders
are doing for KeyValues stored on disk (see PrefixKeyDeltaEncoder for
example).  I'm thinking we can use the same encoders for encoding on the
wire.  Different implementations will have different performance
characteristics, where some may be better for disk and others for RPC, but
the overall intent is the same.

Matt

On Sun, Sep 2, 2012 at 2:56 PM, lars hofhansl wrote:

> Your "coarse grain" option is what I had in mind in my email. I love the
> option of not needing to get it all right in 0.96.
>
> You, Matt, and I could talk and work out the details and get it done.
>
> -- Lars
>
> ----- Original Message -----
> From: Gregory Chanan
> To: dev@hbase.apache.org
> Cc:
> Sent: Sunday, September 2, 2012 12:52 PM
> Subject: Re: RPC KeyValue encoding
>
> Lars,
>
> If we make the KeyValue wire format flexible enough I think we'll be able
> to tackle the KV as an interface work later.  Just throwing out some ideas
> here:
>
> We could have a byte at the front of each KV serialization format that
> gives various options in each bit e.g.
> Omits Rows / Omits Family / Omits Qualifier / Omits Timestamp / Omits Value
> / plus some extra bytes for compression options and extensions.  Then we
> just need to define where the KV gets its field if it is omitted, e.g. from
> the previous KV in the RPC that had that field filled in.  We sort of have
> this with the optional fields already, although I don't recall exactly how
> protobuf handles those (we'd probably have to do some small restructuring);
> what's new is defining what it means when a field is omitted.
>
> There's some overhead with the above for small KVs, so you could also go
> coarser grain, e.g. the Get request/response could have a similar options
> byte like:
> All Share Same Row / All Share Same Family / ... / and one of the bits
> could turn on the finer grain options above (per KeyValue).
>
> The advantage of this is that all we'd have to get right in 0.96.0 is the
> deserialization.  The serialization could just send without any of the
> options turned on.  And we could experiment later with each specific RPC
> call what the best options to use are, as well as what storage to actually
> use client/server side, which you discuss.
>
> Greg
>
> On Sun, Sep 2, 2012 at 9:04 AM, Ted Yu wrote:
>
> > Thanks for the update, Matt.
> >
> > w.r.t. Cell class, since it is so fundamental, should it reside in
> > org.apache.hadoop.hbase namespace as KeyValue class does ?
> > For CellAppender, is compile() equivalent to flushing ?
> >
> > Looking forward to your publishing on the reviewboard.
> >
> > On Sat, Sep 1, 2012 at 11:29 PM, Matt Corgan wrote:
> >
> > > RPC encoding would be really nice since there is sometimes significant
> > > wire traffic that could be reduced many-fold.  I have a particular
> > > table that I scan and stream to a gzipped output file on S3, and I've
> > > noticed that while the app server's network input is 100Mbps, the
> > > gzipped output can be 2Mbps!
> > >
> > > Finishing the PrefixTree has been slow because I've saved a couple
> > > tricky issues to the end and am light on time.  I'll try to put it on
> > > reviewboard Monday despite a known bug.  It is built with some of the
> > > ideas you mention in mind, Lars.  Take a look at the Cell
> > > <https://github.com/hotpads/hbase/blob/prefix-tree/hbase-common/src/main/java/org/apache/hadoop/hbase/cell/Cell.java>
> > > and CellAppender
> > > <https://github.com/hotpads/hbase/blob/prefix-tree/hbase-common/src/main/java/org/apache/hadoop/hbase/cell/appender/CellAppender.java>
> > > classes and their comments.  The idea with the CellAppender is to
> > > stream cells into it and periodically compile()/flush() into a byte[]
> > > which can be saved to an HFile or (eventually) sent over the wire.
> > > For example, in HRegion.get(..), the CellAppender would replace the
> > > "ArrayList results" collection.
> > >
> > > After introducing the Cell interface, the trick to extending the
> > > encoded cells up the HBase stack will be to reduce the reliance on
> > > stand-alone KeyValues.  We'll want things like the Filters and
> > > KeyValueHeap to be able to operate on reused Cells without
> > > materializing them into full KeyValues.
> > > That means that something like StoreFileScanner.peek() will not work
> > > because the scanner cannot maintain the state of the current and next
> > > Cells at the same time.  See CellCollator
> > > <https://github.com/hotpads/hbase/blob/prefix-tree/hbase-common/src/main/java/org/apache/hadoop/hbase/cell/collator/CellCollator.java>
> > > for a possible replacement for KeyValueHeap.  The good news is that
> > > this can be done in stages without major disruptions to the code base.
> > >
> > > Looking at PtDataBlockEncoderSeeker
> > > <https://github.com/hotpads/hbase/blob/prefix-tree/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/PtDataBlockEncoderSeeker.java>,
> > > this would mean transitioning from the getKeyValue() method that
> > > creates and fills a new KeyValue every time it's called to the
> > > getCurrentCell() method which returns a reference to a Cell buffer
> > > that is reused as the scanner proceeds.  Modifying a reusable Cell
> > > buffer rather than rapidly shooting off new KeyValues should
> > > drastically reduce byte[] copying and garbage churn.
> > >
> > > I wish I understood the protocol buffers more so I could comment
> > > specifically on that.  The result sent to the client can possibly be a
> > > plain old encoded data block (byte[]/ByteBuffer) with a similar header
> > > to the one encoded blocks have on disk (2 byte DataBlockEncoding id).
> > > The client then uses the same CellScanner
> > > <https://github.com/hotpads/hbase/blob/prefix-tree/hbase-common/src/main/java/org/apache/hadoop/hbase/cell/scanner/CellScanner.java>
> > > that the server uses when reading blocks from the block cache.
> > > A nice side-effect of sending the client an encoded byte[] is that the
> > > Java client can run the same decoder that the server uses, which
> > > should be tremendously faster and more memory efficient than the
> > > current method of building a pointer-heavy result map.  I had
> > > envisioned this kind of thing being baked into ClientV2, but I guess
> > > it could be wrangled into the current one if someone wanted.
> > >
> > > food for thought... cheers,
> > > Matt
> > >
> > > ps - I'm travelling tomorrow so may be silent on email
> > >
> > > On Sat, Sep 1, 2012 at 9:03 PM, lars hofhansl wrote:
> > >
> > > > In 0.96 we are changing the wire protocol to use protobufs.
> > > >
> > > > While we're at it, I am wondering whether we can optimize a few
> > > > things:
> > > >
> > > > 1. A Put or Delete can send many KeyValues, all of which have the
> > > > same row key and many will likely have the same column family.
> > > > 2. Likewise a Scan result or Get is for a single row.  Each KV will
> > > > again have the same row key and many will have the same column
> > > > family.
> > > > 3. The client and server do not need to share the same KV
> > > > implementation as long as they are (de)serialized the same.  KVs on
> > > > the server will be backed by a shared larger byte[] (the block reads
> > > > from disk), the KVs in the memstore will probably have the same
> > > > implementation (to use slab, but maybe even here it would be
> > > > beneficial to store the row key and CF separately and share between
> > > > KVs where possible).  Client KVs on the other hand could share a row
> > > > key and/or column family.
> > > >
> > > > This would require a KeyValue interface and two different
> > > > implementations; one backed by a byte[], another that stores the
> > > > pieces separately.  Once that is done one could even envision KVs
> > > > backed by a byte buffer.
> > > >
> > > > Both (de)serialize the same, so when the server serializes the KVs
> > > > it would send the row key first, then the CF, then column, TS,
> > > > finally followed by the value.  The client could deserialize this
> > > > and directly reuse the shared part in its KV implementation.
> > > > That has the potential to significantly cut down client/server
> > > > network IO and save memory on the client, especially with wide
> > > > columns.
> > > >
> > > > Turning KV into an interface is a major undertaking.  Would it be
> > > > worth the effort?  Or maybe the RPC should just be compressed?
> > > >
> > > > We'd have to do that before 0.96.0 (I think), because even protobuf
> > > > would not provide enough flexibility to make such a change later -
> > > > which incidentally leads to another discussion about whether client
> > > > and server should do an initial handshake to detect each other's
> > > > version, but that is a different story.
> > > >
> > > > -- Lars
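
For reference, the per-KV options byte that Greg describes in this thread
(one bit per omitted field, with an omitted field inherited from the
previous KV in the same RPC) can be sketched roughly as below.  This is
only an illustration under stated assumptions: FlaggedKvCodec, its method
names, and the bit layout are hypothetical and are not HBase code or any
proposed wire format; fields are modelled as plain strings, the timestamp
is left out for brevity, and the stream starts with a KV count.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch (not HBase API): an options byte per KV marks omitted fields. */
final class FlaggedKvCodec {
    // Assumed field order: 0 = row, 1 = family, 2 = qualifier, 3 = value.
    // Bit i set in the options byte means "field i omitted, reuse the previous KV's".
    static final int FIELDS = 4;

    static byte[] encode(List<String[]> kvs) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(kvs.size());
            String[] prev = null;
            for (String[] kv : kvs) {
                int flags = 0;
                for (int i = 0; i < FIELDS; i++) {
                    if (prev != null && prev[i].equals(kv[i])) {
                        flags |= 1 << i;          // field repeats, so omit it
                    }
                }
                out.writeByte(flags);
                for (int i = 0; i < FIELDS; i++) {
                    if ((flags & (1 << i)) == 0) {
                        out.writeUTF(kv[i]);      // only fields that changed go on the wire
                    }
                }
                prev = kv;
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static List<String[]> decode(byte[] wire) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire));
            int count = in.readInt();
            List<String[]> kvs = new ArrayList<>(count);
            String[] prev = null;
            for (int k = 0; k < count; k++) {
                int flags = in.readUnsignedByte();
                String[] kv = new String[FIELDS];
                for (int i = 0; i < FIELDS; i++) {
                    // An omitted field takes its value from the previous decoded KV.
                    kv[i] = (flags & (1 << i)) != 0 ? prev[i] : in.readUTF();
                }
                kvs.add(kv);
                prev = kv;
            }
            return kvs;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the first KV always has an options byte of zero, a sender that
never sets any bits still produces a stream this decoder accepts, which is
the property Greg points out: only the deserialization side has to be
right up front.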