hbase-dev mailing list archives

From Matt Corgan <mcor...@hotpads.com>
Subject Re: RPC KeyValue encoding
Date Sun, 02 Sep 2012 06:29:26 GMT
RPC encoding would be really nice since there is sometimes significant wire
traffic that could be reduced many-fold.  I have a particular table that I
scan and stream to a gzipped output file on S3, and I've noticed that while
the app server's network input is 100Mbps, the gzipped output can be 2Mbps!

Finishing the PrefixTree has been slow because I've saved a couple of tricky
issues to the end and am light on time.  I'll try to put it on reviewboard
Monday despite a known bug.  It is built with some of the ideas you mention
in mind, Lars.  Take a look at the
Cell<https://github.com/hotpads/hbase/blob/prefix-tree/hbase-common/src/main/java/org/apache/hadoop/hbase/cell/Cell.java>
and CellAppender<https://github.com/hotpads/hbase/blob/prefix-tree/hbase-common/src/main/java/org/apache/hadoop/hbase/cell/appender/CellAppender.java>
classes and their comments.  The idea with the CellAppender is to stream
cells into it and periodically compile()/flush() into a byte[] which can be
saved to an HFile or (eventually) sent over the wire.  For example, in
HRegion.get(..), the CellAppender would replace the "ArrayList<KeyValue>
results" collection.
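
The CellAppender idea could be sketched roughly like this -- all class and
method names below are simplified stand-ins for illustration, not the actual
prefix-tree code:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of the CellAppender idea: cells are streamed in
// and periodically compiled into one contiguous byte[] instead of being
// collected as individual KeyValue objects.
public class AppenderSketch {
    // Minimal stand-in for the Cell interface: just key and value bytes.
    interface Cell {
        byte[] key();
        byte[] value();
    }

    static class SimpleCellAppender {
        private final ByteArrayOutputStream buf = new ByteArrayOutputStream();

        // Stream a cell in; here we just concatenate length-prefixed
        // key/value, where a real encoder would prefix-compress.
        void append(Cell c) {
            writeChunk(c.key());
            writeChunk(c.value());
        }

        private void writeChunk(byte[] b) {
            buf.write(b.length);          // 1-byte length is enough for a sketch
            buf.write(b, 0, b.length);
        }

        // compile(): materialize everything appended so far as one byte[]
        // that could be written to an HFile block or sent over the wire.
        byte[] compile() {
            return buf.toByteArray();
        }
    }

    static Cell cell(String k, String v) {
        return new Cell() {
            public byte[] key() { return k.getBytes(StandardCharsets.UTF_8); }
            public byte[] value() { return v.getBytes(StandardCharsets.UTF_8); }
        };
    }

    public static void main(String[] args) {
        SimpleCellAppender appender = new SimpleCellAppender();
        appender.append(cell("row1/cf:a", "x"));
        appender.append(cell("row1/cf:b", "y"));
        // 2 cells * (1 + 9 key bytes + 1 + 1 value byte) = 24 bytes
        System.out.println(appender.compile().length);
    }
}
```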

After introducing the Cell interface, the trick to extending the encoded
cells up the HBase stack will be to reduce the reliance on stand-alone
KeyValues.  We'll want things like the Filters and KeyValueHeap to be able
to operate on reused Cells without materializing them into full KeyValues.
 That means that something like StoreFileScanner.peek() will not work
because the scanner cannot maintain the state of the current and next
Cells at the same time.  See
CellCollator<https://github.com/hotpads/hbase/blob/prefix-tree/hbase-common/src/main/java/org/apache/hadoop/hbase/cell/collator/CellCollator.java>
for a possible replacement for KeyValueHeap.  The good news is that this
can be done in stages without major disruptions to the code base.
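
A toy version of the collation constraint might look like this (names are
hypothetical, not the real CellCollator internals): because each scanner
exposes only one live cell, the merge must fully consume the winning cell
before advancing that scanner.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of why peek()-style merging breaks with reused cells:
// each scanner exposes exactly ONE live cell at a time, so the collator must
// consume the winner before advance() invalidates its reference.
public class CollatorSketch {
    // Stand-in scanner over a sorted array; current() is only valid until
    // the next advance() call, mimicking a reused decode buffer.
    static class ReusingScanner {
        private final byte[][] cells;
        private int pos = 0;
        ReusingScanner(byte[][] cells) { this.cells = cells; }
        byte[] current() { return pos < cells.length ? cells[pos] : null; }
        void advance() { pos++; }
    }

    // Merge by comparing the scanners' current cells; the winner's bytes
    // are copied out before its scanner advances and invalidates them.
    static List<String> collate(ReusingScanner a, ReusingScanner b) {
        List<String> out = new ArrayList<>();
        while (a.current() != null || b.current() != null) {
            ReusingScanner winner;
            if (a.current() == null) winner = b;
            else if (b.current() == null) winner = a;
            else winner = compare(a.current(), b.current()) <= 0 ? a : b;
            out.add(new String(winner.current()));  // consume before advance
            winner.advance();
        }
        return out;
    }

    static int compare(byte[] x, byte[] y) {
        return new String(x).compareTo(new String(y));
    }

    public static void main(String[] args) {
        ReusingScanner s1 = new ReusingScanner(
                new byte[][] {"a".getBytes(), "c".getBytes()});
        ReusingScanner s2 = new ReusingScanner(new byte[][] {"b".getBytes()});
        System.out.println(collate(s1, s2));  // [a, b, c]
    }
}
```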

Looking at PtDataBlockEncoderSeeker<https://github.com/hotpads/hbase/blob/prefix-tree/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/PtDataBlockEncoderSeeker.java>,
this would mean transitioning from the getKeyValue() method that creates
and fills a new KeyValue every time it's called to the getCurrentCell()
method which returns a reference to a Cell buffer that is reused as the
scanner proceeds.  Modifying a reusable Cell buffer rather than rapidly
shooting off new KeyValues should drastically reduce byte[] copying and
garbage churn.
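
The allocation difference could be sketched as follows (hypothetical names,
not the actual seeker API):

```java
import java.nio.charset.StandardCharsets;

// Hypothetical contrast between getKeyValue()-style per-cell allocation and
// a getCurrentCell()-style reused buffer, as described above.
public class ReuseSketch {
    static class Seeker {
        private final String[] encoded = {"row1", "row2", "row3"};
        private int pos = -1;
        // Scratch buffer reused across next() calls; no per-cell allocation.
        private final byte[] scratch = new byte[16];
        private int scratchLen = 0;

        boolean next() {
            if (++pos >= encoded.length) return false;
            byte[] src = encoded[pos].getBytes(StandardCharsets.UTF_8);
            System.arraycopy(src, 0, scratch, 0, src.length);
            scratchLen = src.length;
            return true;
        }

        // Old style: allocate a fresh copy for every cell (garbage churn).
        byte[] getKeyValue() {
            byte[] copy = new byte[scratchLen];
            System.arraycopy(scratch, 0, copy, 0, scratchLen);
            return copy;
        }

        // New style: hand back the reused buffer; valid only until next().
        byte[] getCurrentCell() { return scratch; }
        int getCurrentCellLength() { return scratchLen; }
    }

    public static void main(String[] args) {
        Seeker seeker = new Seeker();
        while (seeker.next()) {
            // Caller reads the cell in place; nothing is allocated per cell.
            System.out.println(new String(seeker.getCurrentCell(), 0,
                    seeker.getCurrentCellLength(), StandardCharsets.UTF_8));
        }
    }
}
```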

I wish I understood the protocol buffers more so I could comment
specifically on that.  The result sent to the client can possibly be a
plain old encoded data block (byte[]/ByteBuffer) with a similar header to
the one encoded blocks have on disk (2 byte DataBlockEncoding id).  The
client then uses the same
CellScanner<https://github.com/hotpads/hbase/blob/prefix-tree/hbase-common/src/main/java/org/apache/hadoop/hbase/cell/scanner/CellScanner.java>
that the server uses when reading blocks from the block cache.  A nice
side-effect of sending the client an encoded byte[] is that the java client
can run the same decoder that the server uses which should be tremendously
faster and more memory efficient than the current method of building a
pointer-heavy result map.  I had envisioned this kind of thing being baked
into ClientV2, but I guess it could be wrangled into the current one if
someone wanted.
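
A rough sketch of that framing, assuming a made-up encoding id and layout
(neither is HBase's actual on-disk header):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical wire format sketch: a 2-byte encoding id header followed by
// the encoded cell bytes, mirroring the header that encoded blocks carry on
// disk. The id value and layout here are illustrative only.
public class WireBlockSketch {
    static final short PREFIX_TREE_ID = 4;  // illustrative id, not HBase's

    // Server side: prepend the encoding id to the already-encoded block.
    static byte[] frame(short encodingId, byte[] encodedCells) {
        ByteBuffer buf = ByteBuffer.allocate(2 + encodedCells.length);
        buf.putShort(encodingId);
        buf.put(encodedCells);
        return buf.array();
    }

    // Client side: read the id, then dispatch to the matching decoder --
    // the same decoder the server runs against its block cache.
    static byte[] decode(byte[] wire) {
        ByteBuffer buf = ByteBuffer.wrap(wire);
        short id = buf.getShort();
        if (id != PREFIX_TREE_ID) {
            throw new IllegalStateException("unknown encoding " + id);
        }
        byte[] cells = new byte[buf.remaining()];
        buf.get(cells);
        return cells;  // a real client would wrap this in a CellScanner
    }

    public static void main(String[] args) {
        byte[] encoded = "cells...".getBytes(StandardCharsets.UTF_8);
        byte[] wire = frame(PREFIX_TREE_ID, encoded);
        System.out.println(new String(decode(wire), StandardCharsets.UTF_8));
    }
}
```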

food for thought... cheers,
Matt

PS - I'm travelling tomorrow so I may be silent on email

On Sat, Sep 1, 2012 at 9:03 PM, lars hofhansl <lhofhansl@yahoo.com> wrote:

> In 0.96 we are changing the wire protocol to use protobufs.
>
> While we're at it, I am wondering whether we can optimize a few things:
>
>
> 1. A Put or Delete can send many KeyValues, all of which have the same row
> key and many will likely have the same column family.
> 2. Likewise a Scan result or Get is for a single row. Each KV will again
> have the same row key and many will have the same column family.
> 3. The client and server do not need to share the same KV implementation
> as long as they are (de)serialized the same. KVs on the server will be
> backed by a shared larger byte[] (the block reads from disk), the KVs in
> the memstore will probably have the same implementation (to use slab, but
> maybe even here it would be beneficial to store the row key and CF
> separately and share between KVs where possible). Client KVs on the other
> hand could share a row key and/or column family.
>
> This would require a KeyValue interface and two different implementations:
> one backed by a byte[], another that stores the pieces separately. Once that
> is done one could even envision KVs backed by a byte buffer.
>
> Both (de)serialize the same, so when the server serializes the KVs it
> would send the row key first, then the CF, then column, TS, finally
> followed by the value. The client could deserialize this and directly reuse
> the shared part in its KV implementation.
> That has the potential to significantly cut down client/server network
> IO and save memory on the client, especially with wide columns.
>
> Turning KV into an interface is a major undertaking. Would it be worth the
> effort? Or maybe the RPC should just be compressed?
>
>
> We'd have to do that before 0.96.0 (I think), because even protobuf would
> not provide enough flexibility to make such a change later - which
> incidentally leads to another discussion about whether client and server
> should do an initial handshake to detect each other's version, but that is a
> different story.
>
>
> -- Lars
>
>
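
Lars's shared-prefix serialization idea quoted above might be sketched like
this -- field order and 1-byte length prefixes are illustrative only, not a
proposed wire format:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical sketch: the shared row key and column family are written
// once per row, and only the varying parts (qualifier, timestamp, value)
// repeat per KV, instead of serializing each KV self-contained.
public class SharedPrefixSketch {
    static class Kv {
        final String qualifier; final long ts; final String value;
        Kv(String qualifier, long ts, String value) {
            this.qualifier = qualifier; this.ts = ts; this.value = value;
        }
    }

    static byte[] serializeRow(String row, String family, List<Kv> kvs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeChunk(out, row.getBytes(StandardCharsets.UTF_8));     // row once
        writeChunk(out, family.getBytes(StandardCharsets.UTF_8));  // CF once
        out.write(kvs.size());
        for (Kv kv : kvs) {   // only the varying parts repeat per KV
            writeChunk(out, kv.qualifier.getBytes(StandardCharsets.UTF_8));
            for (int i = 7; i >= 0; i--) {
                out.write((int) (kv.ts >>> (8 * i)));  // big-endian timestamp
            }
            writeChunk(out, kv.value.getBytes(StandardCharsets.UTF_8));
        }
        return out.toByteArray();
    }

    static void writeChunk(ByteArrayOutputStream out, byte[] b) {
        out.write(b.length);  // 1-byte length prefix is enough for a sketch
        out.write(b, 0, b.length);
    }

    public static void main(String[] args) {
        byte[] shared = serializeRow("row1", "cf", List.of(
                new Kv("a", 1L, "v"), new Kv("b", 2L, "w")));
        // Naive per-KV serialization would repeat "row1" and "cf" per cell.
        System.out.println(shared.length);
    }
}
```

The client could deserialize this directly into KVs that all point at one
shared copy of the row key and family, as Lars describes.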
