hbase-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ted Yu <yuzhih...@gmail.com>
Subject Re: RPC KeyValue encoding
Date Sun, 02 Sep 2012 16:04:40 GMT
Thanks for the update, Matt.

w.r.t. Cell class, since it is so fundamental, should it reside in org.
apache.hadoop.hbase namespace as KeyValue class does ?
For CellAppender, is compile() equivalent to flushing ?

Looking forward to your publishing on the reviewboard.

On Sat, Sep 1, 2012 at 11:29 PM, Matt Corgan <mcorgan@hotpads.com> wrote:

> RPC encoding would be really nice since there is sometimes significant wire
> traffic that could be reduced many-fold.  I have a particular table that i
> scan and stream to a gzipped output file on S3, and i've noticed that while
> the app server's network input is 100Mbps, the gzipped output can be 2Mbps!
> Finishing the PrefixTree has been slow because I've saved a couple tricky
> issues to the end and am light on time.  i'll try to put it on reviewboard
> monday despite a known bug.  It is built with some of the ideas you mention
> in mind, Lars.  Take a look at the
> Cell<
> https://github.com/hotpads/hbase/blob/prefix-tree/hbase-common/src/main/java/org/apache/hadoop/hbase/cell/Cell.java
> >
>  and CellAppender<
> https://github.com/hotpads/hbase/blob/prefix-tree/hbase-common/src/main/java/org/apache/hadoop/hbase/cell/appender/CellAppender.java
> >
> classes
> and their comments.  The idea with the CellAppender is to stream cells into
> it and periodically compile()/flush() into a byte[] which can be saved to
> an HFile or (eventually) sent over the wire.  For example, in
> HRegion.get(..), the CellAppender would replace the "ArrayList<KeyValue>
> results" collection.
> After introducing the Cell interface, the trick to extending the encoded
> cells up the HBase stack will be to reduce the reliance on stand-alone
> KeyValues.  We'll want things like the Filters and KeyValueHeap to be able
> to operate on reused Cells without materializing them into full KeyValues.
>  That means that something like StoreFileScanner.peek() will not work
> because the scanner cannot maintain the state of the currrent and next
> Cells at the same time.  See
> CellCollator<
> https://github.com/hotpads/hbase/blob/prefix-tree/hbase-common/src/main/java/org/apache/hadoop/hbase/cell/collator/CellCollator.java
> >
> for
> a possible replacement for KeyValueHeap.  The good news is that this can be
> done in stages without major disruptions to the code base.
> Looking at PtDataBlockEncoderSeeker<
> https://github.com/hotpads/hbase/blob/prefix-tree/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/PtDataBlockEncoderSeeker.java
> >,
> this would mean transitioning from the getKeyValue() method that creates
> and fills a new KeyValue every time it's called to the getCurrentCell()
> method which returns a reference to a Cell buffer that is reused as the
> scanner proceeds.  Modifying a reusable Cell buffer rather than rapidly
> shooting off new KeyValues should drastically reduce byte[] copying and
> garbage churn.
> I wish I understood the protocol buffers more so I could comment
> specifically on that.  The result sent to the client can possibly be a
> plain old encoded data block (byte[]/ByteBuffer) with a similar header to
> the one encoded blocks have on disk (2 byte DataBlockEncoding id).  The
> client then uses the same
> CellScanner<
> https://github.com/hotpads/hbase/blob/prefix-tree/hbase-common/src/main/java/org/apache/hadoop/hbase/cell/scanner/CellScanner.java
> >that
> the server uses when reading blocks from the block cache.  A nice
> side-effect of sending the client an encoded byte[] is that the java client
> can run the same decoder that the server uses which should be tremendously
> faster and more memory efficient than the current method of building a
> pointer-heavy result map.  I had envisioned this kind of thing being baked
> into ClientV2, but i guess it could be wrangled into the current one if
> someone wanted.
> food for thought... cheers,
> Matt
> ps - i'm travelling tomorrow so may be silent on email
> On Sat, Sep 1, 2012 at 9:03 PM, lars hofhansl <lhofhansl@yahoo.com> wrote:
> > In 0.96 we changing the wire protocol to use protobufs.
> >
> > While we're at it, I am wondering whether we can optimize a few things:
> >
> >
> > 1. A Put or Delete can send many KeyValues, all of which have the same
> row
> > key and many will likely have the same column family.
> > 2. Likewise a Scan result or Get is for a single row. Each KV will again
> > will have the same row key and many will have the same column family.
> > 3. The client and server do not need to share the same KV implementation
> > as long as they are (de)serialized the same. KVs on the server will be
> > backed by a shared larger byte[] (the block reads from disk), the KVs in
> > the memstore will probably have the same implementation (to use slab, but
> > maybe even here it would be benificial to store the row key and CF
> > separately and share between KV where possible). Client KVs on the other
> > hand could share a row key and or column family.
> >
> > This would require a KeyValue interface and two different
> implementations;
> > one backed by a byte[] another that stores the pieces separately. Once
> that
> > is done one could even envision KVs backed by a byte buffer.
> >
> > Both (de)serialize the same, so when the server serializes the KVs it
> > would send the row key first, then the CF, then column, TS, finally
> > followed by the value. The client could deserialize this and directly
> reuse
> > the shared part in its KV implementation.
> > That has the potentially to siginificantly cut down client/server network
> > IO and save memory on the client, especially with wide columns.
> >
> > Turning KV into an interface is a major undertaking. Would it be worth
> the
> > effort? Or maybe the RPC should just be compressed?
> >
> >
> > We'd have to do that before 0.96.0 (I think), because even protobuf would
> > not provide enough flexibility to make such a change later - which
> > incidentally leads to another discussion about whether client and server
> > should do an initial handshake to detect each others version, but that
> is a
> > different story.
> >
> >
> > -- Lars
> >
> >

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message