hbase-dev mailing list archives

From "Ramkrishna.S.Vasudevan" <ramkrishna.vasude...@huawei.com>
Subject RE: RPC KeyValue encoding
Date Tue, 04 Sep 2012 09:53:51 GMT

Lars,

Is it possible to have hooks in this layer while we handle version counting so that we can
know what kvs were omitted due to versioning?  

Regards
Ram

> -----Original Message-----
> From: Matt Corgan [mailto:mcorgan@hotpads.com]
> Sent: Tuesday, September 04, 2012 5:25 AM
> To: dev@hbase.apache.org; lars hofhansl
> Subject: Re: RPC KeyValue encoding
> 
> I created separate reviewboard requests for the hbase-common module and the
> hbase-prefix-tree module.  The first one has the Cell interface,
> CellOutputStream, CellScanner, etc. mentioned above.
> 
> hbase-common: https://reviews.apache.org/r/6897/
> hbase-prefix-tree: https://reviews.apache.org/r/6898/
> 
> Will leave tests out for now.  They're on github.
> 
> On Mon, Sep 3, 2012 at 3:17 PM, lars hofhansl <lhofhansl@yahoo.com>
> wrote:
> 
> > That reminds me of another thought that occurred to me while looking at
> > ScanQueryMatcher.
> > I was marveling at the relative complexity of it (together with
> > StoreScanner) - admittedly some of this is my fault (see HBASE-4536 and
> > HBASE-4071).
> >
> > It would be so much easier to read if we had proper iterator trees (at
> > least at some places in the code), similar to how relational databases
> > express execution plans (using scanner and iterator interchangeably).
> >
> > Then:
> > - StoreScanner would just read from HFiles and pass KVs up; in fact it
> > might no longer be needed
> > - Filters can be expressed as an iterator over those KVs.
> > - Handling deleted KVs would be another iterator
> > - So would be the version handling/counting
> > - Scanners would not need to be passed a List<KeyValue> to accumulate KVs in,
> > but would simply return KVs as they are encountered.
> >
> > RegionScanner and StoreScanner would retain the KeyValueHeap to mergesort
> > their sub scanners.
> > The overall complexity would remain the same, but the parts would be
> > isolated better.
> >
> > Something like this:
> > RegionScanner -> HeapIt ->-> VersionCounterIt -> FilterIt -> TimeRangeIt ->
> > DeletesIt -> StoreScanner -> HeapIt ->-> StoreFileIt -> ...
> > (Should rename some of the things)
> >
> > All iterators would issue (re)seeks when possible.
> >
> > The iterator interface would be something like <init>, KV next(),
> close(),
> > seek(KV), reseek(KV). Many Iterators would be stateful.
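
A minimal Java sketch of such an iterator chain, under stated assumptions: `Kv`, `KvIt`, `ListIt` and the `onOmitted` hook are hypothetical names invented for illustration (only `FilterIt` and `VersionCounterIt` appear in the mail, and only as names), seek(KV)/reseek(KV) are left out for brevity, and the input is assumed to be sorted by row and column with the newest timestamp first. The hook shows one way a version-counting iterator could report the KVs it drops; it is not an existing HBase API.

```java
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;

/** Hypothetical stand-in for KeyValue: just enough fields for the sketch. */
final class Kv {
    final String row;
    final String column;
    final long timestamp;

    Kv(String row, String column, long timestamp) {
        this.row = row;
        this.column = column;
        this.timestamp = timestamp;
    }

    @Override
    public String toString() {
        return row + "/" + column + "@" + timestamp;
    }
}

/** Reduced form of the proposed interface; seek(KV) and reseek(KV) are omitted. */
interface KvIt {
    Kv next();    // returns null once the iterator is exhausted
    void close();
}

/** Leaf of the tree: hands out KVs from an already sorted list. */
final class ListIt implements KvIt {
    private final Iterator<Kv> source;

    ListIt(List<Kv> sorted) {
        this.source = sorted.iterator();
    }

    @Override
    public Kv next() {
        return source.hasNext() ? source.next() : null;
    }

    @Override
    public void close() { }
}

/** Passes through only the KVs accepted by the predicate. */
final class FilterIt implements KvIt {
    private final KvIt child;
    private final Predicate<Kv> keep;

    FilterIt(KvIt child, Predicate<Kv> keep) {
        this.child = child;
        this.keep = keep;
    }

    @Override
    public Kv next() {
        for (Kv kv = child.next(); kv != null; kv = child.next()) {
            if (keep.test(kv)) {
                return kv;
            }
        }
        return null;
    }

    @Override
    public void close() {
        child.close();
    }
}

/** Stateful: keeps the first maxVersions KVs per (row, column), reports the rest to a hook. */
final class VersionCounterIt implements KvIt {
    private final KvIt child;
    private final int maxVersions;
    private final Consumer<Kv> onOmitted;   // hypothetical hook for dropped versions
    private String currentRow;
    private String currentColumn;
    private int seen;

    VersionCounterIt(KvIt child, int maxVersions, Consumer<Kv> onOmitted) {
        this.child = child;
        this.maxVersions = maxVersions;
        this.onOmitted = onOmitted;
    }

    @Override
    public Kv next() {
        for (Kv kv = child.next(); kv != null; kv = child.next()) {
            boolean sameColumn = kv.row.equals(currentRow) && kv.column.equals(currentColumn);
            if (!sameColumn) {
                currentRow = kv.row;
                currentColumn = kv.column;
                seen = 0;
            }
            seen++;
            if (seen <= maxVersions) {
                return kv;
            }
            onOmitted.accept(kv);   // excess version: report it instead of returning it
        }
        return null;
    }

    @Override
    public void close() {
        child.close();
    }
}
```

A chain would then be assembled inside out, for example `new VersionCounterIt(new FilterIt(new ListIt(kvs), filter), 2, omitted::add)`, with each stage isolated from the others.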
> >
> > Would probably be a major refactoring, and the devil would be in the
> > details. We would need to be careful to keep the current performance
> > (currently ScanQueryMatcher is efficient, because it does a bunch of steps
> > at the same time).
> >
> > Just "blue skying".
> >
> > -- Lars
> >
> >
> > ________________________________
> > From: Matt Corgan <mcorgan@hotpads.com>
> > To: dev@hbase.apache.org; lars hofhansl <lhofhansl@yahoo.com>
> > Sent: Monday, September 3, 2012 11:24 AM
> > Subject: Re: RPC KeyValue encoding
> >
> > >
> > > For CellAppender, is compile() equivalent to flushing ?
> >
> > Yes.  I'll rename CellAppender to CellOutputStream.  The concept is
> very
> > similar to a GzipOutputStream where you write bytes to it and
> periodically
> > call flush() which spits out a compressed byte[] behind the scenes.
> The
> > server would write Cells to a CellOutputStream, flush them to a
> byte[] and
> > send the byte[] to the client.  There could be a default encoding,
> and the
> > client could send a flag to override the default.
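
The write-then-flush idea can be sketched in Java as follows. This is an illustration only, not the CellOutputStream from the reviewboard patch: the class name `SketchCellOutputStream`, the `write(...)` signature, and the block layout (2-byte encoding id, 4-byte cell count, then length-prefixed fields) are all assumptions made for the example.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: buffers cells and emits one byte[] block per flush(). */
final class SketchCellOutputStream {
    private final short encodingId;   // assumed id telling the reader how to decode the block
    private final ByteArrayOutputStream pending = new ByteArrayOutputStream();
    private int cellCount;

    SketchCellOutputStream(short encodingId) {
        this.encodingId = encodingId;
    }

    /** Appends one cell using a naive, uncompressed layout. */
    void write(String row, String family, String qualifier, long timestamp, byte[] value) {
        writeField(row.getBytes(StandardCharsets.UTF_8));
        writeField(family.getBytes(StandardCharsets.UTF_8));
        writeField(qualifier.getBytes(StandardCharsets.UTF_8));
        pending.write(ByteBuffer.allocate(Long.BYTES).putLong(timestamp).array(), 0, Long.BYTES);
        writeField(value);
        cellCount++;
    }

    /** Writes a 4-byte length followed by the field bytes. */
    private void writeField(byte[] field) {
        pending.write(ByteBuffer.allocate(Integer.BYTES).putInt(field.length).array(), 0, Integer.BYTES);
        pending.write(field, 0, field.length);
    }

    /** Returns header plus all cells written since the last flush, then resets. */
    byte[] flush() {
        byte[] cells = pending.toByteArray();
        ByteBuffer block = ByteBuffer.allocate(Short.BYTES + Integer.BYTES + cells.length);
        block.putShort(encodingId).putInt(cellCount).put(cells);
        pending.reset();
        cellCount = 0;
        return block.array();
    }
}
```

A server-side caller would write each result cell and send the array returned by `flush()`; a client that wanted a different encoding would, under this assumption, ask for a different `encodingId`.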
> >
> > Greg, you mention omitting fields that are repeated from one KeyValue
> to
> > the next.  I think this is basically what the existing
> DataBlockEncoders
> > are doing for KeyValues stored on disk (see PrefixKeyDeltaEncoder for
> > example).  I'm thinking we can use the same encoders for encoding on
> the
> > wire.  Different implementations will have different performance
> > characteristics where some may be better for disk and others for RPC,
> but
> > the overall intent is the same.
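
To illustrate the kind of saving meant here, a minimal Java sketch of prefix-delta encoding over sorted keys. It is a simplified illustration and does not reproduce the actual format of PrefixKeyDeltaEncoder; the class `PrefixDeltaSketch`, the one-byte length fields, and the 255-byte key limit are assumptions chosen to keep the example short.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: each key stores only the bytes that differ from the previous key. */
final class PrefixDeltaSketch {

    static int commonPrefixLength(byte[] a, byte[] b) {
        int max = Math.min(a.length, b.length);
        int i = 0;
        while (i < max && a[i] == b[i]) {
            i++;
        }
        return i;
    }

    /** Layout per key: sharedLength (1 byte), suffixLength (1 byte), suffix bytes. */
    static byte[] encode(List<String> sortedKeys) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] previous = new byte[0];
        for (String key : sortedKeys) {
            byte[] current = key.getBytes(StandardCharsets.UTF_8);
            if (current.length > 255) {
                throw new IllegalArgumentException("sketch supports keys up to 255 bytes");
            }
            int shared = commonPrefixLength(previous, current);
            int suffix = current.length - shared;
            out.write(shared);
            out.write(suffix);
            out.write(current, shared, suffix);
            previous = current;
        }
        return out.toByteArray();
    }

    /** Rebuilds every key from the previous key's prefix plus the stored suffix. */
    static List<String> decode(byte[] block) {
        List<String> keys = new ArrayList<>();
        byte[] previous = new byte[0];
        int pos = 0;
        while (pos < block.length) {
            int shared = block[pos++] & 0xFF;
            int suffix = block[pos++] & 0xFF;
            byte[] current = new byte[shared + suffix];
            System.arraycopy(previous, 0, current, 0, shared);
            System.arraycopy(block, pos, current, shared, suffix);
            pos += suffix;
            keys.add(new String(current, StandardCharsets.UTF_8));
            previous = current;
        }
        return keys;
    }
}
```

The same encoder output could be written to disk or put on the wire; only the trade-off between encoding cost and size would differ between the two uses.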
> >
> > Matt
> >
> > On Sun, Sep 2, 2012 at 2:56 PM, lars hofhansl <lhofhansl@yahoo.com>
> wrote:
> >
> > > Your "coarse grain" option is what I had in mind in my email. I love the
> > > option of not needing to get it all right in 0.96.
> > >
> > > You, Matt, and I could talk and work out the details and get it done.
> > >
> > >
> > > -- Lars
> > >
> > >
> > > ----- Original Message -----
> > > From: Gregory Chanan <gchanan@cloudera.com>
> > > To: dev@hbase.apache.org
> > > Cc:
> > > Sent: Sunday, September 2, 2012 12:52 PM
> > > Subject: Re: RPC KeyValue encoding
> > >
> > > Lars,
> > >
> > > If we make the KeyValue wire format flexible enough I think we'll
> be able
> > > to tackle the KV as an interface work later.  Just throwing out
> some
> > ideas
> > > here:
> > >
> > > We could have a byte at the front of each KV serialization format
> that
> > > gives various options in each bit e.g.
> > > Omits Rows / Omits Family / Omits Qualifier / Omits Timestamp /
> Omits
> > Value
> > > / plus some extra bytes for compression options and extensions.
> Then we
> > > just need to define where the KV gets its field if it is omitted,
> e.g.
> > from
> > > the previous KV in the RPC that had that field filled in.  We sort
> of
> > have
> > > this with the optional fields already, although I don't recall
> exactly
> > how
> > > protobuf handles those (we'd probably have to do some small
> > restructuring);
> > > what's new is defining what it means when a field is omitted.
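
A minimal Java sketch of the per-KV options byte, assuming a field that is marked as omitted is taken from the previous KV in the same message. The bit assignments, the class `OmitFlagsCodec`, and the reduced cell shape (row, family, qualifier, value as strings, no timestamp) are hypothetical and chosen only to keep the example small; this is not a protobuf layout.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: one flag byte per cell says which fields were left out. */
final class OmitFlagsCodec {
    // A cell is a String[4]: {row, family, qualifier, value}. Bit i set = field i omitted.
    static final int FIELD_COUNT = 4;
    static final int VALUE_INDEX = 3;   // in this sketch the value is always written

    static byte[] encode(List<String[]> cells) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        String[] previous = null;
        for (String[] cell : cells) {
            int flags = 0;
            for (int i = 0; i < VALUE_INDEX; i++) {
                if (previous != null && previous[i].equals(cell[i])) {
                    flags |= 1 << i;
                }
            }
            out.write(flags);
            for (int i = 0; i < FIELD_COUNT; i++) {
                if ((flags & (1 << i)) == 0) {
                    writeField(out, cell[i]);
                }
            }
            previous = cell;
        }
        return out.toByteArray();
    }

    static List<String[]> decode(byte[] bytes) {
        ByteBuffer in = ByteBuffer.wrap(bytes);
        List<String[]> cells = new ArrayList<>();
        String[] previous = null;
        while (in.hasRemaining()) {
            int flags = in.get();
            String[] cell = new String[FIELD_COUNT];
            for (int i = 0; i < FIELD_COUNT; i++) {
                if ((flags & (1 << i)) != 0) {
                    if (previous == null) {
                        throw new IllegalArgumentException("first cell cannot omit a field");
                    }
                    cell[i] = previous[i];   // omitted: inherit from the previous cell
                } else {
                    cell[i] = readField(in);
                }
            }
            cells.add(cell);
            previous = cell;
        }
        return cells;
    }

    /** Writes a 2-byte big-endian length followed by UTF-8 bytes (fields up to 65535 bytes). */
    private static void writeField(ByteArrayOutputStream out, String field) {
        byte[] raw = field.getBytes(StandardCharsets.UTF_8);
        out.write(raw.length >>> 8);
        out.write(raw.length);
        out.write(raw, 0, raw.length);
    }

    private static String readField(ByteBuffer in) {
        int length = in.getShort() & 0xFFFF;
        byte[] raw = new byte[length];
        in.get(raw);
        return new String(raw, StandardCharsets.UTF_8);
    }
}
```

Note that a serializer that never sets any flag still produces valid input for `decode`, which matches the point that only the deserialization side has to be right from the start.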
> > >
> > > There's some overhead with the above for small KVs, so you could
> also go
> > > coarser grain, e.g. the Get request/response could have a similar
> options
> > > byte like:
> > > All Share Same Row / All Share Same Family / ... / and one of the
> bits
> > > could turn on the finer grain options above (per KeyValue).
> > >
> > > The advantage of this is that all we'd have to get right in 0.96.0
> is the
> > > deserialization.  The serialization could just send without any of
> the
> > > options turned on.  And we could experiment later with each
> specific RPC
> > > call what the best options to use are, as well as what storage to
> > actually
> > > use client/server side, which you discuss.
> > >
> > > Greg
> > >
> > > On Sun, Sep 2, 2012 at 9:04 AM, Ted Yu <yuzhihong@gmail.com> wrote:
> > >
> > > > Thanks for the update, Matt.
> > > >
> > > > w.r.t. Cell class, since it is so fundamental, should it reside
> in org.
> > > > apache.hadoop.hbase namespace as KeyValue class does ?
> > > > For CellAppender, is compile() equivalent to flushing ?
> > > >
> > > > Looking forward to your publishing on the reviewboard.
> > > >
> > > > On Sat, Sep 1, 2012 at 11:29 PM, Matt Corgan
> <mcorgan@hotpads.com>
> > > wrote:
> > > >
> > > > > > RPC encoding would be really nice since there is sometimes significant
> > > > > > wire traffic that could be reduced many-fold.  I have a particular table
> > > > > > that I scan and stream to a gzipped output file on S3, and I've noticed
> > > > > > that while the app server's network input is 100Mbps, the gzipped output
> > > > > > can be 2Mbps!
> > > > >
> > > > > > Finishing the PrefixTree has been slow because I've saved a couple of
> > > > > > tricky issues to the end and am light on time.  I'll try to put it on
> > > > > > reviewboard Monday despite a known bug.  It is built with some of the
> > > > > > ideas you mention in mind, Lars.  Take a look at the Cell
> > > > > > <https://github.com/hotpads/hbase/blob/prefix-tree/hbase-common/src/main/java/org/apache/hadoop/hbase/cell/Cell.java>
> > > > > > and CellAppender
> > > > > > <https://github.com/hotpads/hbase/blob/prefix-tree/hbase-common/src/main/java/org/apache/hadoop/hbase/cell/appender/CellAppender.java>
> > > > > > classes and their comments.  The idea with the CellAppender is to stream
> > > > > > cells into it and periodically compile()/flush() into a byte[] which can
> > > > > > be saved to an HFile or (eventually) sent over the wire.  For example, in
> > > > > > HRegion.get(..), the CellAppender would replace the
> > > > > > "ArrayList<KeyValue> results" collection.
> > > > >
> > > > > > After introducing the Cell interface, the trick to extending the encoded
> > > > > > cells up the HBase stack will be to reduce the reliance on stand-alone
> > > > > > KeyValues.  We'll want things like the Filters and KeyValueHeap to be
> > > > > > able to operate on reused Cells without materializing them into full
> > > > > > KeyValues.  That means that something like StoreFileScanner.peek() will
> > > > > > not work because the scanner cannot maintain the state of the current
> > > > > > and next Cells at the same time.  See CellCollator
> > > > > > <https://github.com/hotpads/hbase/blob/prefix-tree/hbase-common/src/main/java/org/apache/hadoop/hbase/cell/collator/CellCollator.java>
> > > > > > for a possible replacement for KeyValueHeap.  The good news is that this
> > > > > > can be done in stages without major disruptions to the code base.
> > > > >
> > > > > > Looking at PtDataBlockEncoderSeeker
> > > > > > <https://github.com/hotpads/hbase/blob/prefix-tree/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/PtDataBlockEncoderSeeker.java>,
> > > > > this would mean transitioning from the getKeyValue() method
> that
> > > creates
> > > > > and fills a new KeyValue every time it's called to the
> > getCurrentCell()
> > > > > method which returns a reference to a Cell buffer that is
> reused as
> > the
> > > > > scanner proceeds.  Modifying a reusable Cell buffer rather than
> > rapidly
> > > > > shooting off new KeyValues should drastically reduce byte[]
> copying
> > and
> > > > > garbage churn.
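
The difference between the two access styles can be sketched in Java as below. Only the method names getKeyValue() and getCurrentCell() come from the mail; `ReusableCell`, `ReusingScanner`, and the "decode by copying" step are hypothetical simplifications and do not reflect the real PtDataBlockEncoderSeeker.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical reusable buffer that is overwritten each time the scanner advances. */
final class ReusableCell {
    private byte[] buffer = new byte[8];
    private int length;

    /** Overwrites the buffer in place, growing it only when the new cell does not fit. */
    void load(byte[] encoded) {
        if (encoded.length > buffer.length) {
            buffer = new byte[encoded.length];
        }
        System.arraycopy(encoded, 0, buffer, 0, encoded.length);
        length = encoded.length;
    }

    /** Allocates a stand-alone copy, comparable to materializing a KeyValue. */
    byte[] copy() {
        return Arrays.copyOf(buffer, length);
    }

    @Override
    public String toString() {
        return new String(buffer, 0, length, StandardCharsets.UTF_8);
    }
}

/** Hypothetical scanner exposing both access styles. */
final class ReusingScanner {
    private final byte[][] encodedCells;
    private final ReusableCell current = new ReusableCell();
    private int position = -1;

    ReusingScanner(String... cells) {
        encodedCells = new byte[cells.length][];
        for (int i = 0; i < cells.length; i++) {
            encodedCells[i] = cells[i].getBytes(StandardCharsets.UTF_8);
        }
    }

    /** Moves to the next cell, overwriting the shared buffer; false when exhausted. */
    boolean advance() {
        if (position + 1 >= encodedCells.length) {
            return false;
        }
        current.load(encodedCells[++position]);
        return true;
    }

    /** No allocation: always the same object, valid only until the next advance(). */
    ReusableCell getCurrentCell() {
        return current;
    }

    /** One allocation per call: safe to keep after the scanner moves on. */
    byte[] getKeyValue() {
        return current.copy();
    }
}
```

Because getCurrentCell() always returns the same instance, a caller cannot hold the current and the next cell at once, which is the reason a peek()-style method does not fit this model.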
> > > > >
> > > > > I wish I understood the protocol buffers more so I could
> comment
> > > > > specifically on that.  The result sent to the client can
> possibly be
> > a
> > > > > plain old encoded data block (byte[]/ByteBuffer) with a similar
> > header
> > > to
> > > > > the one encoded blocks have on disk (2 byte DataBlockEncoding
> id).
> > The
> > > > > client then uses the same
> > > > > > CellScanner
> > > > > > <https://github.com/hotpads/hbase/blob/prefix-tree/hbase-common/src/main/java/org/apache/hadoop/hbase/cell/scanner/CellScanner.java>
> > > > > > that
> > > > > the server uses when reading blocks from the block cache.  A
> nice
> > > > > side-effect of sending the client an encoded byte[] is that the
> java
> > > > client
> > > > > can run the same decoder that the server uses which should be
> > > > tremendously
> > > > > faster and more memory efficient than the current method of
> building
> > a
> > > > > pointer-heavy result map.  I had envisioned this kind of thing
> being
> > > > baked
> > > > > into ClientV2, but i guess it could be wrangled into the
> current one
> > if
> > > > > someone wanted.
> > > > >
> > > > > food for thought... cheers,
> > > > > Matt
> > > > >
> > > > > ps - i'm travelling tomorrow so may be silent on email
> > > > >
> > > > > On Sat, Sep 1, 2012 at 9:03 PM, lars hofhansl
> <lhofhansl@yahoo.com>
> > > > wrote:
> > > > >
> > > > > > > In 0.96 we are changing the wire protocol to use protobufs.
> > > > > >
> > > > > > While we're at it, I am wondering whether we can optimize a
> few
> > > things:
> > > > > >
> > > > > >
> > > > > > > 1. A Put or Delete can send many KeyValues, all of which have the same
> > > > > > > row key and many will likely have the same column family.
> > > > > > > 2. Likewise a Scan result or Get is for a single row. Each KV will
> > > > > > > again have the same row key and many will have the same column family.
> > > > > > > 3. The client and server do not need to share the same KV
> > > > > > > implementation as long as they are (de)serialized the same. KVs on the
> > > > > > > server will be backed by a shared larger byte[] (the block read from
> > > > > > > disk), the KVs in the memstore will probably have the same
> > > > > > > implementation (to use slab, but maybe even here it would be beneficial
> > > > > > > to store the row key and CF separately and share between KVs where
> > > > > > > possible). Client KVs on the other hand could share a row key and/or
> > > > > > > column family.
> > > > > >
> > > > > > This would require a KeyValue interface and two different
> > > > > implementations;
> > > > > > one backed by a byte[] another that stores the pieces
> separately.
> > > Once
> > > > > that
> > > > > > is done one could even envision KVs backed by a byte buffer.
> > > > > >
> > > > > > Both (de)serialize the same, so when the server serializes
> the KVs
> > it
> > > > > > would send the row key first, then the CF, then column, TS,
> finally
> > > > > > followed by the value. The client could deserialize this and
> > directly
> > > > > reuse
> > > > > > the shared part in its KV implementation.
> > > > > > > That has the potential to significantly cut down client/server network
> > > > > > > IO and save memory on the client, especially with wide columns.
> > > > > >
> > > > > > Turning KV into an interface is a major undertaking. Would it
> be
> > > worth
> > > > > the
> > > > > > effort? Or maybe the RPC should just be compressed?
> > > > > >
> > > > > >
> > > > > > > We'd have to do that before 0.96.0 (I think), because even protobuf
> > > > > > > would not provide enough flexibility to make such a change later -
> > > > > > > which incidentally leads to another discussion about whether client
> > > > > > > and server should do an initial handshake to detect each other's
> > > > > > > version, but that is a different story.
> > > > > >
> > > > > >
> > > > > > -- Lars
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > >
> >

