Mailing-List: contact dev-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@hbase.apache.org
Received-SPF: pass (athena.apache.org: domain of anoop.hbase@gmail.com
 designates 209.85.217.177 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CAAT7MkqOmYn=NTrbb2j0FFm=4OeGZPfxcVZgfVvUAPviR1i-=w@mail.gmail.com>
References: 
 <CANZa=Gudhr73nLuLU+AQkwmiNUiqmsEm-q4vjobwLKY6VwBzrQ@mail.gmail.com>
	<760781993.5880812.1417848935871.JavaMail.yahoo@jws10662.mail.bf1.yahoo.com>
	<CAAT7Mkoh8Tf+SXTa4epx1ONFmDG+o3WdBM4kyDS12noL18cSjg@mail.gmail.com>
	<CADcMMgEbPg0fXOuKkUXtH0hytFAFNRm9vbpHqBXVPXbAszaS6w@mail.gmail.com>
	<CAAT7MkqOmYn=NTrbb2j0FFm=4OeGZPfxcVZgfVvUAPviR1i-=w@mail.gmail.com>
Date: Mon, 8 Dec 2014 18:29:09 +0530
Message-ID: 
 <CAOtJ30pfhNWUH_TmGxcgtDe-czwbnfOuYVTx1f2b3=LJ-TZuSA@mail.gmail.com>
Subject: Re: ByteBuffer Backed Cell - New APIs (HBASE-12358)
From: Anoop John <anoop.hbase@gmail.com>
To: "dev@hbase.apache.org" <dev@hbase.apache.org>
Content-Type: multipart/alternative; boundary=001a11c370840584dc0509b3fe59

--001a11c370840584dc0509b3fe59
Content-Type: text/plain; charset=UTF-8

>>Is there discussion on why ByteRange and netty ByteBuf were by-passed
It is because of 2 reasons
1. We want to read to our off-heap space right from HDFS layer.. (Using API
taking the out buffer). Well this works with NIO buffer
2. At our RPC end at least now, we deal with java NIO channel. We need pass
BB to it. If we go with say a BR based API from Cell, how we put data to
this channel?  In case of off-heap our BR impl will be DBB backed.(at least
initial impl) So we may have to read byte by byte and write byte by byte to
channel. Or else need on heap temp byte[] copy. Else we may have to add API
in BR, to which pass a Channel and within the BR impl write to Channel. (In
this overhead is not there) But this may not look clean. What if when some
one want to impl the RPC with say Netty?  I checked Netty. It seems it is
very easy to convert a BB to netty's ByteBuf. Not overhead also.
BB is having overhead in reads. If Unsafe is available, we can avoid BB
overhead. In read path, Comparators as well as seek within block , we can
use BBUtil based access. HBASE-12345 adding Unsafe compares for BB.

BTW am doing changes in our RPC Server and client so as to avoid need to
copy data to on heap (to make CellBlock Buffer) at RpcServer.  With an PoC
code it worked. Yet to measure the perf as we are in middle of these e2e
changes.

>>just go ahead and deprecate getXXXArray in next Cell version, the one
that adds getXXXBuffer (hbase 2.0)
So in hbase 2.0 we deprecate getxxxArray() APIs and add getxxxBuffer()
APIs. What abt the behavior of the old getxxxArray() APIs in that version?
It has to work as in old case always? Or it will work only when
Cell#hasArray() is true and throw UnsupportedOpException otherwise.(?)

-Anoop-


On Mon, Dec 8, 2014 at 3:07 PM, ramkrishna vasudevan <
ramkrishna.s.vasudevan@gmail.com> wrote:

> Lars,
> >>For writing, for getting the key, etc.
> KeyValue.getBuffer if I see in the current trunk code is currently getting
> used only in cases of tests and some unused code path like PrefixTreecode
> decodeKeyValues.
>
> I agree there are places like rowAtOrBeforeFromStoreFile and
> StoreFileScanner.seekToPreviousRow().  But even for seekToPreviousRow() in
> that they are all deprecated and the actual code does not use
> seekToPreviousRow() that does kv.getBuffer(), instead it uses
> seekBefore(Cell).
>
> The main place where we use getBuffer() is KeyValueUtil.oswrite() and there
> too  the usage of getbuffer is avoided in a smart way.
>
> Stack,
> >>Comparator will be doing absolute (unsafe) gets against the passed-in BB
> inside the offset/length bounds if the BB comes back hasArray == false?
>
> Yes. Something like this
>
>      static int getInt(ByteBuffer buf, int offset) {
>        if (buf.isDirect()) {
>          return theUnsafe.getInt(((DirectBuffer) buf).address() + offset);
>        } else {
>          assert buf.hasArray();
>          return theUnsafe.getInt(buf.array(), BYTE_ARRAY_BASE_OFFSET +
> buf.arrayOffset() + offset);
>        }
>      }
>
> >>You are describing an implementation that did refcounting and had
> protection so we didn't free a block from cache that was still being read.
>
> Yes.  RefCounting is a seperate task which we would do before this is used
> E2E.
>
> How will it work if block is compressed/encoded?
> For a compressed/encoded block with DBEs - currently we are not going to
> apply any in the algo level - so it will still deal with byte[].  The only
> thing is the special impl of the Cell in these DBEs would create BB backed
> Cells rather than array backed Cell.
>
>
> >>Just a suggestion wondering if we can keep the implementation detail
> underneath our simple API.
> Sure this can be done.  Atleast deprecate it in 1.0 for the getXXXarray and
> the add new API in hbase2.0.
>
>
> Regards
> Ram
>
>
> On Mon, Dec 8, 2014 at 11:34 AM, Stack <stack@duboce.net> wrote:
>
> > On Sat, Dec 6, 2014 at 9:36 AM, ramkrishna vasudevan <
> > ramkrishna.s.vasudevan@gmail.com> wrote:
> >
> > > Thanks for valuable feedback from every one.
> > > >>Is there discussion on why ByteRange and netty ByteBuf were by-passed
> > (I
> > > have a good notion of why but would be good to have this in a
> > 'design/spec'
> > > doc for completeness since they were contenders)
> > >
> > > True. I would leave to Anoop to comment on that. But the reason is the
> > RPC
> > > layer and the socket layer that accepts only BBs.
> > >
> >
> >
> > Would be good to talk up +/- of BBs (thats what we are getting in the
> back
> > door but downsides include range check, etc.)
> >
> >
> > >>Is this enough?  If offheap, to do compares, won't we also need
> > positional
> > > gets?
> > >
> > > Yes it would be enough. Even if we have offheap buffer we could do
> reads
> > > with Unsafe comparators - for absolute gets on the BB.
> > >
> >
> > You mean, you'd use the return from getXXXBuffer + getXXXOffset +
> > getXXXLength and then pass these three args to comparator?   Comparator
> > will be doing absolute (unsafe) gets against the passed-in BB inside the
> > offset/length bounds if the BB comes back hasArray == false?
> >
> >
> >
> > >
> > > >>a new object -- either
> > > each time a getXXXBuffer is called or a Cell instance will cache these
> > > slices so Cells will grow at least a reference * each Cell part.
> > >
> > > In (1) we won't do slice. So there is no new object at all. It would be
> > the
> > > same buffer that we created from the HFileBlock that is denoting a
> Cell.
> > >
> > >
> > Ok. Yeah. Was thinking we'd slice for length and offset in #1. You are
> > describing an implementation that did refcounting and had protection so
> we
> > didn't free a block from cache that was still being read.
> >
> > How will it work if block is compressed/encoded?
> >
> >
> >
> > > Slicing is as in (2). But if we slice the buffer and create new object
> -
> > > the suggestion of caching it (as a ref) makes sense.
> > > But still (2) would mean that in cases of blockSeeks we end up in
> > creating
> > > lot of small objects and as Lars says it would have a JVM reference to
> > > clean those small objects.
> > >
> > > >>What if we did not add getXXXBuffer to the API?  A Cell could be
> backed
> > > by
> > > an onheap BB without having to reveal this fact in the API?  If offheap
> > BB,
> > > we'd need to copy the Cell Key pieces onheap to support getArray. How
> bad
> > > would that be?
> > >
> > > Stack - if we don't add getXXXBuffer and try to use getArray() then we
> > lose
> > > the whole purpose of supporting read path directly from the BB
> underlying
> > > in block cache. May be for onheap BB it is fine but I think we should
> > have
> > > some generic way to access onheap and offheap BB thus abstracting that
> > > information from the user.
> > >
> > >
> > Yeah. My suggestion was between #1 and #2.  We'd keep current API and
> > behind the scenes we'd do the BB manipulations.  If offheap, we'd just
> copy
> > the byte array components of the key on heap. This would lose a bunch of
> > benefit but would be good to quantify (faster compares though?).  Could
> > also mitigate by sharing byte arrays where row or cf were common, etc.
> > Just a suggestion wondering if we can keep the implementation detail
> > underneath our simple API.
> >
> >
> > >
> > > >>Yeah. This is broke. You can't have getXXXBuffer being offheap
> > > one-minute,
> > > then on-heap the next.
> > > Yes - that is why we wanted to have a proper contract on this.  Jon'
> > > suggestion of adding getBBOFfset and getArrayOffset would solve that
> but
> > > would add more confusion.
> > >
> > >
> >
> > We'll add hasArray and as you suggest it will always return false because
> > at same time getXXXArray will deprecated ... or just don't bother adding
> > hasArray in the Interface (users can do hasArray on the BB returned out
> of
> > getXXXBuffer if they want).... just go ahead and deprecate getXXXArray in
> > next Cell version, the one that adds getXXXBuffer (hbase 2.0)
> >
> > St.Ack
> >
> >
> >
> > > >>We'd purge getXXXLength and getXXXOffset if we go route #2.
> > > Yes we have to.  Or if we don't purge then getXXXLength and
> getXXXOffset
> > > would be used only with getXXXArray and with getXXXBuffer no need of
> > those
> > > APIs.
> > >
> > > >>So returning a reference to an existing BB (on the stack) along with
> > > offset and length is better from that angle.
> > > This is myy opinion too. And I would go with this option.
> > >
> > > >>KeyValue.getBuffer(). We have not even finished that part, yet.
> > > Lars - I think this is done both in read path and write path.
> > >
> > >
> > >
> > > On Sat, Dec 6, 2014 at 12:25 PM, lars hofhansl <larsh@apache.org>
> wrote:
> > >
> > > > A slice make a new object, though. We'll inundate in the JVM with
> lots
> > > and
> > > > lots of small objects that it needs to trace.So returning a reference
> > to
> > > an
> > > > existing BB (on the stack) along with offset and length is better
> from
> > > that
> > > > angle.
> > > > Also, before we do any of that we need to get rid of all occurrences
> of
> > > > KeyValue.getBuffer(). We have not even finished that part, yet.
> > > > -- Lars
> > > >       From: Nick Dimiduk <ndimiduk@gmail.com>
> > > >  To: "dev@hbase.apache.org" <dev@hbase.apache.org>
> > > >  Sent: Friday, December 5, 2014 8:33 AM
> > > >  Subject: Re: ByteBuffer Backed Cell - New APIs (HBASE-12358)
> > > >
> > > > From the perspective of API going forward, I prefer (2). BB already
> > > allows
> > > > for encapsulating a sequence if bytes in a single object, so why
> > > > re-implement that? I suspect you'll want to return slices to the
> caller
> > > > anyway, because you don't want them messing with the backing store's
> > > > position,
> > > > limit, etc state.
> > > >
> > > > On Friday, December 5, 2014, Jonathan Hsieh <jon@cloudera.com>
> wrote:
> > > >
> > > > > I think of it from a compatibility point of view - each method has
> > its
> > > > own
> > > > > semantics and the semantics should not morph between versions.
> > > > >
> > > > > With the repurposing the old apis approach, someone who ports to
> the
> > > new
> > > > > version without porting to the api will likely run into some nasty
> > > > > surprises.  Worse yet, they'd need to modify their code and
> > understand
> > > > the
> > > > > details of what the semantics changes are.
> > > > >
> > > > > Having completely new apis makes it easier to see when deprecated
> > > > > behavior/methods are removed and doesn't require code inspection to
> > > > verify.
> > > > > (can rely on compiler warnings)  Also, old code won't be broken
> (just
> > > > less
> > > > > efficient).
> > > > >
> > > > > Jon.
> > > > >
> > > > >
> > > > >
> > > > > On Thu, Dec 4, 2014 at 10:41 PM, ramkrishna vasudevan <
> > > > > ramkrishna.s.vasudevan@gmail.com <javascript:;>> wrote:
> > > > >
> > > > > > We thought about this suggestion too but this comes with a lot of
> > > more
> > > > > > confusion (?) - what do you think?
> > > > > > Adding new APIs but similar in behaviour could lead to lot more
> > > > > confusion?
> > > > > > Internally it is more easy.
> > > > > >
> > > > > >
> > > > > > On Fri, Dec 5, 2014 at 11:10 AM, Jonathan Hsieh <
> jon@cloudera.com
> > > > > <javascript:;>> wrote:
> > > > > >
> > > > > > > Ram,
> > > > > > >
> > > > > > > Can we essentially do both by creating the new getXxBuffer, and
> > > then
> > > > > also
> > > > > > > creating new offset and len apis --
> getXxx(Bb|Buffer|Buf|B)Offset
> > > and
> > > > > > > getXxx(Bb|Buffer|Buf|B)Length?
> > > > > > >
> > > > > > > The old getXxxArray could use the old getXxxOffset and
> > getXxxLength
> > > > > > calls.
> > > > > > > Also we'd deprecate all of these and provide an sliced and
> diced
> > > > > version
> > > > > > > that would have offset always 0.
> > > > > > >
> > > > > > > This way we wouldn't conflate the Bb and byte[] offsets and
> > > lengths.
> > > > > > Also
> > > > > > > we could behind the scenes convert Bbs to byte[] arrays and
> > convert
> > > > > > > byte[]'s in to Bbs while maintaining the same interface.
> > > > > > >
> > > > > > > Jon.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Thu, Dec 4, 2014 at 7:24 PM, ramkrishna vasudevan <
> > > > > > > ramkrishna.s.vasudevan@gmail.com <javascript:;>> wrote:
> > > > > > >
> > > > > > > > Hi Devs
> > > > > > > >
> > > > > > > > This write up is to provide a brief idea  on why we need a BB
> > > > backed
> > > > > > cell
> > > > > > > > and what are the items that we need to take care before
> > > introducing
> > > > > new
> > > > > > > > APIs in Cell that are BB backed.
> > > > > > > >
> > > > > > > > Pls refer to
> https://issues.apache.org/jira/browse/HBASE-12358
> > > > also
> > > > > > and
> > > > > > > > its
> > > > > > > > parent JIRA
> https://issues.apache.org/jira/browse/HBASE-11425
> > > for
> > > > > the
> > > > > > > > history.
> > > > > > > >
> > > > > > > > Coming back to the discussion on new APIs, this discussion is
> > > based
> > > > > on
> > > > > > > > supporting BB in the read path (write path is not targeted
> now)
> > > so
> > > > > that
> > > > > > > we
> > > > > > > > could work with offheap BBs also. This would avoid copying of
> > > data
> > > > > from
> > > > > > > > BlockCache to the read path ByteBuffer.
> > > > > > > >
> > > > > > > > Assume we will be working with BBs in the read path, We will
> > need
> > > > to
> > > > > > > >  introduce *getXXXBuffer() *APIs and also *hasArray()* in
> Cell
> > > > itself
> > > > > > > > directly.
> > > > > > > > If we try to extend the cell or create a new Cell then
> > > *everywhere
> > > > we
> > > > > > > need
> > > > > > > > to do instanceOf check or do type conversion *and that is why
> > > > adding
> > > > > > new
> > > > > > > > APIs to Cell interface itself makes sense.
> > > > > > > >
> > > > > > > > Plan is to use this *getXXXBuffer()* API through out the read
> > > path
> > > > > > > *instead
> > > > > > > > of getXXXArray()*.
> > > > > > > >
> > > > > > > > Now there are two ways to use it
> > > > > > > >
> > > > > > > > 1) Use getXXXBuffer() along with getXXXOffset(),
> getXXXLength()
> > > > like
> > > > > > how
> > > > > > > we
> > > > > > > > use now for getXXXArray() APIs with the offset and length.
> > Doing
> > > so
> > > > > > would
> > > > > > > > ensure that every where in the filters and CP one has to just
> > > > replace
> > > > > > the
> > > > > > > > getXXXArray() with getXXXBuffer() and continue to use
> > > > getXXXOffset()
> > > > > > and
> > > > > > > > getXXXLength(). We would do some wrapping of the byte[] with
> a
> > BB
> > > > > > incase
> > > > > > > of
> > > > > > > > KeyValue type of cells so that getXXXBuffer along with offset
> > and
> > > > > > length
> > > > > > > > holds true everywhere. Note that here if hasArray is true(for
> > KV
> > > > > case)
> > > > > > > then
> > > > > > > > getXXXArray() would also work.
> > > > > > > >
> > > > > > > > 2)The other way of using this is that use only getXXXBuffer()
> > API
> > > > and
> > > > > > > > ensure that the BB is always duplicated/sliced and only the
> > > portion
> > > > > of
> > > > > > > the
> > > > > > > > total BB is returned which represents the individual
> component
> > of
> > > > the
> > > > > > > Cell.
> > > > > > > > In this case there is no use of getXXXOffset() (as it is
> going
> > to
> > > > be
> > > > > 0)
> > > > > > > and
> > > > > > > > getXXXLength() is any way going to be the sliced BB's limit.
> > > > > > > >
> > > > > > > > But in the 2nd approach we may end up in creating lot of
> small
> > > > > objects
> > > > > > > even
> > > > > > > > while doing comparison.
> > > > > > > >
> > > > > > > > Now the next problem that comes is what to do with the
> > > > getXXXArray()
> > > > > > > APIs.
> > > > > > > > If one sees hasArray() as false (a DBB backed Cell) and uses
> > the
> > > > > > > > getXXXArray() API along with offset and length - what should
> we
> > > do.
> > > > > > > Should
> > > > > > > > we create a byte[] from the DBB and return it? Then in that
> > case
> > > > what
> > > > > > > would
> > > > > > > > should the *getXXXOffset() return for a getXXXBuffer or
> > > > > getXXXArray()?*
> > > > > > > >
> > > > > > > > If we go with the 2nd approach then getXXXBuffer() should be
> > > > clearly
> > > > > > > > documented saying that it has to be used without
> getXXXOffset()
> > > and
> > > > > > > > getXXXLength() and use getXXXOffset() and getXXXLength() only
> > > with
> > > > > > > > getXXXArray().
> > > > > > > >
> > > > > > > > Now if a Cell is backed by on heap BB then we could
> definitely
> > > > return
> > > > > > > > getXXXArray() also - but what to return in the getXXXOffset()
> > > would
> > > > > be
> > > > > > > > determined by what approach to use for getXXXBuffer(). (based
> > on
> > > > (1)
> > > > > > and
> > > > > > > > (2)).
> > > > > > > >
> > > > > > > > We wanted to open up this topic now so that to get some
> > feedback
> > > on
> > > > > > what
> > > > > > > > could be an option here. Since it is an user facing Interface
> > we
> > > > need
> > > > > > to
> > > > > > > be
> > > > > > > > careful with this.
> > > > > > > >
> > > > > > > > I would suggest that whenever a Cell is *BB backed*(Onheap or
> > > > > offheap)
> > > > > > > > always *hasArray() would be false* in that Cell impl.
> > > > > > > >
> > > > > > > > Every where we would use getXXXBuffer() along with
> > getXXXOffest()
> > > > and
> > > > > > > > getXXXLength(). Even in case of KV we could wrap the byte[]
> > with
> > > BB
> > > > > so
> > > > > > > that
> > > > > > > > we have uniformity through the read code and we don't have
> too
> > > many
> > > > > > 'if'
> > > > > > > > else conditions.
> > > > > > > >
> > > > > > > > When ever *hasArray() is false* - using getXXXArray() API
> would
> > > > throw
> > > > > > > > *UnSupportedOperation
> > > > > > > > Exception*.
> > > > > > > >
> > > > > > > > As said if we want *getXXXArray()* to be supported as per the
> > > > > existing
> > > > > > > way
> > > > > > > > then getXXXBuffer() and getXXXOffset(), getXXXLength() should
> > be
> > > > > > clearly
> > > > > > > > documented.
> > > > > > > >
> > > > > > > > Thoughts!!!
> > > > > > > >
> > > > > > > > Regards
> > > > > > > > Ram & Anoop
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > // Jonathan Hsieh (shay)
> > > > > > > // HBase Tech Lead, Software Engineer, Cloudera
> > > > > > > // jon@cloudera.com <javascript:;> // @jmhsieh
> > > >
> > > >
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > // Jonathan Hsieh (shay)
> > > > > // HBase Tech Lead, Software Engineer, Cloudera
> > > > > // jon@cloudera.com <javascript:;> // @jmhsieh
> > > > >
> > > >
> > > >
> > > >
> > > >
> > >
> >
>

--001a11c370840584dc0509b3fe59--