hbase-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From ramkrishna vasudevan <ramkrishna.s.vasude...@gmail.com>
Subject ByteBuffer Backed Cell - New APIs (HBASE-12358)
Date Fri, 05 Dec 2014 03:24:07 GMT
Hi Devs

This write up is to provide a brief idea  on why we need a BB backed cell
and what are the items that we need to take care before introducing new
APIs in Cell that are BB backed.

Pls refer to https://issues.apache.org/jira/browse/HBASE-12358 also and its
parent JIRA https://issues.apache.org/jira/browse/HBASE-11425 for the
history.

Coming back to the discussion on new APIs, this discussion is based on
supporting BB in the read path (write path is not targeted now) so that we
could work with offheap BBs also. This would avoid copying of data from
BlockCache to the read path ByteBuffer.

Assume we will be working with BBs in the read path, We will need to
 introduce *getXXXBuffer() *APIs and also *hasArray()* in Cell itself
directly.
If we try to extend the cell or create a new Cell then *everywhere we need
to do instanceOf check or do type conversion *and that is why adding new
APIs to Cell interface itself makes sense.

Plan is to use this *getXXXBuffer()* API through out the read path *instead
of getXXXArray()*.

Now there are two ways to use it

1) Use getXXXBuffer() along with getXXXOffset(), getXXXLength() like how we
use now for getXXXArray() APIs with the offset and length. Doing so would
ensure that every where in the filters and CP one has to just replace the
getXXXArray() with getXXXBuffer() and continue to use getXXXOffset() and
getXXXLength(). We would do some wrapping of the byte[] with a BB incase of
KeyValue type of cells so that getXXXBuffer along with offset and length
holds true everywhere. Note that here if hasArray is true(for KV case) then
getXXXArray() would also work.

2)The other way of using this is that use only getXXXBuffer() API and
ensure that the BB is always duplicated/sliced and only the portion of the
total BB is returned which represents the individual component of the Cell.
In this case there is no use of getXXXOffset() (as it is going to be 0) and
getXXXLength() is any way going to be the sliced BB's limit.

But in the 2nd approach we may end up in creating lot of small objects even
while doing comparison.

Now the next problem that comes is what to do with the getXXXArray() APIs.
If one sees hasArray() as false (a DBB backed Cell) and uses the
getXXXArray() API along with offset and length - what should we do. Should
we create a byte[] from the DBB and return it? Then in that case what would
should the *getXXXOffset() return for a getXXXBuffer or getXXXArray()?*

If we go with the 2nd approach then getXXXBuffer() should be clearly
documented saying that it has to be used without getXXXOffset() and
getXXXLength() and use getXXXOffset() and getXXXLength() only with
getXXXArray().

Now if a Cell is backed by on heap BB then we could definitely return
getXXXArray() also - but what to return in the getXXXOffset() would be
determined by what approach to use for getXXXBuffer(). (based on (1) and
(2)).

We wanted to open up this topic now so that to get some feedback on what
could be an option here. Since it is an user facing Interface we need to be
careful with this.

I would suggest that whenever a Cell is *BB backed*(Onheap or offheap)
always *hasArray() would be false* in that Cell impl.

Every where we would use getXXXBuffer() along with getXXXOffest() and
getXXXLength(). Even in case of KV we could wrap the byte[] with BB so that
we have uniformity through the read code and we don't have too many 'if'
else conditions.

When ever *hasArray() is false* - using getXXXArray() API would throw
*UnSupportedOperation
Exception*.

As said if we want *getXXXArray()* to be supported as per the existing way
then getXXXBuffer() and getXXXOffset(), getXXXLength() should be clearly
documented.

Thoughts!!!

Regards
Ram & Anoop

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message