hbase-dev mailing list archives

From lars hofhansl <la...@apache.org>
Subject Re: feature request and question: "BigPut" and "BigGet"
Date Mon, 09 Mar 2015 04:01:09 GMT
Thanks for looking into this Wilm.
I would honestly suggest just writing larger LOBs directly into HDFS and storing only their
location in HBase.
You can do that with a relatively simple protocol, with reasonable safety:
1. Write the metadata row into HBase
2. Write the LOB into HDFS
3. When the LOB was written, update the metadata row with the LOB's location.
4. Report success back to the client

If the LOB is small... maybe < 1 MB, you'd just write it into HBase as a value (preferably
into a different column family).

If the process fails at #2 or #3 you'd have an orphaned file in HDFS, but those are easy to
find (metadata rows for which the location is unset, and older than - say - a few days)

Your BigPut and BigGet could just be an API around this process.
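
A minimal sketch of the protocol above, assuming in-memory maps as stand-ins for the HBase table and the HDFS directory. The class name, the "/lobs/" path layout, and the byte[] payload are hypothetical simplifications for illustration; a real implementation would stream the LOB and use the HBase and HDFS client APIs instead.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch only: the two maps stand in for an HBase table and an HDFS directory. */
class LobStoreSketch {
    /** Roughly 1 MB; values below this are stored inline in the row. */
    static final int INLINE_LIMIT = 1 << 20;

    /** One metadata row; location stays null until the LOB is fully written. */
    static final class MetaRow {
        final long createdMillis;
        byte[] inlineValue; // small values live directly in the row
        String location;    // path of the LOB file, set in step 3

        MetaRow(long createdMillis) {
            this.createdMillis = createdMillis;
        }
    }

    final Map<String, MetaRow> table = new HashMap<>(); // stand-in for HBase
    final Map<String, byte[]> files = new HashMap<>();  // stand-in for HDFS

    /** Returns true once every step has completed (step 4: report success). */
    boolean bigPut(String rowKey, byte[] data, long nowMillis) {
        MetaRow row = new MetaRow(nowMillis);
        if (data.length < INLINE_LIMIT) {
            row.inlineValue = data;       // small value: keep it in the row itself
            table.put(rowKey, row);
            return true;
        }
        table.put(rowKey, row);           // step 1: metadata row, location unset
        String path = "/lobs/" + rowKey;  // hypothetical layout, derivable from the key
        files.put(path, data);            // step 2: write the LOB
        row.location = path;              // step 3: record the location
        return true;                      // step 4: report success
    }

    /** Returns the stored bytes, or null if the row is missing or incomplete. */
    byte[] bigGet(String rowKey) {
        MetaRow row = table.get(rowKey);
        if (row == null) {
            return null;
        }
        if (row.inlineValue != null) {
            return row.inlineValue;
        }
        return row.location == null ? null : files.get(row.location);
    }

    /** Row keys whose location is still unset and that are older than maxAgeMillis. */
    List<String> findIncomplete(long nowMillis, long maxAgeMillis) {
        List<String> stale = new ArrayList<>();
        for (Map.Entry<String, MetaRow> e : table.entrySet()) {
            MetaRow row = e.getValue();
            boolean unset = row.inlineValue == null && row.location == null;
            if (unset && nowMillis - row.createdMillis > maxAgeMillis) {
                stale.add(e.getKey());
            }
        }
        return stale;
    }
}
```

Because the file path in this sketch is derived from the row key, a periodic sweep over the keys returned by findIncomplete would be enough to locate and delete any file left behind by a failure at step 2 or 3.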

-- Lars

     From: Wilm Schumacher <wilm.schumacher@gmail.com>
 To: dev@hbase.apache.org 
 Sent: Sunday, March 8, 2015 7:55 PM
 Subject: feature request and question: "BigPut" and "BigGet"

I have an idea for a feature in hbase which directly derives from the
idea of the MOB feature. As Jonathan Hsieh pointed out, the only thing
that limits the feature to MOBs instead of LOBs is the memory
allocation on client and server side. However, the "LOB feature" would
be very handy for me and I think for some other users, too. Furthermore,
the problem of fetching small files fast could be solved.

The natural solution would be a "BigPut" and a "BigGet" class that
address this problem and are capable of dealing with large amounts
of data without using too much memory. My plan for now is to create
classes that do e.g.
BigPut BigPut.add( byte[] , byte[] , inputstream )
outputstream BigResult.value( byte[] , byte[] )
(in addition to the normal byte[] to byte[] member functions)

and pass the input streams through the AsyncProcess class to the RPC or,
in reverse, the output stream to the BigResult class. With this plan the
client and server would have to spawn some threads to deal with
multiple streams[1].
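
The memory argument behind the stream-based signatures can be illustrated with a small sketch: data is moved in fixed-size chunks, so memory use is bounded by the buffer size rather than by the size of the LOB. The class and method names are hypothetical and are not part of hbase-client; only java.io is used.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

/** Sketch only: bounded-memory transfer between two streams. */
class ChunkedCopySketch {
    /** Copies everything from in to out using one reusable buffer; returns bytes copied. */
    static long copy(InputStream in, OutputStream out, int chunkSize) throws IOException {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        byte[] buffer = new byte[chunkSize]; // the only large allocation, reused per chunk
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        return total;
    }
}
```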

So far I have dug into the hbase-client (2.0.0) sources and I think that my
plan would be quite invasive to the existing code ... but it is doable.
However, given the very open development model of hbase features, I
think it could be addressed.

But I'm veeeery new to hbase development and have just started to read the
source. Before I dig too deep into the problem I wanted to ask here whether
there is any show stopper I'm missing so far.
To make a list of questions for that feature:
* As this plan probably won't break the thread model of the
hbase-client, is there any problem on the (region) server side? Or is
there any blocking/race condition problem elsewhere that I'm missing?
* Is it a bad plan to pump several 100s of MB through one RPC in a
separate thread? If yes ... why?
* Are there any other fundamental problems I'm missing that make
this a horrible plan?
* Is there already some dev work ongoing? I didn't find anything on jira.
But that doesn't mean anything :/
* Does anyone have a better name than "BigPut" :D?

And at last:
* Is it a better plan to create a separate "MOB/LOB service"?[2]

Best wishes


[1] or one could limit the number of streams to one. That would make the
threading problem much simpler to handle, as only one
"RPC" would be necessary.

[2] on one hand it is easier to bear LOBs in mind if you create a
service e.g. with a REST interface (multipart data etc.), on the other
hand you have to reinvent the wheel (compaction etc.)
