hbase-user mailing list archives

From Wilm Schumacher <wilm.schumac...@gmail.com>
Subject Re: Streaming data to htable
Date Fri, 13 Feb 2015 14:14:01 GMT
On 13.02.2015 at 10:39, Sleiman Jneidi wrote:
> I would go with second option, HtableInterface.put(List<Put>). The first
> option sounds dodgy, where 5 minutes is a good time for things to go wrong
> and you lose your data

I agree with Sleiman. In my opinion the "multi put" option is the best plan.

The time an HBase client needs for a "multi put" of 10, 100 or 1000 Puts is
nearly the same, because the fixed overhead of the operation dominates. As
a rule of thumb, the larger the list of Puts gets, the more efficient the
application will be.
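
To illustrate that rule of thumb, here is a toy cost model in Java. It
assumes each call costs a fixed overhead plus a small amount per Put. The
class name and both constants are invented for illustration and are not
measured HBase numbers.

```java
/** Toy cost model; the constants are invented for illustration, not measured. */
class BatchCostModel {
    static final long OVERHEAD_MICROS = 5_000; // assumed fixed cost per call
    static final long PER_PUT_MICROS = 1;      // assumed marginal cost per Put

    /** Total time of one "multi put" call with the given number of Puts. */
    static long totalMicros(int batchSize) {
        return OVERHEAD_MICROS + (long) batchSize * PER_PUT_MICROS;
    }

    /** Time per Put once the fixed overhead is spread over the batch. */
    static double microsPerPut(int batchSize) {
        return (double) totalMicros(batchSize) / batchSize;
    }

    public static void main(String[] args) {
        for (int n : new int[] {10, 100, 1000}) {
            System.out.printf("batch=%d total=%dus perPut=%.1fus%n",
                    n, totalMicros(n), microsPerPut(n));
        }
    }
}
```

Under these assumed constants the total time per call barely changes
between 10 and 1000 Puts, while the time per Put drops by roughly two
orders of magnitude.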

I would build a simple system of three threads. The first is the
streaming thread, which consumes the streamed data, generates a "Put" for
each record and adds it to an ArrayList. As soon as the ArrayList holds more
than 1M or 10M elements (such numbers are quite common and realistic) OR 5
minutes have passed (the second thread does the timing), a new ArrayList is
created for incoming data and the full one is handed to the third thread to
be put into HBase. This way you never lose streamed data because the
streaming thread is blocked on a put, and the buffered data is never older
than 5 minutes. The implementation should be very easy.
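
A minimal sketch of that three-thread design, using only the JDK. The class
BatchingBuffer and its "sink" callback are hypothetical names and not part
of HBase. The sink stands in for whatever performs the real write, and
error handling, retries and a bound on the writer queue are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical helper (not an HBase class): buffers items, hands them off in batches. */
class BatchingBuffer<T> implements AutoCloseable {
    private final int maxBatchSize;
    private final Consumer<List<T>> sink; // assumed callback that does the actual write
    // Third thread: writes full batches, one at a time, in order.
    private final ExecutorService writer = Executors.newSingleThreadExecutor();
    // Second thread: forces a flush every maxAgeMillis.
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor();
    private List<T> current = new ArrayList<>();

    BatchingBuffer(int maxBatchSize, long maxAgeMillis, Consumer<List<T>> sink) {
        this.maxBatchSize = maxBatchSize;
        this.sink = sink;
        timer.scheduleAtFixedRate(this::flush, maxAgeMillis, maxAgeMillis,
                TimeUnit.MILLISECONDS);
    }

    /** Called by the first (streaming) thread; never waits for a write. */
    synchronized void add(T item) {
        current.add(item);
        if (current.size() >= maxBatchSize) {
            flush();
        }
    }

    /** Swaps in a fresh list and queues the full one for the writer thread. */
    synchronized void flush() {
        if (current.isEmpty()) {
            return;
        }
        List<T> full = current;
        current = new ArrayList<>();
        writer.execute(() -> sink.accept(full));
    }

    /** Stops the timer, writes what is left and waits for the writer to finish. */
    @Override
    public void close() {
        timer.shutdownNow();
        flush();
        writer.shutdown();
        try {
            writer.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

With HBase, the sink would presumably wrap the put(List<Put>) call
discussed above and handle its IOException. The queue in front of the
writer thread is unbounded here, so if writes are slower than the stream
for a long time, memory use grows and some form of back-pressure would be
needed.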

As the OP asks for more options: there is a third one. You could buffer in
another system that does not have the same overhead problem. E.g. you
could dump the data into an SQL table first and then run over that with
MapReduce or whatever. But that's not a clever option; I just added it for
completeness.

But like Sleiman, I think your second option is the way to go!

Best wishes

Wilm
