hbase-user mailing list archives

From tsuna <tsuna...@gmail.com>
Subject Re: asynchronous hbase for batch inserts?
Date Sat, 05 Feb 2011 09:58:49 GMT
On Fri, Feb 4, 2011 at 4:43 AM, Shuja Rehman <shujamughal@gmail.com> wrote:
> More specific can provide example equivalent to this
>
> for(i=0; i<N; i++) {
>  list.add(putitem[i]);
> }
> htable.put(list);

The equivalent would be:

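import com.stumbleupon.async.Callback;
import org.hbase.async.HBaseClient;
import org.hbase.async.PutRequest;

// Callback is an interface, so the anonymous class needs `()', and the
// method to implement is call(), not run().
Callback<Object, Object> callback = new Callback<Object, Object>() {
  public Object call(Object arg) {
    // Do whatever you want on a successful write.
    return arg;
  }
  public String toString() {
    return "handle successful write";
  }
};

Callback<Object, Object> errback = new Callback<Object, Object>() {
  public Object call(Object arg) {
    // Do whatever you want on a failed write.
    return arg;
  }
  public String toString() {
    return "handle failed write";
  }
};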

PutRequest[] putitem = ...;
HBaseClient client = ...;
for (int i = 0; i < N; i++) {
  client.put(putitem[i]).addCallbacks(callback, errback);
}

For each PutRequest, either `callback' or `errback' will be called
asynchronously from a different thread (you can't control which)
whenever the request has completed.  If you only want to handle
failures (which is common), you can do:

for (int i = 0; i < N; i++) {
  client.put(putitem[i]).addErrback(errback);
}

For more on the Deferred API, please read
http://www.tsunanet.net/~tsuna/async/api/com/stumbleupon/async/Deferred.html
Deferred is a very powerful API for any kind of asynchronous processing.


So overall the code remains the same except that:
 * You use callbacks to get the result of your operation asynchronously.
 * You don't give a whole list to the client at once; you give it
requests one by one, and it does the batching internally anyway.
 * You must be prepared to handle the response from another thread
(you don't know which), so your callbacks need to be thread-safe and
call thread-safe APIs.
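
To illustrate the last point, here is a minimal sketch of callback state
that is safe to update from any thread. It uses only the JDK: WriteStats
and simulate() are hypothetical names invented for this example, and the
thread pool merely stands in for whatever threads the client invokes your
callbacks from. It is not asynchbase code.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper: counters that callbacks on any thread can update. */
final class WriteStats {
  // AtomicLong makes each increment atomic, so no lock is needed.
  private final AtomicLong succeeded = new AtomicLong();
  private final AtomicLong failed = new AtomicLong();

  void recordSuccess() { succeeded.incrementAndGet(); }
  void recordFailure() { failed.incrementAndGet(); }

  long succeeded() { return succeeded.get(); }
  long failed() { return failed.get(); }

  /**
   * Simulates `writes' completions arriving on `threads' threads.
   * Every 10th write is treated as a failure (an arbitrary assumption).
   */
  static WriteStats simulate(int writes, int threads) {
    final WriteStats stats = new WriteStats();
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    for (int i = 0; i < writes; i++) {
      final boolean ok = i % 10 != 0;
      pool.execute(() -> {
        if (ok) {
          stats.recordSuccess();
        } else {
          stats.recordFailure();
        }
      });
    }
    pool.shutdown();
    try {
      if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
        throw new IllegalStateException("simulated writes did not finish");
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return stats;
  }
}
```

In real code you would call recordSuccess() from the body of `callback'
and recordFailure() from the body of `errback'. A plain `long' counter
incremented with ++ would lose updates under the same conditions.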

Internally, asynchbase will route each PutRequest to the right region
server.  It uses both a size-based and time-based flush threshold
(e.g. either after N milliseconds or M edits, whichever reaches its
threshold first).  Depending on your workload, asynchbase can achieve
higher batching efficiency than HTable and lower latency.  In OpenTSDB
I've seen dramatic improvements of up to an order of magnitude.  In
case of failures due to region splits, asynchbase also behaves better:
instead of retrying all the failed edits in one batch, it retries each
edit individually, so edits that aren't going to the region being split
don't have to wait for the split to finish, unlike with HTable (with
multiPut at least).
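
The "size-based and time-based flush threshold" idea can be sketched as
follows. This is only an illustration of the general technique, not
asynchbase's actual implementation: EditBuffer, add() and tick() are
hypothetical names, and the caller passes the current time in explicitly
so the behaviour is easy to follow.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch: flush on M edits or N milliseconds, whichever comes first. */
final class EditBuffer<T> {
  private final int maxEdits;        // M: size threshold
  private final long maxAgeMillis;   // N: time threshold
  private final List<T> pending = new ArrayList<T>();
  private long oldestMillis;         // arrival time of the oldest pending edit

  EditBuffer(int maxEdits, long maxAgeMillis) {
    this.maxEdits = maxEdits;
    this.maxAgeMillis = maxAgeMillis;
  }

  /** Buffers an edit; returns the batch to send if the size threshold is hit. */
  synchronized List<T> add(T edit, long nowMillis) {
    if (pending.isEmpty()) {
      oldestMillis = nowMillis;
    }
    pending.add(edit);
    return pending.size() >= maxEdits ? drain() : Collections.<T>emptyList();
  }

  /** Called periodically; returns the batch if the oldest edit waited too long. */
  synchronized List<T> tick(long nowMillis) {
    if (!pending.isEmpty() && nowMillis - oldestMillis >= maxAgeMillis) {
      return drain();
    }
    return Collections.<T>emptyList();
  }

  // Hands the pending edits to the caller and starts a fresh batch.
  private List<T> drain() {
    List<T> batch = new ArrayList<T>(pending);
    pending.clear();
    return batch;
  }
}
```

A busy writer hits the size threshold and gets large batches, while a
slow writer is bounded by the time threshold, so no edit sits in the
buffer longer than N milliseconds plus the timer interval.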

-- 
Benoit "tsuna" Sigoure
Software Engineer @ www.StumbleUpon.com
