hbase-dev mailing list archives

From "Jonathan Gray (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-48) [hbase] Bulk load tools
Date Fri, 18 Sep 2009 14:51:16 GMT

    [ https://issues.apache.org/jira/browse/HBASE-48?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12757186#action_12757186 ]

Jonathan Gray commented on HBASE-48:

The MR job is working tremendously well for me.  I'm able to saturate the entire cluster
almost instantly during an upload, and it remains saturated until the end: full CPU usage and
lots of io-wait, so I'm disk-io-bound, as I should be.

I did a few runs of a job that imported between 1M and 10M rows, each row containing a random
number of columns between 1 and 1000.  In the end, I imported between 500M and 5B KeyValues.
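As a quick sanity check on those totals (my arithmetic, not part of the original report): with per-row column counts uniform on 1..1000, the expected count is about 500 columns per row, so 1M rows should produce roughly 500M KeyValues and 10M rows roughly 5B:

```python
def expected_keyvalues(rows, min_cols=1, max_cols=1000):
    """Expected KeyValue count when each row gets a uniform-random
    number of columns in [min_cols, max_cols]."""
    avg_cols = (min_cols + max_cols) / 2.0  # uniform mean = 500.5 here
    return rows * avg_cols

print(expected_keyvalues(1_000_000))   # ~500M KeyValues for the 1M-row run
print(expected_keyvalues(10_000_000))  # ~5B KeyValues for the 10M-row run
```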

On a 5 node cluster of 2core/2gb/250gb nodes, I could import 1M rows / 500M keys in 7.5 minutes
(2.2k rows/sec, 1.1M keys/sec).

On a 10 node cluster of 4core/4gb/500gb nodes, I could do the same import in 2.5 minutes.
On this larger cluster I also ran the same job with 10M rows / 5B keys; it finished in 25
minutes (6.6k rows/sec, 3.3M keys/sec).
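The quoted rates follow directly from the run times; a quick check (mine, not from the original message):

```python
def throughput(rows, keys, minutes):
    """Return (rows/sec, keys/sec) for an import of the given size and duration."""
    secs = minutes * 60
    return rows / secs, keys / secs

# 5-node run: 1M rows / 500M keys in 7.5 minutes
rows_s, keys_s = throughput(1_000_000, 500_000_000, 7.5)
print(f"{rows_s:.0f} rows/sec, {keys_s / 1e6:.1f}M keys/sec")  # 2222 rows/sec, 1.1M keys/sec

# 10-node run: 10M rows / 5B keys in 25 minutes
rows_s, keys_s = throughput(10_000_000, 5_000_000_000, 25)
print(f"{rows_s:.0f} rows/sec, {keys_s / 1e6:.1f}M keys/sec")  # 6667 rows/sec, 3.3M keys/sec
```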

Previously, running HTable-based imports on these clusters, I was seeing between 100k and 200k
keys/sec, so this represents a 5-15X speed improvement.  In addition, the imports finish without
any problem (pushing these imports through the normal HBase API would have killed the little
cluster).

I think there is a bug in the ruby script, though.  It worked sometimes, but other times it
ended up hosing the cluster until I restarted.  Things worked fine after a restart.

Still digging...

> [hbase] Bulk load tools
> -----------------------
>                 Key: HBASE-48
>                 URL: https://issues.apache.org/jira/browse/HBASE-48
>             Project: Hadoop HBase
>          Issue Type: New Feature
>            Reporter: stack
>            Priority: Minor
>         Attachments: 48-v2.patch, 48-v3.patch, 48-v4.patch, 48-v5.patch, 48.patch, loadtable.rb
> HBase needs tools to facilitate bulk upload and possibly dumping.  Going via the current
> APIs, particularly if the dataset is large and cell content is small, uploads can take a
> long time even when using many concurrent clients.
> PNUTS folks talked of a need for a different API to manage bulk upload/dump.
> Another notion would be to have the bulk load tools somehow write regions directly
> in HDFS.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
