hbase-issues mailing list archives

From "Karthik Ranganathan (Created) (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HBASE-5783) Faster HBase bulk loader
Date Fri, 13 Apr 2012 19:20:17 GMT
Faster HBase bulk loader

                 Key: HBASE-5783
                 URL: https://issues.apache.org/jira/browse/HBASE-5783
             Project: HBase
          Issue Type: New Feature
          Components: client, ipc, performance, regionserver
            Reporter: Karthik Ranganathan
            Assignee: Nicolas Spiegelberg

Based on an (admittedly hacky) prototype of this approach, we can get a 3x to 4x gain
over the MR bulk loader for very large data sets by doing the following:

1. Do direct multi-puts from the HBase client using GZIP-compressed RPCs
2. Turn off WAL (we will ensure no data loss in another way)
3. For each bulk load client, we need to:
3.1 do a put
3.2 get back a tracking cookie (memstoreTs or HLogSequenceId) per put
3.3 be able to ask the RS if the tracking cookie has been flushed to disk
4. Each client succeeds if the tracking cookie for the last put it did (for
every RS) makes it to disk. Otherwise the map task fails and is retried.
5. If the last put did not make it to disk within a timeout (say a second or so) we issue a manual flush.

- Increase the memstore size so that we flush larger files
- Decrease the compaction ratios (say increase the number of files to compact)
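
The put / tracking-cookie / flush-check protocol in steps 3 to 5 can be sketched as below. This is only an illustration, not HBase code: `StubRegionServer`, `put`, `isFlushed`, `flush` and `awaitDurable` are hypothetical names, the region server is an in-memory stand-in, and the sketch assumes that the manual action in step 5 is a memstore flush. A real client would keep one "last cookie" per RS and check all of them.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicLong;

/** Illustrative only: every name here is hypothetical, none is an HBase API. */
final class TrackingCookieSketch {

    /** In-memory stand-in for a region server that skips the WAL. */
    static final class StubRegionServer {
        private final AtomicLong nextSeqId = new AtomicLong(0);
        private volatile long flushedUpTo = 0; // highest sequence id assumed to be on disk

        /** Steps 3.1 and 3.2: accept a put and hand back its tracking cookie. */
        long put(byte[] row, byte[] value) {
            // The stub ignores the data and only assigns a sequence id.
            return nextSeqId.incrementAndGet();
        }

        /** Step 3.3: has everything up to this cookie been flushed to disk? */
        boolean isFlushed(long cookie) {
            return cookie <= flushedUpTo;
        }

        /** Simulated flush: everything accepted so far is treated as durable. */
        void flush() {
            flushedUpTo = nextSeqId.get();
        }
    }

    /**
     * Steps 4 and 5: poll until the last cookie is flushed. After the timeout,
     * force a flush (assumed meaning of step 5) and check once more.
     * Returns true if the task may succeed, false if it should be retried.
     */
    static boolean awaitDurable(StubRegionServer rs, long lastCookie, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (System.nanoTime() < deadline) {
            if (rs.isFlushed(lastCookie)) {
                return true;
            }
            try {
                Thread.sleep(5);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false; // treat interruption like a failed task
            }
        }
        rs.flush();
        return rs.isFlushed(lastCookie);
    }

    public static void main(String[] args) {
        StubRegionServer rs = new StubRegionServer();
        long last = 0;
        for (int i = 0; i < 3; i++) {
            last = rs.put(("row" + i).getBytes(StandardCharsets.UTF_8), new byte[] {1});
        }
        boolean ok = awaitDurable(rs, last, 50);
        System.out.println(ok ? "task succeeded" : "task failed, retry");
    }
}
```

Only the cookie of the last put needs checking because cookies increase monotonically per RS in this sketch, so a flushed last cookie implies that all earlier puts are on disk as well.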

Quick background:

The bottleneck in the multi-put approach is that the data is transferred *uncompressed* twice
over the top-of-rack: once from the client to the RS (on the multi-put call) and again because
of the WAL (HDFS replication). We reduced the former with RPC compression and eliminated the latter
as described above, while still guaranteeing that data won't be lost.
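
To illustrate the RPC compression part, here is a minimal sketch that GZIP-compresses a serialized batch of puts before it would go on the wire. It uses only `java.util.zip` from the JDK and is not HBase's RPC layer; `GzipPayloadSketch`, `samplePayload` and the row format are made up for the example, and the actual savings depend entirely on the data.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Illustrative only: compresses a fake multi-put payload with the JDK's GZIP streams. */
final class GzipPayloadSketch {

    /** Client side: compress the serialized batch before sending it. */
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The stream is closed at this point, so the GZIP trailer has been written.
        return out.toByteArray();
    }

    /** Server side: restore the original bytes. */
    static byte[] gunzip(byte[] compressed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = gz.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Hypothetical stand-in for a serialized multi-put: one text line per row. */
    static byte[] samplePayload(int rows) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < rows; i++) {
            sb.append(String.format("row-%08d,cf:col,value-AAAA%n", i));
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = samplePayload(1000);
        byte[] packed = gzip(raw);
        System.out.println("raw bytes: " + raw.length + ", gzip bytes: " + packed.length);
    }
}
```

The trade-off is CPU on both client and RS in exchange for fewer bytes over the top-of-rack, which is where the text above locates the bottleneck.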

At a high level, this is better than the MR bulk loader because we don't need to merge sort
all the files for a given region and then make them into an HFile - that's the equivalent of bulk loading
AND major compacting in one shot. Also, there is much more disk I/O involved in the MR method (sort/spill).


