hbase-user mailing list archives

From Jonathan Gray <jl...@streamy.com>
Subject Re: Fast importing into HBase (bypassing RegionServer)
Date Tue, 28 Jul 2009 16:19:33 GMT
Though HBase imports are fairly fast, they would probably be 5-10x 
faster with a straight-to-hfile import method.

Once we get 0.20.0 shipped, we should have more time to spend on 
actually implementing this, though anyone is welcome to take a shot. 
Stack described it well.


Ryan Rawson wrote:
> The last time I seriously looked at this, it was to address serious
> performance issues with HBase.  I eventually fixed those performance
> issues, and so dropped the idea.
> -ryan
> On Mon, Jul 27, 2009 at 1:52 PM, stack<stack@duboce.net> wrote:
>> Latest thinking is to write an MR job whose reducer writes hfiles that are
>> just under a region in size (<256M).  When the reducer has written about 240MB,
>> it opens a new file.  (It may need a custom ReduceRunner to keep account of
>> what's been written and to rotate the file.)
>> After the MR job has finished, a script would come along and move the hfiles
>> into the appropriate directory structure.  Each hfile would be the sole content
>> of its region.  The script would read each hfile's first and last keys from its
>> metadata and then, using this metainfo along with a table format specified
>> externally, insert an entry into .META. per region (see the scripts in bin
>> -- copy and rename table -- for examples of how to manipulate .META.).
>> Someone needs to just do it.  We've been talking about it forever.
>> St.Ack
>> P.S. Here is older thinking on the topic
>> https://issues.apache.org/jira/browse/HBASE-48
>> On Mon, Jul 27, 2009 at 1:31 PM, tim robertson <timrobertson100@gmail.com> wrote:
>>> Hi all,
>>> Ryan wrote on a different thread:
>>> "It should be possible to randomly insert data from a pre-existing
>>> data set.  There is some work to directly import straight into hfiles,
>>> skipping the regionserver, but that would only really work for
>>> one-time imports to new tables."
>>> Could someone please elaborate on this a little and outline the steps
>>> needed?  Do you write an hfile in a custom mapreduce output format and
>>> then somehow write the table metadata file afterwards?
>>> Cheers,
>>> Tim
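
The reducer-side rotation Stack describes above could be sketched roughly as
below.  This is only an illustration of the size-based rollover, not HBase
code: the real job would use HBase's hfile writer, whereas this sketch stands
it in with plain local files (the class name and the tiny 64-byte ceiling are
made up for the example; the real ceiling would be ~240MB):

```java
import java.io.*;
import java.nio.file.*;

// Sketch of the rollover logic: append sorted key/values, and when the
// current file would grow past the ceiling, close it and open the next one,
// so each finished file stays just under the region-size limit.
public class RotatingRegionWriter implements Closeable {
    private final Path dir;
    private final long maxBytes;   // ~240MB in the real job; tiny here
    private long written = 0;
    private int fileIndex = 0;
    private BufferedWriter current;

    public RotatingRegionWriter(Path dir, long maxBytes) throws IOException {
        this.dir = dir;
        this.maxBytes = maxBytes;
        Files.createDirectories(dir);
        roll();
    }

    // Open the next output file; in the real job this would be a new hfile.
    private void roll() throws IOException {
        if (current != null) current.close();
        Path next = dir.resolve(String.format("part-%05d", fileIndex++));
        current = Files.newBufferedWriter(next);
        written = 0;
    }

    // Append one key/value; rotate first if this record would push the
    // current file past the ceiling.
    public void write(String key, String value) throws IOException {
        long recordSize = key.length() + value.length() + 2;
        if (written > 0 && written + recordSize > maxBytes) roll();
        current.write(key + "\t" + value + "\n");
        written += recordSize;
    }

    @Override public void close() throws IOException { current.close(); }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempDirectory("hfile-sketch");
        try (RotatingRegionWriter w = new RotatingRegionWriter(out, 64)) {
            for (int i = 0; i < 10; i++) w.write("row" + i, "value" + i);
        }
    }
}
```

The follow-up script step (reading first/last keys from each file's metadata
and inserting a .META. row per region) is deliberately left out, since it
depends entirely on the hfile format and the bin scripts Stack points at.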
