hbase-user mailing list archives

From Marc Limotte <mslimo...@gmail.com>
Subject Re: hbase bulk load / table split
Date Tue, 04 Jan 2011 18:43:19 GMT
Thanks for the suggestion, Michael.  I could give that a shot.

I'm still wondering what the system is currently doing.  Is it trying to
split that one region?  Why is it taking so long?  Any way to check the


On Tue, Jan 4, 2011 at 6:03 AM, Marc Limotte <mslimotte@gmail.com> wrote:

> I've made some good progress using the HBase Bulk Load Tool with HBase
> 0.89.20100924+28.
> My initial implementation did not have importtsv do compression, and it ran
> directly on the HBase cluster's Hadoop.  It's been working OK for a while
> (but slowly, because of limited resources).
> My next implementation, as discussed in another thread, has compression
> settings turned on for importtsv (thanks, Lars).  And I am running the
> importtsv on a remote cluster and then distcp'ing (thanks, Todd) the results
> to the HBase cluster for the completebulkload step.
> I'm trying this out with a fresh (empty) HBase table.  So, the first run of
> importtsv takes a very long time, because the table only has one region, so
> it starts only one Reducer.
>    - Bulk load into a new table
>    - About 20 GB of data (compressed with gzip)
>    - Created one massive region
> It seemed to complete successfully.  But we are seeing some intermittent
> errors (missing blocks and such).
> Could not obtain block: blk_-5944324410280250477_429443
>> file=/hbase/mytable/7c2b09e1ef8c4984732f362d7876305c/metrics/7947729174003011436
> The initial region seems to have split once, but I'm not sure the split
> completed, since the key ranges overlap and the storeFileSizeMB seems to be
> about as big as it started out.  My theory is that the initial load is too
> large for a region, and the split either failed or is still in progress.
> Both regions are on the same Region Server:
>> mytable,ad_format728x90site_category2advertiser14563countrysepublisher2e03ab73-b234-4413-bcee-6183a99bd840starttime1293897600,1294094158507.2360f0a03e2566c72ea1a07c40f5f296.
>> stores=2, storefiles=1075, storefileSizeMB=19230, memstoreSizeMB=0,
>> storefileIndexSizeMB=784
>> --
>> mytable,,1294094158507.33b1e47c5fb004aa801b0bd88ce8322d.
>> stores=2, storefiles=1083, storefileSizeMB=19546, memstoreSizeMB=0,
>> storefileIndexSizeMB=796
> Another new table on this same HBase cluster, loaded around the same time,
> has already split into 69 regions (storefileSizeMB 200 - 400 each).  This one
> was loaded in smaller chunks with importtsv running directly on the hbase
> cluster, but also with compression on.
> Now that I've gotten all the background down, here are my questions:
>    1. Is it still working on the split?  Any way to monitor progress?
>    2. Can I force more splits?
>    3. Should I have done something first to avoid having the bulk load
>    create one big region?
>    4. Would it be easier to split if my initial bulkload was not gzip
>    compressed?
>    5. Am I looking in the wrong place entirely for this issue?
> thanks,
> Marc
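
On question 3 above: since a table with one region gets only one Reducer, one possible approach is to create the table with several regions before the first importtsv run. The sketch below only shows how split keys might be picked from a sample of row keys. It is not part of HBase; choose_split_keys and region_for_key are hypothetical helper names, the sample is assumed to be representative of the real key distribution, and it assumes your HBase version lets you create a table with explicit split keys.

```python
import bisect


def choose_split_keys(sample_keys, num_regions):
    """Pick up to num_regions - 1 split keys from a sample of row keys.

    Hypothetical helper (not an HBase API). Keys are bytes, so Python's
    ordering is byte-wise lexicographic, the same ordering HBase uses
    for row keys.
    """
    if num_regions < 1:
        raise ValueError("num_regions must be at least 1")
    keys = sorted(set(sample_keys))
    splits = []
    for i in range(1, num_regions):
        idx = i * len(keys) // num_regions
        # Skip index 0 (it would leave an empty first region) and
        # skip repeats, which occur when the sample is very small.
        if idx == 0 or (splits and keys[idx] == splits[-1]):
            continue
        splits.append(keys[idx])
    return splits


def region_for_key(split_keys, row_key):
    """Return the 0-based index of the region a row key would fall in.

    A region's start key is inclusive, so a row key equal to a split
    key belongs to the region that starts at that split key.
    """
    return bisect.bisect_right(split_keys, row_key)


if __name__ == "__main__":
    # Made-up sample keys, for illustration only.
    sample = [b"row%03d" % i for i in range(100)]
    splits = choose_split_keys(sample, 4)
    print(splits)
    print(region_for_key(splits, b"row025"))
```

With N regions in place before the load, the expectation is that importtsv would start N Reducers instead of one, and each region would receive roughly 1/N of the data if the sample is representative.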
