hbase-user mailing list archives

From Nanheng Wu <nanhen...@gmail.com>
Subject Re: Bulk load questions
Date Wed, 29 Dec 2010 20:48:57 GMT
I am trying a different approach right now: the MR job I am running
uses the identity mapper and a custom comparator to randomize the keys
(input keys are sorted). The inserts happen in the reducer, which does
very little work. My job is still running very slowly. All my nodes
seem to be underutilized in terms of CPU (< 50%), and in the HBase
UI their "usedHeap" hovers below 700MB even though I set
maxHeap to 2GB. What should I look into to improve the performance?
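
To illustrate what randomizing the key order is meant to achieve, here is a
minimal Python sketch. It is not the actual job: the key format, the split
points, the window size, and the use of MD5 as a stand-in for the custom
comparator are all assumptions made for the example. It counts how many
regions a short run of consecutive writes touches when keys arrive sorted
versus in hash order.

```python
import bisect
import hashlib

# Hypothetical table: 10,000 sorted row keys, pre-split into 4 regions.
# Key format and split points are assumptions for illustration only.
keys = ["row%05d" % i for i in range(10000)]
splits = ["row02500", "row05000", "row07500"]


def region_for(key):
    """Index of the region whose key range contains `key`."""
    return bisect.bisect_right(splits, key)


def hash_order(key):
    """Sort key standing in for a randomizing comparator (assumption: MD5)."""
    return hashlib.md5(key.encode("utf-8")).hexdigest()


def regions_hit(ordered_keys, window=100):
    """Distinct regions touched by the first `window` writes."""
    return {region_for(k) for k in ordered_keys[:window]}


sorted_hits = regions_hit(sorted(keys))
shuffled_hits = regions_hit(sorted(keys, key=hash_order))
print(len(sorted_hits), len(shuffled_hits))
```

In sorted order the first 100 writes all land in one region, while in hash
order they should be spread over all four. Note that this only helps when the
table already has more than one region; with a single region every write goes
to the same region server regardless of order.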

On Mon, Dec 27, 2010 at 9:10 PM, Stack <stack@duboce.net> wrote:
> TOF has an HBase client HTable in it.  It's certainly easier using TOF.
>  Unless you have special needs, I'd stick w/ TOF.
> Good luck,
> St.Ack
> On Mon, Dec 27, 2010 at 1:03 PM, Nanheng Wu <nanhengwu@gmail.com> wrote:
>> Thanks for the answers. I will use these as my basis for
>> investigation. I am using a mapper only job, is it better to use the
>> HBase client to write to HBase or TableOutputFormat?
>> On Mon, Dec 27, 2010 at 8:38 AM, Stack <stack@duboce.net> wrote:
>>> On Mon, Dec 27, 2010 at 1:54 AM, Nanheng Wu <nanhengwu@gmail.com> wrote:
>>>> I am running some tests to load data from HDFS into HBase in a MR job.
>>>> I am pretty new to HBase and I have some questions regarding bulk load
>>>> performance: I have a small cluster with 4 nodes, I set up one node to
>>>> run Namenode/JobTracker/ZK, and the other three nodes all run
>>>> TaskTracker/DataNode/HRegion. During my test I am seeing about 1300
>>>> inserts per second total and it feels kind of slow.
>>> I don't know what your hardware is like but yeah, it sounds kinda slow.
>>>> My rows are pretty
>>>> small ~250 bytes. I am wondering if it is a good idea to be running MR
>>>> on all nodes. Would it be better if I run MR load job on separate
>>>> nodes?
>>> Well, where do you think the time is being spent?  What is holding up
>>> the job do you think?  Is your MR job doing any massaging of the data?
>>>  Do you have many concurrent mappers running at the same time on each node?
>>> Does your MR job do a map and reduce or just a map?  Is it the insert
>>> into hbase that is slow?  What do the hbase logs say?  Are they
>>> blocking because they are flushing memory?
>>>> Also I observe that one task tracker's CPU usage was twice as
>>>> high as the other two.
>>> Maybe it's the one that is doing the inserting?  How many regions in
>>> your hbase cluster?  When you look at hbase UI, is load being spread
>>> across the hbase cluster or are you just hitting one node?
>>> St.Ack
>>>> I can't figure out why that is, does that
>>>> indicate some hot spots in the cluster? I'd really appreciate some
>>>> ideas, and please let me know if my description is not specific or
>>>> detailed enough and what other information I can provide to help
>>>> diagnose the problem. Thanks!
