hbase-user mailing list archives

From Jonathan Gray <jl...@streamy.com>
Subject Re: Table Upload Optimization
Date Wed, 21 Oct 2009 22:35:59 GMT
That depends on how much memory you have on each node.  I recommend
setting the heap to 1/2 of total memory.
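For the 4 GB nodes described below, that works out to roughly 2 GB. As a
sketch (the exact value is yours to pick), HBASE_HEAPSIZE in
conf/hbase-env.sh is given in megabytes:

  # conf/hbase-env.sh -- maximum heap for the HBase daemons, in MB
  export HBASE_HEAPSIZE=2000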

In general, I do not recommend running with VMs... Running two HBase
nodes in VMs on a single physical machine vs. running one HBase node on
the same machine without a VM, I don't really see where you'd get any
benefit.

You should install something like Ganglia to help monitor the cluster.
Swap is reported by free, top, and just about anything else (as well as
Ganglia).

JG

Mark Vigeant wrote:
> Also, I updated the configuration and things seem to be working a bit better.
> 
> What's a good heap size to set?
> 
> -----Original Message-----
> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of stack
> Sent: Wednesday, October 21, 2009 12:46 PM
> To: hbase-user@hadoop.apache.org
> Subject: Re: Table Upload Optimization
> 
> On Wed, Oct 21, 2009 at 8:53 AM, Mark Vigeant
> <mark.vigeant@riskmetrics.com> wrote:
> 
>>> I saw this in your first posting: 10/21/09 10:22:52 INFO mapred.JobClient:
>>> map 100% reduce 0%.
>>> Is your job writing to HBase in the map task or in the reducer?  Are you
>>> using TableOutputFormat?
>> I am using TableOutputFormat and only a mapper. There is no reducer.
>> Would a reducer make things more efficient?
>>
>>
> No.  Unless you need the reduce step for some reason, avoid it.
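For reference, a minimal sketch of such a map-only upload job using
TableOutputFormat follows. It assumes the 0.20-era
org.apache.hadoop.hbase.mapreduce API; the class name, column family,
and record parsing are placeholders rather than the poster's actual
code.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class Upload {

  // Turns each input record into one Put.  The parsing here is a
  // stand-in for the real XML handling: it assumes tab-separated
  // "rowkey<TAB>value" lines purely for illustration.
  static class UploadMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\t", 2);
      Put put = new Put(Bytes.toBytes(fields[0]));            // row key
      put.add(Bytes.toBytes("content"), Bytes.toBytes("xml"), // hypothetical family:qualifier
              Bytes.toBytes(fields[1]));
      context.write(new ImmutableBytesWritable(put.getRow()), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new HBaseConfiguration();
    Job job = new Job(conf, "table-upload");
    job.setJarByClass(Upload.class);
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    job.setMapperClass(UploadMapper.class);
    // Map-only: each map task writes its Puts straight to the table.
    job.setNumReduceTasks(0);
    job.setOutputFormatClass(TableOutputFormat.class);
    job.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, args[1]);
    job.setOutputKeyClass(ImmutableBytesWritable.class);
    job.setOutputValueClass(Put.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

TableMapReduceUtil in the same package offers helpers that do much of
this output wiring as well.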
> 
> 
> 
> 
>>>> I'm using Hadoop 0.20.1 and HBase 0.20.0
>>>>
>>>> Each node is a virtual machine with 2 CPU, 4 GB host memory and 100 GB
>>>> storage.
>>>>
>>>>
>>> You are running DN, TT, HBase, and ZK on the above?  One disk shared by all?
>> I'm only running ZooKeeper on 2 of the above nodes, and then a TT, DN, and
>> regionserver on all.
>>
>>
> The zk cluster should have an odd number of nodes.
> 
> One disk shared by all?
> 
> 
> 
>>> That is the number of children running at any one time on a TaskTracker.
>>> You should start with just one since you have such an anemic platform.
>> Ah, and I can set that in the hadoop config?
>>
>>
> 
> <property>
>   <name>mapred.tasktracker.map.tasks.maximum</name>
>   <value>2</value>
>   <description>The maximum number of map tasks that will be run
>   simultaneously by a task tracker.
>   </description>
> </property>
> 
> 
> 
> St.Ack
> 
> 
> 
>>> You've upped filedescriptors and xceivers, all the stuff in 'Getting
>>> Started'?
>> And no, it appears as though I accidentally overlooked that beginning
>> stuff. Yikes. Ok.
>>
>> I will take care of those and get back to you.
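For reference, the 'Getting Started' items in question are raising the
open-file limit for the user running Hadoop and HBase (for example,
ulimit -n 32768 via /etc/security/limits.conf) and upping the DataNode
xceiver count in conf/hdfs-site.xml, along the lines of the sketch below
(the value shown is an example, not a prescription):

<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>2047</value>
  <description>Upper bound on the number of files a DataNode serves
  at any one time.  HBase needs this raised well above the small
  Hadoop default; note the property name really is misspelled.
  </description>
</property>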
>>
>>
> 
> 
>>> -----Original Message-----
>>> From: jdcryans@gmail.com [mailto:jdcryans@gmail.com] On Behalf Of
>>> Jean-Daniel Cryans
>>> Sent: Wednesday, October 21, 2009 11:04 AM
>>> To: hbase-user@hadoop.apache.org
>>> Subject: Re: Table Upload Optimization
>>>
>>> Well, the XMLStreamingInputFormat lets you map over XML files, which is
>>> neat, but it has a problem and always needs to be patched. I wondered if
>>> that was what was missing, but in your case it's not the problem.
>>>
>>> Did you check the logs of the master and region servers? Also I'd like to
>>> know
>>>
>>> - Version of Hadoop and HBase
>>> - Nodes' hardware
>>> - How many map slots per TT
>>> - HBASE_HEAPSIZE from conf/hbase-env.sh
>>> - Special configuration you use
>>>
>>> Thx,
>>>
>>> J-D
>>>
>>> On Wed, Oct 21, 2009 at 7:57 AM, Mark Vigeant
>>> <mark.vigeant@riskmetrics.com> wrote:
>>>> No. Should I?
>>>>
>>>> -----Original Message-----
>>>> From: jdcryans@gmail.com [mailto:jdcryans@gmail.com] On Behalf Of
>>>> Jean-Daniel Cryans
>>>> Sent: Wednesday, October 21, 2009 10:55 AM
>>>> To: hbase-user@hadoop.apache.org
>>>> Subject: Re: Table Upload Optimization
>>>>
>>>> Are you using the Hadoop Streaming API?
>>>>
>>>> J-D
>>>>
>>>> On Wed, Oct 21, 2009 at 7:52 AM, Mark Vigeant
>>>> <mark.vigeant@riskmetrics.com> wrote:
>>>>> Hey
>>>>>
>>>>> So I want to upload a lot of XML data into an HTable. I have a class
>>>>> that successfully maps up to about 500 MB of data or so (on one
>>>>> regionserver) into a table, but if I go for much bigger than that it takes
>>>>> forever and eventually just stops. I tried uploading a big XML file into my
>>>>> 4 regionserver cluster (about 7 GB) and it's been a day and it's still going
>>>>> at it.
>>>>> What I get when I run the job on the 4 node cluster is:
>>>>> 10/21/09 10:22:35 INFO mapred.LocalJobRunner:
>>>>> 10/21/09 10:22:38 INFO mapred.LocalJobRunner:
>>>>> (then it does that for a while until...)
>>>>> 10/21/09 10:22:52 INFO mapred.TaskRunner: Task
>>>>> attempt_local_0001_m_000117_0 is done. And is in the process of committing
>>>>> 10/21/09 10:22:52 INFO mapred.LocalJobRunner:
>>>>> 10/21/09 10:22:52 mapred.TaskRunner: Task
>>>>> 'attempt_local_0001_m_000117_0' is done.
>>>>> 10/21/09 10:22:52 INFO mapred.JobClient:   map 100% reduce 0%
>>>>> 10/21/09 10:22:58 INFO mapred.LocalJobRunner:
>>>>> 10/21/09 10:22:59 INFO mapred.JobClient: map 99% reduce 0%
>>>>>
>>>>>
>>>>> I'm convinced I'm not configuring hbase or hadoop correctly. Any
>>>>> suggestions?
>>>>> Mark Vigeant
>>>>> RiskMetrics Group, Inc.
>>>>>
> 
