hadoop-mapreduce-user mailing list archives

From Tim Robertson <timrobertson...@gmail.com>
Subject Re: Help tuning a cluster - COPY slow
Date Thu, 18 Nov 2010 13:38:00 GMT
Just to close this thread.
Turns out it all came down to mapred.reduce.parallel.copies being overridden to 5 on
the Hive job submission. Cranking it back up and everything is happy again.

Thanks for the ideas,

Tim




On Thu, Nov 18, 2010 at 11:04 AM, Tim Robertson
<timrobertson100@gmail.com> wrote:
> Thanks again.
>
> We are getting closer to debugging this.  Our reference for all these
> tests was a simple GroupBy using Hive, but when I do a vanilla MR job
> on the tab file input to do the same group by, it flies through -
> almost exactly 2 times quicker.  Investigating further as it is not
> quite a fair test at the moment due to some config differences...
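>
> For reference, the vanilla job is roughly the shape sketched below (not our exact
> code; the class names and the column index for basis_of_record are illustrative):
>
>   import java.io.IOException;
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.hadoop.fs.Path;
>   import org.apache.hadoop.io.LongWritable;
>   import org.apache.hadoop.io.Text;
>   import org.apache.hadoop.mapreduce.Job;
>   import org.apache.hadoop.mapreduce.Mapper;
>   import org.apache.hadoop.mapreduce.Reducer;
>   import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
>   import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
>
>   public class GroupByCount {
>
>     // Emit (basis_of_record, 1) for every tab-separated input line.
>     public static class GroupMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
>       private static final LongWritable ONE = new LongWritable(1);
>       private final Text outKey = new Text();
>
>       @Override
>       protected void map(LongWritable offset, Text line, Context ctx)
>           throws IOException, InterruptedException {
>         String[] fields = line.toString().split("\t", -1);
>         outKey.set(fields[0]);   // assumed column position of basis_of_record
>         ctx.write(outKey, ONE);
>       }
>     }
>
>     // Sum the 1s per basis_of_record.
>     public static class CountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
>       @Override
>       protected void reduce(Text key, Iterable<LongWritable> counts, Context ctx)
>           throws IOException, InterruptedException {
>         long total = 0;
>         for (LongWritable c : counts) total += c.get();
>         ctx.write(key, new LongWritable(total));
>       }
>     }
>
>     public static void main(String[] args) throws Exception {
>       Job job = new Job(new Configuration(), "group-by-count");
>       job.setJarByClass(GroupByCount.class);
>       job.setMapperClass(GroupMapper.class);
>       job.setCombinerClass(CountReducer.class);  // combiner keeps the shuffle tiny
>       job.setReducerClass(CountReducer.class);
>       job.setOutputKeyClass(Text.class);
>       job.setOutputValueClass(LongWritable.class);
>       FileInputFormat.addInputPath(job, new Path(args[0]));
>       FileOutputFormat.setOutputPath(job, new Path(args[1]));
>       System.exit(job.waitForCompletion(true) ? 0 : 1);
>     }
>   }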
>
>
> On Thu, Nov 18, 2010 at 10:19 AM, Friso van Vollenhoven
> <fvanvollenhoven@xebia.com> wrote:
>> Do you have IPv6 enabled on the boxes? If DNS gives both IPv4 and IPv6 results for
>> lookups, Java will try v6 first and then fall back to v4, which is an additional connect
>> attempt. You can force Java to use only v4 by setting the system property
>> java.net.preferIPv4Stack=true.
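>>
>> (Concretely, something along these lines, adjusted to your layout: in hadoop-env.sh on each node
>>
>>   export HADOOP_OPTS="-Djava.net.preferIPv4Stack=true $HADOOP_OPTS"
>>
>> and add the same flag to mapred.child.java.opts in mapred-site.xml so the task JVMs pick it up,
>> then restart the daemons.)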
>>
>> Also, I am not sure whether Java does the same thing as nslookup when doing name
>> lookups (I believe it has its own cache as well, but correct me if I'm wrong).
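>>
>> (If you want to rule the JVM-level cache out, the knobs I know of are networkaddress.cache.ttl
>> in the JRE's lib/security/java.security file, or the sun.net.inetaddr.ttl system property, e.g.
>>
>>   networkaddress.cache.ttl=60   # seconds; in $JAVA_HOME/jre/lib/security/java.security
>>
>> but that is a guess at where the time might go, not something I have measured here.)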
>>
>> You could try running something like strace (with the -T option, which shows time
>> spent in system calls) to see whether network related system calls take a long time.
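>>
>> (For example, attach to one of the reduce task JVMs or the TaskTracker with something like
>>
>>   strace -T -f -e trace=network -p <pid>
>>
>> where <pid> is just a placeholder for whichever process you want to watch.)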
>>
>>
>>
>> Friso
>>
>>
>>
>>
>> On 17 nov 2010, at 22:20, Tim Robertson wrote:
>>
>>> I don't think so Aaron - but we use names not IPs in the config and on
>>> a node the following is instant:
>>>
>>> [root@c2n1 ~]# nslookup c1n1.gbif.org
>>> Server:               130.226.238.254
>>> Address:      130.226.238.254#53
>>>
>>> Non-authoritative answer:
>>> Name: c1n1.gbif.org
>>> Address: 130.226.238.171
>>>
>>> If I ssh onto an arbitrary machine in the cluster and pull a file
>>> using curl (e.g.
>>> http://c1n9.gbif.org:50075/streamFile?filename=%2Fuser%2Fhive%2Fwarehouse%2Feol_density2_4%2Fattempt_201011151423_0027_m_000000_0&delegation=null)
>>> it comes down at 110M/s with no delay on DNS lookup.
>>>
>>> Is there a better test I can do? I am not much of a network guy...
>>> Cheers,
>>> Tim
>>>
>>>
>>>
>>>
>>>
>>>> On Wed, Nov 17, 2010 at 10:08 PM, Aaron Kimball <akimball83@gmail.com> wrote:
>>>> Tim,
>>>> Are there issues with DNS caching (or lack thereof), misconfigured
>>>> /etc/hosts, or other network-config gotchas that might be preventing network
>>>> connections between hosts from opening efficiently?
>>>> - Aaron
>>>>
>>>> On Wed, Nov 17, 2010 at 12:50 PM, Tim Robertson <timrobertson100@gmail.com>
>>>> wrote:
>>>>>
>>>>> Thanks Friso,
>>>>>
>>>>> We've been trying to diagnose this all day and still haven't found a solution.
>>>>> We're running Cacti and IO wait is down at 0.5%, mappers and reducers are tuned
>>>>> right down to 1 map / 1 reduce slot per machine, and the machine CPUs are almost
>>>>> idle with no swap.
>>>>> Using curl to pull a file from a DN comes down at 110M/s.
>>>>>
>>>>> We are now upping things like epoll
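>>>>>
>>>>> (i.e. kernel tweaks along the lines of the following in /etc/sysctl.conf, applied with
>>>>> sysctl -p; the exact value is guesswork on our part:
>>>>>
>>>>>   fs.epoll.max_user_instances = 4096
>>>>>
>>>>> in case we're bumping into per-user epoll instance limits.)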
>>>>>
>>>>> Any ideas really greatly appreciated at this stage!
>>>>> Tim
>>>>>
>>>>>
>>>>> On Wed, Nov 17, 2010 at 10:20 AM, Friso van Vollenhoven
>>>>> <fvanvollenhoven@xebia.com> wrote:
>>>>>> Hi Tim,
>>>>>>
>>>>>> Getting 28K of map outputs to reducers should not take minutes. Reducers on a
>>>>>> properly set up (1Gb) network should be copying at multiple MB/s. I think you
>>>>>> need to get some more info.
>>>>>>
>>>>>> Apart from top, you'll probably also want to look at iostat and vmstat. The
>>>>>> first will tell you something about disk utilization and the latter can tell
>>>>>> you whether the machines are using swap or not. This is very important. If you
>>>>>> are over-utilizing physical memory on the machines, things will be slow.
>>>>>>
>>>>>> It's even better if you put something in place that allows you to get an
>>>>>> overall view of the resource usage across the cluster. Look at Ganglia
>>>>>> (http://ganglia.sourceforge.net/) or Cacti (http://www.cacti.net/) or
>>>>>> something similar.
>>>>>>
>>>>>> Basically a job is either CPU bound, IO bound or network bound. You need to be
>>>>>> able to look at all three to see what the bottleneck is. Also, you can run into
>>>>>> churn when you saturate resources and processes are competing for them (e.g.
>>>>>> when you have two disks and 50 processes / threads reading from them, things
>>>>>> will be slow because the OS needs to switch between them a lot and overall
>>>>>> throughput will be less than what the disks can do; you can see this when there
>>>>>> is a lot of time in iowait, but overall throughput is low, so there's a lot of
>>>>>> seeking going on).
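>>>>>>
>>>>>> (In practice that can be as simple as running something like
>>>>>>
>>>>>>   iostat -x 5
>>>>>>   vmstat 5
>>>>>>
>>>>>> on a couple of nodes while the copy phase runs, and watching %util and await on
>>>>>> the disks and the si/so columns for swap; just a suggestion for a starting point.)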
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 17 nov 2010, at 09:43, Tim Robertson wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> We have set up a small cluster (13 nodes) using CDH3.
>>>>>>
>>>>>> We have been tuning it using TeraSort and Hive queries on our data, and the
>>>>>> copy phase is very slow, so I'd like to ask if anyone can look over our config.
>>>>>>
>>>>>> We have an unbalanced set of machines (all on a single switch):
>>>>>> - 10 of Intel @ 2.83GHz Quad, 8GB, 2x500G 7.2K SATA (3 mappers, 2 reducers)
>>>>>> - 3 of Intel @ 2.53GHz Dual Quad, 24GB, 6x250GB 5.4K SATA (12 mappers, 12 reducers)
>>>>>>
>>>>>> We monitored the load using $top on the machines to settle on the number of
>>>>>> mappers and reducers without overloading them; the map() and reduce() phases
>>>>>> are working very nicely - all our time is going into the copy.
>>>>>>
>>>>>> The config:
>>>>>>
>>>>>> io.sort.mb=400
>>>>>> io.sort.factor=100
>>>>>> mapred.reduce.parallel.copies=20
>>>>>> tasktracker.http.threads=80
>>>>>> mapred.compress.map.output=true/false (no noticeable difference)
>>>>>> mapred.map.output.compression.codec=com.hadoop.compression.lzo.LzoCodec
>>>>>> mapred.output.compression.type=BLOCK
>>>>>> mapred.inmem.merge.threshold=0
>>>>>> mapred.job.reduce.input.buffer.percent=0.7
>>>>>> mapred.job.reuse.jvm.num.tasks=50
>>>>>>
>>>>>> An example job:
>>>>>> (select basis_of_record,count(1) from occurrence_record group by
>>>>>> basis_of_record)
>>>>>> Map input records: 262,573,931; the maps finished in 2mins30 using 833 mappers
>>>>>> Reduce was at 24% when the maps finished at 2mins30, with all 55 reducers running
>>>>>> Map output records: 1,855
>>>>>> Map output bytes: 28,724
>>>>>> REDUCE COPY PHASE finished after 7mins01 secs
>>>>>> Reduce finished after 7mins17secs
>>>>>>
>>>>>> Am I correct that 28,724 bytes emitted from the maps should not take 4mins30
>>>>>> to copy?
>>>>>>
>>>>>> We're running Puppet, so we can test changes quickly.
>>>>>>
>>>>>> Any pointers on how we can debug / improve this are greatly appreciated!
>>>>>> Tim
>>>>>>
>>>>>>
>>>>
>>>>
>>
>>
>
