hadoop-mapreduce-user mailing list archives

From Patai Sangbutsarakum <silvianhad...@gmail.com>
Subject Re: Jobtracker page hangs ..again.
Date Tue, 13 Aug 2013 08:25:02 GMT
Thanks, Harsh.
Appreciate your input, as always.

On Aug 12, 2013, at 20:01, Harsh J <harsh@cloudera.com> wrote:

> If you're not already doing it, run a local name caching daemon (such
> as nscd) on each cluster node. Hadoop does a lot of lookups, and
> a local cache goes a long way toward reducing the load on your DNS servers.
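> 
> A quick way to confirm whether lookups are actually being served from a local
> cache is to time repeated resolutions of the same hostname on the jobtracker
> host. A minimal, illustrative sketch (the class name and the hostname argument
> are placeholders, not anything from this thread):
> 
>     // DnsLatencyCheck.java - time repeated InetAddress lookups
>     import java.net.InetAddress;
> 
>     public class DnsLatencyCheck {
>         public static void main(String[] args) throws Exception {
>             String host = args.length > 0 ? args[0] : "localhost";
>             for (int i = 0; i < 5; i++) {
>                 long start = System.nanoTime();
>                 InetAddress addr = InetAddress.getByName(host);
>                 long micros = (System.nanoTime() - start) / 1000;
>                 // The first call goes out to the resolver (and nscd, if it is
>                 // running); later calls are usually served from the JVM cache.
>                 System.out.println(addr.getHostAddress() + " resolved in " + micros + " us");
>             }
>         }
>     }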
> 
> On Tue, Aug 13, 2013 at 3:09 AM, Patai Sangbutsarakum
> <silvianhadoop@gmail.com> wrote:
>> Update: after adjusting the network routing, DNS query speed is in microseconds,
>> as it is supposed to be. The issue is completely solved.
>> The jobtracker page doesn't hang anymore when launching a 100k-mapper job.
>> 
>> Cheers,
>> 
>> 
>> 
>> On Mon, Aug 12, 2013 at 1:29 PM, Patai Sangbutsarakum
>> <silvianhadoop@gmail.com> wrote:
>>> 
>>> Ok, after some sweat, I think I found the root cause, but I still need
>>> another team to help me fix it.
>>> It lies in the DNS. For each of the tip:task lines below, tcpdump shows a
>>> corresponding DNS query to the DNS server, and the timestamps match.
>>> 
>>> 2013-08-11 20:39:23,493 INFO org.apache.hadoop.mapred.JobInProgress:
>>> tip:task_201308111631_0006_m_000000 has split on node:/rack1/host1
>>> 
>>> 127 ms
>>> 
>>> 2013-08-11 20:39:23,620 INFO org.apache.hadoop.mapred.JobInProgress:
>>> tip:task_201308111631_0006_m_000000 has split on node:/rack1/host2
>>> 
>>> 126 ms
>>> 
>>> 2013-08-11 20:39:23,746 INFO org.apache.hadoop.mapred.JobInProgress:
>>> tip:task_201308111631_0006_m_000000 has split on node:/rack2/host3
>>> 
>>> 
>>> 20:39:23.367337 IP jtk.53110 > dns1.domain: 41717+ A? host1. (37)
>>> 
>>> 20:39:23.367345 IP jtk.53110 > dns1.domain: 7221+ AAAA? host1. (37)
>>> 
>>> 20:39:23.493486 IP dns1.domain > jtk.53110: 7221* 0/1/0 (89)
>>> 
>>> 20:39:23.493505 IP dns1.domain > jtk.53110: 41717* 1/4/2 A xx.xx.xx.xx (189)
>>> 
>>> 
>>> 20:39:23.493766 IP jtk.48042 > dns1.domain: 35450+ A? host2. (37)
>>> 
>>> 20:39:23.493774 IP jtk.48042 > dns1.domain: 56007+ AAAA? host2. (37)
>>> 
>>> 20:39:23.619903 IP dns1.domain > jtk.48042: 35450* 1/4/2 A yy.yy.yy.yy (189)
>>> 
>>> 20:39:23.619921 IP dns1.domain > jtk.48042: 56007* 0/1/0 (89)
>>> 
>>> 
>>> 20:39:23.620208 IP jtk.41237 > dns2.domain: 49511+ A? host3. (37)
>>> 
>>> 20:39:23.620215 IP jtk.41237 > dns2.domain: 29199+ AAAA? host3. (37)
>>> 
>>> 20:39:23.746358 IP dns2.domain > jtk.41237: 49511* 1/4/2 A zz.zz.zz.zz (189)
>>> 
>>> I looked at the jobtracker log in another datacenter when the same job was
>>> submitted. The timestamps between tip:task lines advance much, much faster.
>>> 
>>> The question this raises is whether job initialization really queries the
>>> DNS, and if so, whether there is any way to suppress that. The topology file
>>> is already in place, with both hostnames and IPs in it.
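>>> 
>>> One JVM-level knob that might be worth checking here, offered only as a
>>> guess and not something confirmed above, is the positive DNS cache
>>> controlled by the networkaddress.cache.ttl security property: if it is set
>>> very low, repeated lookups of the same hostname go back to the DNS server
>>> instead of being answered from memory. A minimal sketch of raising it
>>> programmatically (the 300-second value is only an example, and it has to
>>> run before the first lookup in the JVM):
>>> 
>>>     import java.net.InetAddress;
>>>     import java.security.Security;
>>> 
>>>     public class DnsCacheTtl {
>>>         public static void main(String[] args) throws Exception {
>>>             // Cache successful lookups for 300 seconds.
>>>             Security.setProperty("networkaddress.cache.ttl", "300");
>>>             // Cache failed lookups briefly as well.
>>>             Security.setProperty("networkaddress.cache.negative.ttl", "10");
>>>             // Subsequent resolutions of the same name within the TTL are
>>>             // served from the JVM's in-memory cache.
>>>             System.out.println(InetAddress.getByName("localhost"));
>>>         }
>>>     }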
>>> 
>>> 
>>> Hope this makes sense.
>>> 
>>> Patai
>>> 
>>> 
>>> 
>>> 
>>> On Fri, Aug 9, 2013 at 6:57 PM, Patai Sangbutsarakum
>>> <silvianhadoop@gmail.com> wrote:
>>>> 
>>>> Appreciate your input, Bryan. I will try to reproduce it and look at the
>>>> namenode log before, during, and after the pause.
>>>> Wish me luck
>>>> 
>>>> 
>>>> On Fri, Aug 9, 2013 at 2:09 PM, Bryan Beaudreault
>>>> <bbeaudreault@hubspot.com> wrote:
>>>>> 
>>>>> When I've had problems with a slow jobtracker, i've found the issue to
>>>>> be one of the following two (so far) possibilities:
>>>>> 
>>>>> - long GC pause (I'm guessing this is not it based on your email)
>>>>> - hdfs is slow
>>>>> 
>>>>> I haven't dug into the code yet, but circumstantially I've found that
>>>>> when you submit a job the jobtracker needs to put a bunch of files in hdfs,
>>>>> such as the job.xml, the job jar, etc. I'm not sure how this scales with
>>>>> larger and larger jobs, aside from the size of the splits serialization in
>>>>> the job.xml, but if your HDFS is slow for any reason it can cause pauses in
>>>>> your jobtracker. This affects other jobs being able to submit, as well as
>>>>> the 50030 web ui.
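>>>>> 
>>>>> A quick way to sanity-check HDFS latency from the jobtracker host is to time
>>>>> a small create/write/close cycle with the FileSystem API. This is only an
>>>>> illustrative sketch (the path is a placeholder, and the cluster's
>>>>> core-site.xml/hdfs-site.xml must be on the classpath):
>>>>> 
>>>>>     import org.apache.hadoop.conf.Configuration;
>>>>>     import org.apache.hadoop.fs.FSDataOutputStream;
>>>>>     import org.apache.hadoop.fs.FileSystem;
>>>>>     import org.apache.hadoop.fs.Path;
>>>>> 
>>>>>     public class HdfsWriteLatency {
>>>>>         public static void main(String[] args) throws Exception {
>>>>>             Configuration conf = new Configuration(); // loads the site configs
>>>>>             FileSystem fs = FileSystem.get(conf);
>>>>>             Path p = new Path("/tmp/jt-latency-probe");
>>>>>             long start = System.currentTimeMillis();
>>>>>             FSDataOutputStream out = fs.create(p, true); // overwrite if present
>>>>>             out.writeBytes("probe");
>>>>>             out.close();
>>>>>             long elapsed = System.currentTimeMillis() - start;
>>>>>             System.out.println("create+write+close took " + elapsed + " ms");
>>>>>             fs.delete(p, true);
>>>>>         }
>>>>>     }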
>>>>> 
>>>>> I'd take a look at your namenode logs.  When the jobtracker logs pause,
>>>>> do you see a corresponding pause in the namenode logs?  What gets spewed
>>>>> before and after that pause?
>>>>> 
>>>>> 
>>>>> On Fri, Aug 9, 2013 at 4:41 PM, Patai Sangbutsarakum
>>>>> <silvianhadoop@gmail.com> wrote:
>>>>>> 
>>>>>> A while back, I was fighting a hang of the jobtracker page: when I
>>>>>> browsed to http://jobtracker:50030, the browser wouldn't show the job info
>>>>>> as usual. That turned out to be caused by keeping too much job history in
>>>>>> the jobtracker.
>>>>>> 
>>>>>> Currently, I am setting up a new cluster with a 40 GB heap on the namenode
>>>>>> and jobtracker, each on a dedicated machine. The not-so-fun part starts here:
>>>>>> a developer tried to test out the cluster by launching a 76k-map job (the
>>>>>> cluster has around 6k-ish mappers).
>>>>>> Job initialization succeeded, and the job finished.
>>>>>> 
>>>>>> However, until the job actually started running, I couldn't access the
>>>>>> jobtracker page anymore, the same symptom as above.
>>>>>> 
>>>>>> I see a bunch of this in the jobtracker log:
>>>>>> 
>>>>>> 2013-08-08 00:23:00,509 INFO org.apache.hadoop.mapred.JobInProgress:
>>>>>> tip:task_201307291733_0619_m_076796 has split on node: /rack/node
>>>>>> ..
>>>>>> ..
>>>>>> ..
>>>>>> 
>>>>>> until I see this:
>>>>>> 
>>>>>> INFO org.apache.hadoop.mapred.JobInProgress: job_201307291733_0619 LOCALITY_WAIT_FACTOR=1.0
>>>>>> 2013-08-08 00:23:00,509 INFO org.apache.hadoop.mapred.JobInProgress: Job
>>>>>> job_201307291733_0619 initialized successfully with 76797 map tasks and
>>>>>> 10 reduce tasks.
>>>>>> 
>>>>>> That's when I can access the jobtracker page again.
>>>>>> 
>>>>>> 
>>>>>> CPU load on the jobtracker is very light, the jobtracker's heap is far from
>>>>>> full (roughly 1 GB used out of 40), and network bandwidth is nowhere near
>>>>>> saturated.
>>>>>> 
>>>>>> I'm running the 0.20.2 branch on CentOS 6.4 with Java(TM) SE Runtime
>>>>>> Environment (build 1.6.0_32-b05).
>>>>>> 
>>>>>> 
>>>>>> What root cause should I be looking at, or at least, where should I
>>>>>> start?
>>>>>> 
>>>>>> Thank you in advance.
> 
> 
> 
> -- 
> Harsh J
