hadoop-common-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Luke Lu <...@vicaya.com>
Subject Re: Hadoop cluster hangs on big hive job
Date Mon, 11 Mar 2013 21:44:22 GMT
You mean HDFS-4479?

The log seems to indicate the infamous jetty hang issue (MAPREDUCE-2386)
though.


On Mon, Mar 11, 2013 at 1:52 PM, Suresh Srinivas <suresh@hortonworks.com>wrote:

> I have seen one such problem related to big hive jobs that open a lot of
> files. See HDFS-4496 for more details. Snippet from the description:
> The following issue was observed in a cluster that was running a Hive job
> and was writing to 100,000 temporary files (each task is writing to 1000s
> of files). When this job is killed, a large number of files are left open
> for write. Eventually when the lease for open files expires, lease recovery
> is started for all these files in a very short duration of time. This
> causes a large number of commitBlockSynchronization where logSync is
> performed with the FSNamesystem lock held. This overloads the namenode
> resulting in slowdown.
>
> Could this be the cause? Can you see namenode log to see if you have lease
> recovery activity? If not, can you send some information about what is
> happening in the namenode logs at the time of this slowdown?
>
>
>
> On Mon, Mar 11, 2013 at 1:32 PM, Daning Wang <daning@netseer.com> wrote:
>
>> [hive@mr3-033 ~]$ hadoop version
>> Hadoop 1.0.4
>> Subversion
>> https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r
>> 1393290
>> Compiled by hortonfo on Wed Oct  3 05:13:58 UTC 2012
>>
>>
>> On Sun, Mar 10, 2013 at 8:16 AM, Suresh Srinivas <suresh@hortonworks.com>wrote:
>>
>>> What is the version of hadoop?
>>>
>>> Sent from phone
>>>
>>> On Mar 7, 2013, at 11:53 AM, Daning Wang <daning@netseer.com> wrote:
>>>
>>> We have hive query processing zipped csv files. the query was scanning
>>> for 10 days(partitioned by date). data for each day around 130G. The
>>> problem is not consistent since if you run it again, it might go through.
>>> but the problem has never happened on the smaller jobs(like processing only
>>> one days data).
>>>
>>> We don't have space issue.
>>>
>>> I have attached log file when problem happening. it is stuck like
>>> following(just search "19706 of 49964")
>>>
>>> 2013-03-05 15:13:51,587 INFO org.apache.hadoop.mapred.TaskTracker:
>>> attempt_201302270947_0010_r_000019_0 0.131468% reduce > copy (19706 of
>>> 49964 at 0.00 MB/s) >
>>> 2013-03-05 15:13:51,811 INFO org.apache.hadoop.mapred.TaskTracker:
>>> attempt_201302270947_0010_r_000039_0 0.131468% reduce > copy (19706 of
>>> 49964 at 0.00 MB/s) >
>>> 2013-03-05 15:13:52,551 INFO org.apache.hadoop.mapred.TaskTracker:
>>> attempt_201302270947_0010_r_000032_0 0.131468% reduce > copy (19706 of
>>> 49964 at 0.00 MB/s) >
>>> 2013-03-05 15:13:52,760 INFO org.apache.hadoop.mapred.TaskTracker:
>>> attempt_201302270947_0010_r_000000_0 0.131468% reduce > copy (19706 of
>>> 49964 at 0.00 MB/s) >
>>> 2013-03-05 15:13:52,946 INFO org.apache.hadoop.mapred.TaskTracker:
>>> attempt_201302270947_0010_r_000024_0 0.131468% reduce > copy (19706 of
>>> 49964 at 0.00 MB/s) >
>>> 2013-03-05 15:13:54,742 INFO org.apache.hadoop.mapred.TaskTracker:
>>> attempt_201302270947_0010_r_000008_0 0.131468% reduce > copy (19706 of
>>> 49964 at 0.00 MB/s) >
>>>
>>> Thanks,
>>>
>>> Daning
>>>
>>>
>>> On Thu, Mar 7, 2013 at 12:21 AM, Håvard Wahl Kongsgård <
>>> haavard.kongsgaard@gmail.com> wrote:
>>>
>>>> hadoop logs?
>>>> On 6. mars 2013 21:04, "Daning Wang" <daning@netseer.com> wrote:
>>>>
>>>>> We have 5 nodes cluster(Hadoop 1.0.4), It hung a couple of times while
>>>>> running big jobs. Basically all the nodes are dead, from that
>>>>> trasktracker's log looks it went into some kinds of loop forever.
>>>>>
>>>>> All the log entries like this when problem happened.
>>>>>
>>>>> Any idea how to debug the issue?
>>>>>
>>>>> Thanks in advance.
>>>>>
>>>>>
>>>>> 2013-03-05 15:13:19,526 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> attempt_201302270947_0010_r_000012_0 0.131468% reduce > copy (19706
of
>>>>> 49964 at 0.00 MB/s) >
>>>>> 2013-03-05 15:13:19,552 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> attempt_201302270947_0010_r_000028_0 0.131468% reduce > copy (19706
of
>>>>> 49964 at 0.00 MB/s) >
>>>>> 2013-03-05 15:13:20,858 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> attempt_201302270947_0010_r_000036_0 0.131468% reduce > copy (19706
of
>>>>> 49964 at 0.00 MB/s) >
>>>>> 2013-03-05 15:13:21,141 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> attempt_201302270947_0010_r_000016_0 0.131468% reduce > copy (19706
of
>>>>> 49964 at 0.00 MB/s) >
>>>>> 2013-03-05 15:13:21,486 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> attempt_201302270947_0010_r_000019_0 0.131468% reduce > copy (19706
of
>>>>> 49964 at 0.00 MB/s) >
>>>>> 2013-03-05 15:13:21,692 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> attempt_201302270947_0010_r_000039_0 0.131468% reduce > copy (19706
of
>>>>> 49964 at 0.00 MB/s) >
>>>>> 2013-03-05 15:13:22,448 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> attempt_201302270947_0010_r_000032_0 0.131468% reduce > copy (19706
of
>>>>> 49964 at 0.00 MB/s) >
>>>>> 2013-03-05 15:13:22,643 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> attempt_201302270947_0010_r_000000_0 0.131468% reduce > copy (19706
of
>>>>> 49964 at 0.00 MB/s) >
>>>>> 2013-03-05 15:13:22,840 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> attempt_201302270947_0010_r_000024_0 0.131468% reduce > copy (19706
of
>>>>> 49964 at 0.00 MB/s) >
>>>>> 2013-03-05 15:13:24,628 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> attempt_201302270947_0010_r_000008_0 0.131468% reduce > copy (19706
of
>>>>> 49964 at 0.00 MB/s) >
>>>>> 2013-03-05 15:13:24,723 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> attempt_201302270947_0010_r_000039_0 0.131468% reduce > copy (19706
of
>>>>> 49964 at 0.00 MB/s) >
>>>>> 2013-03-05 15:13:25,336 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> attempt_201302270947_0010_r_000004_0 0.131468% reduce > copy (19706
of
>>>>> 49964 at 0.00 MB/s) >
>>>>> 2013-03-05 15:13:25,539 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> attempt_201302270947_0010_r_000043_0 0.131468% reduce > copy (19706
of
>>>>> 49964 at 0.00 MB/s) >
>>>>> 2013-03-05 15:13:25,545 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> attempt_201302270947_0010_r_000012_0 0.131468% reduce > copy (19706
of
>>>>> 49964 at 0.00 MB/s) >
>>>>> 2013-03-05 15:13:25,569 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> attempt_201302270947_0010_r_000028_0 0.131468% reduce > copy (19706
of
>>>>> 49964 at 0.00 MB/s) >
>>>>> 2013-03-05 15:13:25,855 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> attempt_201302270947_0010_r_000024_0 0.131468% reduce > copy (19706
of
>>>>> 49964 at 0.00 MB/s) >
>>>>> 2013-03-05 15:13:26,876 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> attempt_201302270947_0010_r_000036_0 0.131468% reduce > copy (19706
of
>>>>> 49964 at 0.00 MB/s) >
>>>>> 2013-03-05 15:13:27,159 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> attempt_201302270947_0010_r_000016_0 0.131468% reduce > copy (19706
of
>>>>> 49964 at 0.00 MB/s) >
>>>>> 2013-03-05 15:13:27,505 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> attempt_201302270947_0010_r_000019_0 0.131468% reduce > copy (19706
of
>>>>> 49964 at 0.00 MB/s) >
>>>>> 2013-03-05 15:13:28,464 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> attempt_201302270947_0010_r_000032_0 0.131468% reduce > copy (19706
of
>>>>> 49964 at 0.00 MB/s) >
>>>>> 2013-03-05 15:13:28,553 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> attempt_201302270947_0010_r_000043_0 0.131468% reduce > copy (19706
of
>>>>> 49964 at 0.00 MB/s) >
>>>>> 2013-03-05 15:13:28,561 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> attempt_201302270947_0010_r_000012_0 0.131468% reduce > copy (19706
of
>>>>> 49964 at 0.00 MB/s) >
>>>>> 2013-03-05 15:13:28,659 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> attempt_201302270947_0010_r_000000_0 0.131468% reduce > copy (19706
of
>>>>> 49964 at 0.00 MB/s) >
>>>>> 2013-03-05 15:13:30,519 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> attempt_201302270947_0010_r_000019_0 0.131468% reduce > copy (19706
of
>>>>> 49964 at 0.00 MB/s) >
>>>>> 2013-03-05 15:13:30,644 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> attempt_201302270947_0010_r_000008_0 0.131468% reduce > copy (19706
of
>>>>> 49964 at 0.00 MB/s) >
>>>>> 2013-03-05 15:13:30,741 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> attempt_201302270947_0010_r_000039_0 0.131468% reduce > copy (19706
of
>>>>> 49964 at 0.00 MB/s) >
>>>>> 2013-03-05 15:13:31,369 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> attempt_201302270947_0010_r_000004_0 0.131468% reduce > copy (19706
of
>>>>> 49964 at 0.00 MB/s) >
>>>>> 2013-03-05 15:13:31,675 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> attempt_201302270947_0010_r_000000_0 0.131468% reduce > copy (19706
of
>>>>> 49964 at 0.00 MB/s) >
>>>>> 2013-03-05 15:13:31,875 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> attempt_201302270947_0010_r_000024_0 0.131468% reduce > copy (19706
of
>>>>> 49964 at 0.00 MB/s) >
>>>>> 2013-03-05 15:13:32,372 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> attempt_201302270947_0010_r_000028_0 0.131468% reduce > copy (19706
of
>>>>> 49964 at 0.00 MB/s) >
>>>>> 2013-03-05 15:13:32,893 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> attempt_201302270947_0010_r_000036_0 0.131468% reduce > copy (19706
of
>>>>> 49964 at 0.00 MB/s) >
>>>>>
>>>>>
>>> <hadoop-hadoop3-tasktracker.log.gz>
>>>
>>>
>>
>
>
> --
> http://hortonworks.com/download/
>

Mime
View raw message