hadoop-mapreduce-user mailing list archives

From bejoy.had...@gmail.com
Subject Re: I am running MapReduce on a 30G data on 1master/2 slave, but failed.
Date Fri, 11 Jan 2013 06:51:01 GMT
Hi

To add on to Harsh's comments.

You need not change the task timeout.

In your map/reduce code, you can increment a counter or report status at intervals,
so that there is communication from the task and hence no task timeout.
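
For example, a minimal sketch with the org.apache.hadoop.mapreduce API (the class name and counter group/name are just illustrative):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class GpsReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text uid, Iterable<Text> points, Context context)
      throws IOException, InterruptedException {
    long seen = 0;
    for (Text point : points) {
      // ... per-record work ...
      seen++;
      if (seen % 10000 == 0) {
        // Either call counts as liveness and resets the timeout clock.
        context.progress();
        context.getCounter("MyApp", "ReduceRecordsSeen").increment(10000);
      }
    }
  }
}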

Every map and reduce task runs in its own JVM, limited by the configured JVM heap size. If you try to hold
too much data in memory, it can exceed that limit and cause OOM errors.
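
If the reducer genuinely needs more memory, the per-task heap can be raised via mapred.child.java.opts; a sketch, with 1024m as an example value:

import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
// Each child task JVM is launched with these options; raise -Xmx only if
// tasks genuinely need the heap (the Hadoop 1.x default is -Xmx200m).
conf.set("mapred.child.java.opts", "-Xmx1024m");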


Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-----Original Message-----
From: yaotian <yaotian@gmail.com>
Date: Fri, 11 Jan 2013 14:35:07 
To: <user@hadoop.apache.org>
Reply-To: user@hadoop.apache.org
Subject: Re: I am running MapReduce on a 30G data on 1master/2 slave, but failed.

See inline.


2013/1/11 Harsh J <harsh@cloudera.com>

> If the per-record processing time is very high, you will need to
> periodically report a status. Without a status change report from the task
> to the tracker, it will be killed away as a dead task after a default
> timeout of 10 minutes (600s).
>
=====================> Do you mean to increase the report timeout,
"mapred.task.timeout"?
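
(For reference, that property is in milliseconds; a sketch of raising it on the job configuration, though reporting progress periodically, as Bejoy notes above, is the usual fix:)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
// Milliseconds; the default 600000 (10 min) matches the
// "failed to report status for 600 seconds" errors below.
conf.setLong("mapred.task.timeout", 1200000L); // e.g. 20 minutes
Job job = new Job(conf, "gps-per-uid"); // job name is illustrative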


> Also, beware of holding too much memory in a reduce JVM - you're still
> limited there. Best to let the framework do the sort or secondary sort.
>
=======================>  You mean use the default value? This is my value:
mapred.job.reduce.memory.mb = -1
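
For example, a reducer that avoids the buffering entirely by writing each value as it arrives, rather than collecting them into a list or StringBuilder; a minimal sketch, assuming Text keys and values (names illustrative):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class StreamingGpsReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text uid, Iterable<Text> points, Context context)
      throws IOException, InterruptedException {
    // One output record per value; nothing accumulates on the heap,
    // so memory use stays flat no matter how many points a uid has.
    for (Text point : points) {
      context.write(uid, point);
    }
  }
}

The output is then one (uid, point) pair per record rather than one long line per uid.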

>
>
> On Fri, Jan 11, 2013 at 10:58 AM, yaotian <yaotian@gmail.com> wrote:
>
>> Yes, you are right. The data is GPS traces keyed by the corresponding uid.
>> The reduce is doing this: sort per user to get this kind of result: uid, gps1,
>> gps2, gps3, ...
>> Yes, the GPS data is big, because this is 30G of data.
>>
>> How to solve this?
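
A sketch of the secondary-sort route Harsh suggests, assuming the map emits composite keys of the form "uid<TAB>timestamp" (all class names illustrative; each public class goes in its own file):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

// Route all composite keys with the same uid to the same reducer.
public class UidPartitioner extends Partitioner<Text, Text> {
  @Override
  public int getPartition(Text key, Text value, int numPartitions) {
    String uid = key.toString().split("\t", 2)[0];
    return (uid.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}

// Group all composite keys with the same uid into one reduce() call,
// regardless of timestamp.
public class UidGroupingComparator extends WritableComparator {
  protected UidGroupingComparator() {
    super(Text.class, true); // instantiate Text keys for comparison
  }
  @Override
  public int compare(WritableComparable a, WritableComparable b) {
    String uidA = a.toString().split("\t", 2)[0];
    String uidB = b.toString().split("\t", 2)[0];
    return uidA.compareTo(uidB);
  }
}

// Driver wiring:
//   job.setPartitionerClass(UidPartitioner.class);
//   job.setGroupingComparatorClass(UidGroupingComparator.class);
// (Timestamps should be zero-padded so lexicographic order matches
// chronological order.) The framework then delivers each uid's points
// already sorted, and the reducer never has to hold them in memory.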
>>
>>
>>
>> 2013/1/11 Mahesh Balija <balijamahesh.mca@gmail.com>
>>
>>> Hi,
>>>
>>>           2 reducers completed successfully and 1498 have been
>>> killed. I assume that you have data issues. (Either the data is huge, or
>>> there are some issues with the data you are trying to process.)
>>>           One possibility could be that you have many values associated with a
>>> single key, which can cause this kind of issue depending on the operation you
>>> do in your reducer.
>>>           Can you put some logs in your reducer and try to trace
>>> what is happening?
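
For example, a minimal tracing sketch that counts values per key, so a skewed uid shows up in the task attempt logs (names illustrative):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class TracingGpsReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text uid, Iterable<Text> points, Context context)
      throws IOException, InterruptedException {
    long n = 0;
    for (Text point : points) {
      n++;
      // ... normal processing ...
    }
    // Appears in the task attempt's stderr log in the jobtracker UI;
    // a uid with millions of values is the likely OOM/timeout culprit.
    System.err.println("uid=" + uid + " values=" + n);
  }
}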
>>>
>>> Best,
>>> Mahesh Balija,
>>> Calsoft Labs.
>>>
>>>
>>> On Fri, Jan 11, 2013 at 8:53 AM, yaotian <yaotian@gmail.com> wrote:
>>>
>>>> I have 1 Hadoop master, where the namenode runs, and 2 slaves, where the
>>>> datanodes run.
>>>>
>>>> If I choose a small dataset, like 200M, the job completes.
>>>>
>>>> But if I run on 30G of data, the map completes but the reduce reports
>>>> errors. Any suggestion?
>>>>
>>>>
>>>> This is the information.
>>>>
>>>> Black-listed TaskTrackers: 1
>>>>
>>>> Kind    % Complete  Num Tasks  Pending  Running  Complete  Killed  Failed/Killed Task Attempts
>>>> map     100.00%     450        0        0        450       0       0 / 1
>>>> reduce  100.00%     1500       0        0        2         1498    12 / 3
>>>>
>>>>
>>>> Task: task_201301090834_0041_r_000001 (0.00% complete)
>>>> Started: 10-Jan-2013 04:18:54, finished: 10-Jan-2013 06:46:38 (2hrs, 27mins, 44sec)
>>>> Errors:
>>>>   Task attempt_201301090834_0041_r_000001_0 failed to report status for 600 seconds. Killing!
>>>>   Task attempt_201301090834_0041_r_000001_1 failed to report status for 602 seconds. Killing!
>>>>   Task attempt_201301090834_0041_r_000001_2 failed to report status for 602 seconds. Killing!
>>>>   Task attempt_201301090834_0041_r_000001_3 failed to report status for 602 seconds. Killing!
>>>> Counters: 0
>>>>
>>>> Task: task_201301090834_0041_r_000002 (0.00% complete)
>>>> Started: 10-Jan-2013 04:18:54, finished: 10-Jan-2013 06:46:38 (2hrs, 27mins, 43sec)
>>>> Errors:
>>>>   Task attempt_201301090834_0041_r_000002_0 failed to report status for 601 seconds. Killing!
>>>>   Task attempt_201301090834_0041_r_000002_1 failed to report status for 600 seconds. Killing!
>>>> Counters: 0
>>>>
>>>> Task: task_201301090834_0041_r_000003 (0.00% complete)
>>>> Started: 10-Jan-2013 04:18:57, finished: 10-Jan-2013 06:46:38 (2hrs, 27mins, 41sec)
>>>> Errors:
>>>>   Task attempt_201301090834_0041_r_000003_0 failed to report status for 602 seconds. Killing!
>>>>   Task attempt_201301090834_0041_r_000003_1 failed to report status for 602 seconds. Killing!
>>>>   Task attempt_201301090834_0041_r_000003_2 failed to report status for 602 seconds. Killing!
>>>> Counters: 0
>>>>
>>>> Task: task_201301090834_0041_r_000005 (0.00% complete)
>>>> Started: 10-Jan-2013 06:11:07, finished: 10-Jan-2013 06:46:38 (35mins, 31sec)
>>>> Errors:
>>>>   Task attempt_201301090834_0041_r_000005_0 failed to report status for 600 seconds. Killing!
>>>> Counters: 0
>>>>
>>>
>>>
>>
>
>
> --
> Harsh J
>
