From: "Fuad Efendi"
To: mapreduce-user@hadoop.apache.org
Subject: RE: Copyphase dies on timeout
Date: Wed, 9 Dec 2009 13:39:14 -0500

I had similar issues before. A reducer can do 'additional' processing
and depend on external services. For instance, in my case it sends
documents to SOLR (Lucene-based), so the reducer has a dependency on
Lucene, and Lucene sometimes needs 30-50 minutes to perform an index
merge. If your timeout setting is not large enough, the task attempt
will be killed. You can keep the attempt alive by reporting progress
every 60 seconds or so from a separate thread spawned by your reducer.
Look at mapred.task.timeout, and use reporter.progress(),
reporter.setStatus(), etc.

Fuad Efendi
+1 416-993-2060
http://www.tokenizer.ca
Data Mining, Vertical Search

> -----Original Message-----
> From: Simon Willnauer [mailto:simon.willnauer@googlemail.com]
> Sent: December-09-09 1:31 PM
> To: mapreduce-user@hadoop.apache.org
> Subject: Copyphase dies on timeout
>
> Hi folks,
>
> I have a question regarding a map-reduce job that gets killed during
> the copy phase due to a timeout. I don't completely understand the
> output, so I am first looking for some help understanding what is
> going on.
> There are about 50 mappers running and each mapper succeeds. The job
> is configured with a single reducer (I assume this could be closely
> related to that?!)
> So this is the output:
>
> 2009-12-09 15:01:45,103 INFO org.apache.hadoop.mapred.Merger: Merging 5 sorted segments
> 2009-12-09 15:01:45,103 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 5 segments left of total size: 216092233 bytes
> 2009-12-09 15:01:50,727 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200912081402_0042_r_000000_0 Merge of the 5 files in-memory complete. Local file is /data/hadoop-data/mapred/local/taskTracker/jobcache/job_200912081402_0042/attempt_200912081402_0042_r_000000_0/output/map_275.out of size 216092229
> 2009-12-09 15:01:50,728 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200912081402_0042_r_000000_0 Thread waiting: Thread for merging on-disk files
> 2009-12-09 15:01:58,109 INFO org.apache.hadoop.mapred.ReduceTask: Read 53281055 bytes from map-output for attempt_200912081402_0042_m_000253_0
> 2009-12-09 15:01:58,148 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1 from attempt_200912081402_0042_m_000253_0 -> (71, 103) from hadoop32
> 2009-12-09 15:01:58,273 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200912081402_0042_r_000000_0 Need another 44 map output(s) where 3 is already in progress
> 2009-12-09 15:01:58,273 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200912081402_0042_r_000000_0 Scheduled 0 outputs (0 slow hosts and 41 dup hosts)
> 2009-12-09 15:02:40,174 INFO org.apache.hadoop.mapred.ReduceTask: Read 81297732 bytes from map-output for attempt_200912081402_0042_m_000112_0
> 2009-12-09 15:02:40,175 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1 from attempt_200912081402_0042_m_000112_0 -> (129, 153) from hadoop31
> 2009-12-09 15:02:40,175 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200912081402_0042_r_000000_0 Scheduled 1 outputs (0 slow hosts and 16 dup hosts)
> 2009-12-09 15:02:40,223 INFO org.apache.hadoop.mapred.ReduceTask: Shuffling 82364774 bytes (82364778 raw bytes) into RAM from attempt_200912081402_0042_m_000121_0
> 2009-12-09 15:02:45,044 INFO org.apache.hadoop.mapred.ReduceTask: Read 82722679 bytes from map-output for attempt_200912081402_0042_m_000196_0
> 2009-12-09 15:02:45,044 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1 from attempt_200912081402_0042_m_000196_0 -> (114, 137) from hadoop21
> 2009-12-09 15:02:45,044 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200912081402_0042_r_000000_0 Scheduled 1 outputs (0 slow hosts and 32 dup hosts)
> 2009-12-09 15:02:45,143 INFO org.apache.hadoop.mapred.ReduceTask: Shuffling 82639206 bytes (82639210 raw bytes) into RAM from attempt_200912081402_0042_m_000195_0
> 2009-12-09 15:02:58,274 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200912081402_0042_r_000000_0 Need another 42 map output(s) where 3 is already in progress
> 2009-12-09 15:02:58,274 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200912081402_0042_r_000000_0 Scheduled 0 outputs (0 slow hosts and 39 dup hosts)
> 2009-12-09 15:03:58,275 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200912081402_0042_r_000000_0 Need another 42 map output(s) where 3 is already in progress
> 2009-12-09 15:03:58,275 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200912081402_0042_r_000000_0 Scheduled 0 outputs (0 slow hosts and 39 dup hosts)
> ......
>
> Task attempt_200912081402_0042_r_000000_0 failed to report status for
> 902 seconds. Killing!
>
> My first guess would be some IO problem, but I don't have any
> indicator of that in these logs. It would be great if somebody could
> clarify or give me a hint. Even if it is something like: use more
> reducers because of ... <-- this is the interesting part at the moment!
> btw: this is hadoop-0.20
>
> simon
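
PS: the keep-alive thread suggested above can be sketched as follows.
This is a minimal, hypothetical example: the progress call is
abstracted as a Runnable so the sketch compiles and runs outside
Hadoop; inside a real 0.20 reducer the Runnable body would call
reporter.progress() (old org.apache.hadoop.mapred API) instead.

```java
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Sketch: while the reducer blocks on a slow external call (e.g. a
 * Lucene index merge), a daemon thread keeps reporting progress so the
 * framework does not kill the attempt after mapred.task.timeout
 * (600000 ms by default). The progress call is passed in as a Runnable
 * so this runs without Hadoop on the classpath.
 */
public class ReduceHeartbeat {

    public static Thread start(final Runnable reportProgress, final long periodMillis) {
        Thread t = new Thread(new Runnable() {
            public void run() {
                try {
                    while (!Thread.currentThread().isInterrupted()) {
                        reportProgress.run();   // reporter.progress() in a real job
                        Thread.sleep(periodMillis);
                    }
                } catch (InterruptedException done) {
                    // the reducer finished its slow work; just exit
                }
            }
        }, "reduce-heartbeat");
        t.setDaemon(true);   // never keep the task JVM alive on its own
        t.start();
        return t;
    }

    public static void main(String[] args) throws Exception {
        final AtomicInteger pings = new AtomicInteger();
        // Stand-in for reporter.progress(): just count the heartbeats.
        Thread hb = start(new Runnable() {
            public void run() { pings.incrementAndGet(); }
        }, 200);
        Thread.sleep(700);   // stands in for the slow external call
        hb.interrupt();      // stop the heartbeat once the work is done
        hb.join();
        System.out.println("pings=" + pings.get());
    }
}
```

If a heartbeat is impractical, the timeout itself can be raised on the
JobConf, e.g. conf.setLong("mapred.task.timeout", 1800000) for 30
minutes (the value is in milliseconds).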