hadoop-common-dev mailing list archives

From "Mike Smith" <mike.smith....@gmail.com>
Subject Re: some reducers stock in copying stage
Date Wed, 28 Feb 2007 22:13:31 GMT
Devaraj,

After applying patch 1043, the copying problem is solved. However, I am now
getting new exceptions, although the affected tasks finish after being
reassigned to another tasktracker, so the job gets done eventually. I never saw
this exception before applying this patch (or could it be caused by changing
the back-off time to 5 sec?):

java.lang.NullPointerException
    at org.apache.hadoop.fs.FSDataInputStream$Buffer.seek(FSDataInputStream.java:74)
    at org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:121)
    at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.readBuffer(ChecksumFileSystem.java:217)
    at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.read(ChecksumFileSystem.java:163)
    at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:41)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
    at java.io.DataInputStream.readFully(DataInputStream.java:178)
    at java.io.DataInputStream.readFully(DataInputStream.java:152)
    at org.apache.hadoop.io.SequenceFile$UncompressedBytes.reset(SequenceFile.java:427)
    at org.apache.hadoop.io.SequenceFile$UncompressedBytes.access$700(SequenceFile.java:414)
    at org.apache.hadoop.io.SequenceFile$Reader.nextRawValue(SequenceFile.java:1665)
    at org.apache.hadoop.io.SequenceFile$Sorter$SegmentDescriptor.nextRawValue(SequenceFile.java:2579)
    at org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue.next(SequenceFile.java:2351)
    at org.apache.hadoop.io.SequenceFile$Sorter.writeFile(SequenceFile.java:2226)
    at org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue.merge(SequenceFile.java:2442)
    at org.apache.hadoop.io.SequenceFile$Sorter.merge(SequenceFile.java:2164)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:270)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1444)

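
For readers following the thread, the per-host back-off that Devaraj describes in the quoted reply below can be illustrated with a small sketch. This is not Hadoop's actual code: the class BackoffSketch, its methods, and the fixed 65-second penalty are hypothetical names and values, chosen only to mirror the "1 min, 5 secs" upper bound mentioned in that reply. It assumes a host that failed a fetch is skipped until its penalty expires, which is one way to end up with "Scheduled 0 of 24 known outputs (24 slow hosts ...)".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of a per-host fetch back-off; not taken from Hadoop. */
public class BackoffSketch {
    // host -> timestamp (ms) before which the host is treated as "slow"
    private final Map<String, Long> penaltyUntil = new HashMap<>();
    private final long backoffMillis;

    public BackoffSketch(long backoffMillis) {
        this.backoffMillis = backoffMillis;
    }

    /** Record a failed fetch: the host is penalized for backoffMillis. */
    public void recordFailure(String host, long nowMillis) {
        penaltyUntil.put(host, nowMillis + backoffMillis);
    }

    /** A host is "slow" while its penalty has not yet expired. */
    public boolean isSlow(String host, long nowMillis) {
        Long until = penaltyUntil.get(host);
        return until != null && nowMillis < until;
    }

    /** Return the hosts that may be fetched from right now. */
    public List<String> schedule(List<String> knownHosts, long nowMillis) {
        List<String> scheduled = new ArrayList<>();
        for (String host : knownHosts) {
            if (!isSlow(host, nowMillis)) {
                scheduled.add(host);
            }
        }
        return scheduled;
    }

    public static void main(String[] args) {
        // Assumed penalty of 65 s; how Hadoop really derives it is not shown here.
        BackoffSketch sketch = new BackoffSketch(65_000L);
        List<String> hosts = Arrays.asList("hostA", "hostB");
        sketch.recordFailure("hostA", 0L);
        sketch.recordFailure("hostB", 0L);

        // 10 s after the failures both hosts are still penalized.
        System.out.println("Scheduled " + sketch.schedule(hosts, 10_000L).size()
                + " of " + hosts.size()); // prints "Scheduled 0 of 2"
        // 70 s after the failures the penalties have expired.
        System.out.println("Scheduled " + sketch.schedule(hosts, 70_000L).size()
                + " of " + hosts.size()); // prints "Scheduled 2 of 2"
    }
}
```

In this model a reducer whose remaining map outputs all live on penalized hosts schedules nothing until the penalties lapse, so a shorter back-off only shortens the stall; it does not explain a task that stays stuck for hours.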



On 2/28/07, Mike Smith <mike.smith.dev@gmail.com> wrote:
>
> Thanks Devaraj, patch 1042 seems to be already committed. Also, the system
> never recovered, even after 1 min or 300 sec; it stayed stuck there for hours.
> I will try patch 1043 and also decrease the back-off time to see if those help.
>
>
> Mike
>
>
>  On 2/27/07, Devaraj Das <ddas@yahoo-inc.com> wrote:
> >
> > Mike,
> > The patches for h-1042 and h-1043 should address your situation better.
> > They have not been committed as yet. Please apply the patches manually and
> > see whether the situation improves.
> > Thanks,
> > Devaraj.
> >
> > > -----Original Message-----
> > > From: Devaraj Das [mailto:ddas@yahoo-inc.com]
> > > Sent: Wednesday, February 28, 2007 11:54 AM
> > > To: 'hadoop-dev@lucene.apache.org'
> > > Subject: RE: some reducers stock in copying stage
> > >
> > > > Looks like all hosts (from which map outputs haven't yet been fetched)
> > > > are classified as being "slow". That is because there were failures
> > > > earlier while fetching outputs from those. When failures happen (maybe
> > > > due to insufficient jetty server threads), there is a back-off for that
> > > > host and until the time in the back-off expires the outputs won't be
> > > > fetched from that particular host. The system should recover from this
> > > > though. Another thing you might want to try is to reduce the value of
> > > > the mapred.reduce.copy.backoff to a value like 5 (the number of seconds,
> > > > by default it is 300 seconds). This will ensure that the back-off is
> > > > always less than or equal to 1 min, 5 secs (1 min is the minimum
> > > > hardcoded backoff).
> > >
> > > > -----Original Message-----
> > > > From: Mike Smith [mailto:mike.smith.dev@gmail.com]
> > > > Sent: Wednesday, February 28, 2007 8:45 AM
> > > > To: hadoop-dev@lucene.apache.org
> > > > Subject: some reducers stock in copying stage
> > > >
> > > > > After updating the hadoop trunk today, I am having a problem in the
> > > > > reduce phase. Some of the reducers get stuck in the copying stage (at
> > > > > the very end of copying) and keep reporting the same status; even when
> > > > > I kill the related tasktracker, the jobtracker still reports the
> > > > > copying. Here is the log:
> > > >
> > > > > 2007-02-27 22:08:26,388 INFO org.apache.hadoop.mapred.TaskRunner: task_0001_r_000224_0 Got 24 known map output location(s); scheduling...
> > > > > 2007-02-27 22:08:26,388 INFO org.apache.hadoop.mapred.TaskRunner: task_0001_r_000224_0 Scheduled 0 of 24 known outputs (24 slow hosts and 0 dup hosts)
> > > > > 2007-02-27 22:08:27,204 INFO org.apache.hadoop.mapred.TaskTracker: task_0001_r_000224_0 0.33083335% reduce > copy (3176 of 3200 at 1.94 MB/s) >
> > > > > 2007-02-27 22:08:27,204 INFO org.apache.hadoop.mapred.TaskTracker: task_0001_r_000111_0 0.3321875% reduce > copy (3189 of 3200 at 0.40 MB/s) >
> > > > > 2007-02-27 22:08:28,214 INFO org.apache.hadoop.mapred.TaskTracker: task_0001_r_000224_0 0.33083335% reduce > copy (3176 of 3200 at 1.94 MB/s) >
> > > > > 2007-02-27 22:08:28,214 INFO org.apache.hadoop.mapred.TaskTracker: task_0001_r_000111_0 0.3321875% reduce > copy (3189 of 3200 at 0.40 MB/s) >
> > > > > 2007-02-27 22:08:29,224 INFO org.apache.hadoop.mapred.TaskTracker: task_0001_r_000224_0 0.33083335% reduce > copy (3176 of 3200 at 1.94 MB/s) >
> > > > > 2007-02-27 22:08:29,224 INFO org.apache.hadoop.mapred.TaskTracker: task_0001_r_000111_0 0.3321875% reduce > copy (3189 of 3200 at 0.40 MB/s) >
> > > > > 2007-02-27 22:08:30,114 INFO org.apache.hadoop.mapred.TaskRunner: task_0001_r_000111_0 Need 11 map output(s)
> > > > > 2007-02-27 22:08:30,114 INFO org.apache.hadoop.mapred.TaskRunner: task_0001_r_000111_0 Need 234 map output location(s)
> > > > > 2007-02-27 22:08:30,116 INFO org.apache.hadoop.mapred.TaskRunner: task_0001_r_000111_0 Got 0 new map outputs from jobtracker and 0 map outputs from previous failures
> > > > > 2007-02-27 22:08:30,116 INFO org.apache.hadoop.mapred.TaskRunner: task_0001_r_000111_0 Got 11 known map output location(s); scheduling...
> > > > > 2007-02-27 22:08:30,116 INFO org.apache.hadoop.mapred.TaskRunner: task_0001_r_000111_0 Scheduled 0 of 11 known outputs (11 slow hosts and 0 dup hosts)
> > > > > 2007-02-27 22:08:30,234 INFO org.apache.hadoop.mapred.TaskTracker: task_0001_r_000224_0 0.33083335% reduce > copy (3176 of 3200 at 1.94 MB/s) >
> > > > > 2007-02-27 22:08:30,234 INFO org.apache.hadoop.mapred.TaskTracker: task_0001_r_000111_0 0.3321875% reduce > copy (3189 of 3200 at 0.40 MB/s) >
> > > > > 2007-02-27 22:08:31,244 INFO org.apache.hadoop.mapred.TaskTracker: task_0001_r_000224_0 0.33083335% reduce > copy (3176 of 3200 at 1.94 MB/s) >
> > > > > 2007-02-27 22:08:31,244 INFO org.apache.hadoop.mapred.TaskTracker: task_0001_r_000111_0 0.3321875% reduce > copy (3189 of 3200 at 0.40 MB/s) >
> > > > > 2007-02-27 22:08:31,394 INFO org.apache.hadoop.mapred.TaskRunner: task_0001_r_000224_0 Need 24 map output(s)
> > > > > 2007-02-27 22:08:31,394 INFO org.apache.hadoop.mapred.TaskRunner: task_0001_r_000224_0 Need 133 map output location(s)
> > > > > 2007-02-27 22:08:31,395 INFO org.apache.hadoop.mapred.TaskRunner: task_0001_r_000224_0 Got 0 new map outputs from jobtracker and 0 map outputs from previous failures
> > > > > 2007-02-27 22:08:31,395 INFO org.apache.hadoop.mapred.TaskRunner: task_0001_r_000224_0 Got 24 known map output location(s); scheduling...
> > > > > 2007-02-27 22:08:31,395 INFO org.apache.hadoop.mapred.TaskRunner: task_0001_r_000224_0 Scheduled 0 of 24 known outputs (24 slow hosts and 0 dup hosts)
> > > > > 2007-02-27 22:08:32,254 INFO org.apache.hadoop.mapred.TaskTracker: task_0001_r_000224_0 0.33083335% reduce > copy (3176 of 3200 at 1.94 MB/s) >
> > > > > 2007-02-27 22:08:32,254 INFO org.apache.hadoop.mapred.TaskTracker: task_0001_r_000111_0 0.3321875% reduce > copy (3189 of 3200 at 0.40 MB/s) >
> > > > > 2007-02-27 22:08:33,264 INFO org.apache.hadoop.mapred.TaskTracker: task_0001_r_000224_0 0.33083335% reduce > copy (3176 of 3200 at 1.94 MB/s) >
> > > > > 2007-02-27 22:08:33,264 INFO org.apache.hadoop.mapred.TaskTracker: task_0001_r_000111_0 0.3321875% reduce > copy (3189 of 3200 at 0.40 MB/s) >
> > > > > 2007-02-27 22:08:34,274 INFO org.apache.hadoop.mapred.TaskTracker: task_0001_r_000224_0 0.33083335% reduce > copy (3176 of 3200 at 1.94 MB/s) >
> > > > > 2007-02-27 22:08:34,274 INFO org.apache.hadoop.mapred.TaskTracker: task_0001_r_000111_0 0.3321875% reduce > copy (3189 of 3200 at 0.40 MB/s) >
> > > > > 2007-02-27 22:08:35,124 INFO org.apache.hadoop.mapred.TaskRunner: task_0001_r_000111_0 Need 11 map output(s)
> > > > > 2007-02-27 22:08:35,124 INFO org.apache.hadoop.mapred.TaskRunner: task_0001_r_000111_0 Need 234 map output location(s)
> > > > > 2007-02-27 22:08:35,219 INFO org.apache.hadoop.mapred.TaskRunner: task_0001_r_000111_0 Got 0 new map outputs from jobtracker and 0 map outputs from previous failures
> > > > > 2007-02-27 22:08:35,219 INFO org.apache.hadoop.mapred.TaskRunner: task_0001_r_000111_0 Got 11 known map output location(s); scheduling...
> > > > > 2007-02-27 22:08:35,219 INFO org.apache.hadoop.mapred.TaskRunner: task_0001_r_000111_0 Scheduled 0 of 11 known outputs (11 slow hosts and 0 dup hosts)
> > > > > 2007-02-27 22:08:35,284 INFO org.apache.hadoop.mapred.TaskTracker: task_0001_r_000224_0 0.33083335% reduce > copy (3176 of 3200 at 1.94 MB/s) >
> > > > > 2007-02-27 22:08:35,284 INFO org.apache.hadoop.mapred.TaskTracker: task_0001_r_000111_0 0.3321875% reduce > copy (3189 of 3200 at 0.40 MB/s) >
> > > > > 2007-02-27 22:08:36,294 INFO org.apache.hadoop.mapred.TaskTracker: task_0001_r_000224_0 0.33083335% reduce > copy (3176 of 3200 at 1.94 MB/s) >
> > > > > 2007-02-27 22:08:36,294 INFO org.apache.hadoop.mapred.TaskTracker: task_0001_r_000111_0 0.3321875% reduce > copy (3189 of 3200 at 0.40 MB/s) >
> > > > > 2007-02-27 22:08:36,404 INFO org.apache.hadoop.mapred.TaskRunner: task_0001_r_000224_0 Need 24 map output(s)
> > > > > 2007-02-27 22:08:36,404 INFO org.apache.hadoop.mapred.TaskRunner: task_0001_r_000224_0 Need 133 map output location(s)
> > > > > 2007-02-27 22:08:36,422 INFO org.apache.hadoop.mapred.TaskRunner: task_0001_r_000224_0 Got 0 new map outputs from jobtracker and 0 map outputs from previous
> >
> >
>
