hadoop-common-dev mailing list archives

From: Stefan Groschupf <...@media-style.com>
Subject: Re: io.skip.checksum.errors was: Re: Hung job
Date: Mon, 13 Mar 2006 01:59:45 GMT
Hi Stack,
I tried that, but unfortunately it does not solve the problem, even
with the very latest Hadoop update in Nutch.

@Doug, as far as I understand, the latest update already contains this
rollback ("Reverted changes from 384385")?

To summarize: I'm not able to run even one larger Hadoop job; e.g.
Nutch fetching has never worked.
One tasktracker always hangs in the last 10% of a reduce (sometimes
it gets to 98%), then the task runs into a timeout and is restarted,
fails, is restarted again, fails ...
After 4 failures of the task in progress, the whole job crashes.

There is just nothing in the log files that gives any hint, even
though I configured it to skip checksum errors.
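
A quick way to double-check that the flag is really being picked up
from hadoop-site.xml would be something like the snippet below (just a
sketch; the JobConf/Configuration method names are from memory and may
differ slightly in this Hadoop version):

  import org.apache.hadoop.mapred.JobConf;

  public class CheckSkipFlag {
    public static void main(String[] args) {
      // JobConf() loads hadoop-default.xml and hadoop-site.xml from the classpath
      JobConf conf = new JobConf();
      boolean skip = conf.getBoolean("io.skip.checksum.errors", false);
      System.out.println("io.skip.checksum.errors = " + skip);
    }
  }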

Any ideas where to search for the problem? It is very sad to throw a
fetched segment with several million documents into the trash just
because the last 3% of one tasktracker's reduce fails.

Thanks for any comments.
Stefan


On 12.03.2006, at 03:33, Stefan Groschupf wrote:

> Hi Stack,
>
> try setting io.skip.checksum.errors to true in your hadoop-site.xml:
>
> <property>
>   <name>io.skip.checksum.errors</name>
>   <value>true</value>
>   <description>If true, when a checksum error is encountered while
>   reading a sequence file, entries are skipped, instead of throwing an
>   exception.</description>
> </property>
>
> This may solve your problem, but I agree that there should be a  
> smarter way than just ignoring it. :)
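
(To illustrate what that option is supposed to do: with it enabled,
reading a sequence file roughly behaves like the sketch below, skipping
the bad entry instead of propagating the exception. This is only an
illustration, not the actual SequenceFile code; the reader interface
here is made up.)

  import java.io.IOException;
  import org.apache.hadoop.fs.ChecksumException;

  public class SkippingReadLoop {
    // Stand-in for the real sequence-file reader API (hypothetical).
    interface RecordReader {
      boolean next() throws IOException;            // read one entry, may throw ChecksumException
      void seekPastBadRegion() throws IOException;  // resync after a corrupt block (hypothetical)
    }

    static void readAll(RecordReader reader, boolean skipChecksumErrors) throws IOException {
      while (true) {
        try {
          if (!reader.next()) {
            return;                                 // end of file
          }
        } catch (ChecksumException e) {
          if (!skipChecksumErrors) {
            throw e;                                // old behaviour: fail the task
          }
          reader.seekPastBadRegion();               // skip the corrupt entry and carry on
        }
      }
    }
  }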
>
> Stefan
>
>
> On 10.03.2006, at 20:22, stack wrote:
>
>> On hadoop-users, I've described two recent hangs.  I'm writing
>> here to dev because I'm looking for advice on how best to conjure
>> a fix, and perhaps pointers to any existing facility I might
>> exploit (I do not know the codebase well).
>>
>> In synopsis, the problem goes as follows:
>>
>> If a reduce cannot pick up map outputs -- for example, the output
>> has been moved aside because of a ChecksumException (see the stack
>> trace below) -- then the job gets stuck with the reduce task trying
>> and failing every ten seconds or so to pick up the non-existent map
>> output part.
>>
>> Somehow the reduce needs to give up and the jobtracker needs to  
>> rerun the map just as it would if the tasktracker had died  
>> completely.
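
(Purely to illustrate the kind of give-up logic being asked for here --
this is not existing Hadoop code, and all names in the sketch are
invented: after too many consecutive fetch failures for the same map
output, the reduce would report the map as lost so the jobtracker can
reschedule it.)

  public class MapOutputFetchLoop {
    private static final int MAX_FETCH_FAILURES = 10;

    // Hypothetical callback to the tracker side.
    interface Umbilical {
      void reportLostMapOutput(String mapTaskId);
    }

    private final Umbilical umbilical;

    MapOutputFetchLoop(Umbilical umbilical) {
      this.umbilical = umbilical;
    }

    /** Try to fetch one map output part; give up after MAX_FETCH_FAILURES attempts. */
    boolean fetchWithGiveUp(String mapTaskId) throws InterruptedException {
      int failures = 0;
      while (failures < MAX_FETCH_FAILURES) {
        if (tryFetch(mapTaskId)) {
          return true;                          // got the part
        }
        failures++;
        Thread.sleep(10000);                    // roughly the 10 second retry interval seen here
      }
      umbilical.reportLostMapOutput(mapTaskId); // let the jobtracker rerun the map
      return false;
    }

    private boolean tryFetch(String mapTaskId) {
      // placeholder for the actual copy of part-NN.out from the map host
      return false;
    }
  }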
>>
>> Thanks in advance for any pointers,
>> St.Ack
>>
>>
>> 060309 014426 Moving bad file /0/hadoop/tmp/task_m_bq2g76/part-20.out to
>> /0/bad_files/part-20.out.2002824050
>> 060309 014426 Server handler 0 on 50040 caught:
>> org.apache.hadoop.fs.ChecksumException: Checksum error:
>> /0/hadoop/tmp/task_m_bq2g76/part-20.out at 2649600
>> org.apache.hadoop.fs.ChecksumException: Checksum error:
>> /0/hadoop/tmp/task_m_bq2g76/part-20.out at 2649600
>>     at org.apache.hadoop.fs.FSDataInputStream$Checker.verifySum(FSDataInputStream.java:122)
>>     at org.apache.hadoop.fs.FSDataInputStream$Checker.read(FSDataInputStream.java:98)
>>     at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:158)
>>     at java.io.BufferedInputStream.read1(BufferedInputStream.java:254)
>>     at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
>>     at java.io.DataInputStream.read(DataInputStream.java:80)
>>     at org.apache.hadoop.mapred.MapOutputFile.write(MapOutputFile.java:110)
>>     at org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:117)
>>     at org.apache.hadoop.io.ObjectWritable.write(ObjectWritable.java:64)
>>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:215)
>>
>>
>>
>
> ---------------------------------------------
> blog: http://www.find23.org
> company: http://www.media-style.com
>
>
>

---------------------------------------------
blog: http://www.find23.org
company: http://www.media-style.com


