hadoop-common-user mailing list archives

From Chris Curtin <curtin.ch...@gmail.com>
Subject Job not exiting on data node, so cluster hangs?
Date Thu, 17 Mar 2011 12:17:35 GMT
Hi,

Our 4-node cluster has been hanging almost every night for the last
couple of weeks.

We have issues in a couple of places according to the logs.

First:

INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file /offline/working/3/aat/_temporary/_attempt_201103100812_0024_r_000003_0/4129371_172307245/part-00003 retrying...

(We get thousands of these until we reboot the cluster.)
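
A rough way to see which files these retries concentrate on is to tally the messages per path. The sketch below is illustrative only: the function names are hypothetical, and it assumes each message sits on a single log line in the form quoted above (path, then "retrying...").

```python
import re
from collections import Counter

# Matches the DFSClient retry message quoted above. The assumption here is
# that the path contains no whitespace and is followed by "retrying".
RETRY_RE = re.compile(r"Could not complete file (\S+)\s+retrying")


def count_retries(lines):
    """Return a Counter mapping each file path to its number of retry messages."""
    counts = Counter()
    for line in lines:
        match = RETRY_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def count_retries_in_file(log_path):
    """Tally retry messages in one log file (log_path is a hypothetical location)."""
    with open(log_path, errors="replace") as log:
        return count_retries(log)
```

Calling `count_retries_in_file(...)` on a task or client log and printing `.most_common(10)` would show whether the retries cluster on one output file or are spread across many.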

This happens after what appears to be a job hanging on a data node. The
Hadoop logs for the job, on both the local machine and the JobTracker, say it
completed, but the Java process is still around, and I find this in the
stderr file for the job:

Exception in thread "IPC Client (47) connection to /127.0.0.1:60365 from hadoop" java.lang.RuntimeException: readObject can't find class
        at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:185)
        at org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:66)
        at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:511)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)
Caused by: java.lang.ClassNotFoundException:
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:247)
        at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:762)
        at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:183)
        ... 3 more

Once this happens, I see it occur on a couple of other data nodes for other
jobs, and then the cluster stops responding. There is no UI, and
'hadoop job -list all' hangs as well.

We are running 0.20.2, r911707

Any suggestions as to what is going on?

Thanks,

Chris
