hadoop-common-user mailing list archives

From "Bobby Dennett" <softw...@bobby.fastmail.us>
Subject Preventing/Limiting NotReplicatedYetException exceptions
Date Thu, 22 Jul 2010 00:02:37 GMT
Hi all,

We recently finished migrating from a modified v0.19.1 Apache Hadoop
cluster to a v0.20.1+169.68 Cloudera Hadoop cluster and now encounter
exceptions periodically, which end up affecting at least one of our
production processes. The exceptions we see are similar to what is shown
in the following NameNode log snippet and generally come from reduce tasks:

2010-07-21 04:10:29,749 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=se,se    
  ip=/ cmd=open       
src=/data/prod/imp/20100718/impress.20100718-18-8.0.imp dst=null       
2010-07-21 04:10:29,871 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.addStoredBlock: blockMap updated: is added
to blk_2175464249485514619_6665818 size 67108864
2010-07-21 04:10:29,905 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.addStoredBlock: blockMap updated: is added
to blk_-9186105910613839757_6665818 size 67108864
2010-07-21 04:10:29,907 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
2010-07-21 04:10:29,941 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 16 on 8080, call
DFSClient_attempt_201007161809_1158_r_000033_0, null) from error:
org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not
org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not
        at sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:966)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:962)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:960)
2010-07-21 04:10:29,988 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.addStoredBlock: blockMap updated: is added
to blk_5254948147841831392_6665819 size 67108864
2010-07-21 04:10:30,108 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.addStoredBlock: blockMap updated: is added
to blk_-6649788991026991959_6665818 size 67108864
2010-07-21 04:10:30,156 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=se,se    
  ip=/ cmd=open       
src=/data/prod/imp/20100718/impress.20100718-19-0.1.imp dst=null       

We assume these exceptions are caused by cluster load, particularly disk
I/O, since our current infrastructure makes only one drive per node
available to Hadoop. However, we cannot always point to a specific cause.
For instance, the server metrics for the master node and the associated
datanode at the time of the error above did not show alarmingly high load
on either server.
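
One way to look for a pattern is to tally the exceptions by hour and by
task attempt, then compare the busiest hours against the server metrics.
The following is a minimal Python sketch, assuming the NameNode log layout
shown in the snippet above. The function name tally_not_replicated and the
regular expressions are illustrative and not part of Hadoop.

```python
import re
from collections import Counter

# Timestamp at the start of a log record, e.g. "2010-07-21 04:10:29,941".
# Group 1 keeps only the date and hour for bucketing.
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}):\d{2}:\d{2},\d{3} ")
# Task attempt ID embedded in the DFSClient name (m = map, r = reduce).
ATTEMPT = re.compile(r"DFSClient_(attempt_\d+_\d+_[mr]_\d+_\d+)")


def tally_not_replicated(lines):
    """Count NotReplicatedYetException records per hour and per attempt.

    A record starts at a timestamped line and runs until the next one, so
    the repeated exception name in the stack trace is counted only once.
    """
    per_hour = Counter()
    per_attempt = Counter()
    hour = None
    attempt = None
    counted = False
    for line in lines:
        ts = TIMESTAMP.match(line)
        if ts:
            # New record: reset the per-record state.
            hour = ts.group(1)
            attempt = None
            counted = False
        found = ATTEMPT.search(line)
        if found:
            attempt = found.group(1)
        if "NotReplicatedYetException" in line and not counted:
            counted = True
            per_hour[hour] += 1
            if attempt is not None:
                per_attempt[attempt] += 1
    return per_hour, per_attempt
```

It could be fed an open log file, for example
tally_not_replicated(open(path)), where path is wherever the NameNode log
lives on the master node.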

Note that we have tried increasing the values of the
dfs.datanode.handler.count and dfs.namenode.handler.count parameters, as
suggested in a previous thread, but this has done little to reduce these
exceptions.
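
Before ruling the handler counts out, it may be worth confirming that the
new values are actually present in the configuration file each daemon
reads. The following is a minimal Python sketch, assuming the standard
Hadoop configuration layout of property elements with name and value
children. The helper read_properties and the sample value 64 are
hypothetical and are not the values used on this cluster.

```python
import xml.etree.ElementTree as ET


def read_properties(xml_text, names):
    """Return {name: value or None} for the requested property names."""
    root = ET.fromstring(xml_text)
    found = {name: None for name in names}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        if name in found:
            value = prop.findtext("value")
            found[name] = value.strip() if value is not None else None
    return found


# Hypothetical configuration text, for illustration only.
SAMPLE = """<configuration>
  <property>
    <name>dfs.namenode.handler.count</name>
    <value>64</value>
  </property>
</configuration>"""

WANTED = ["dfs.namenode.handler.count", "dfs.datanode.handler.count"]
print(read_properties(SAMPLE, WANTED))
```

A value of None means the property is not set in that file, in which case
the daemon falls back to its built-in default.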

Can anyone suggest how best to troubleshoot this issue, or possible
fixes that could limit the occurrences of NotReplicatedYetException?

Please note that we did not encounter these exceptions when running
similar jobs on our v0.19.1 cluster.

In case it helps, below is some information about our cluster:
  Hadoop Version: 0.20.1+169.68
  Number of Nodes: 64
  Map Task Capacity: 192
  Reduce Task Capacity: 128
  Hadoop Heap Size: 6,000 MB (NN) / 2,000 MB (DN)
  Node Information: Dell 1950 (4 core, Xeon @ 2.33GHz), Ubuntu 8.04
  64-bit, 1 TB disk for Hadoop, 32 GB RAM
  Block Size: 128 MB
  Replication: 3
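
For a rough sense of what the single disk per node has to absorb, the
figures above can be turned into per-disk numbers. This is a
back-of-the-envelope sketch in Python, assuming all 64 nodes run a
DataNode and a TaskTracker, slots are spread evenly, and every reduce slot
is writing to HDFS at the same time (a worst case, not a measurement). The
function name per_disk_pressure is illustrative.

```python
def per_disk_pressure(nodes, map_slots, reduce_slots, replication,
                      disks_per_node=1):
    """Return (task slots per disk, replica write streams per disk)."""
    disks = nodes * disks_per_node
    # Map and reduce tasks that can run concurrently against each disk.
    task_slots_per_disk = (map_slots + reduce_slots) / disks
    # Each reducer output block is written `replication` times cluster-wide.
    replica_streams_per_disk = reduce_slots * replication / disks
    return task_slots_per_disk, replica_streams_per_disk


slots, streams = per_disk_pressure(nodes=64, map_slots=192,
                                   reduce_slots=128, replication=3)
print(slots, streams)  # prints "5.0 6.0"
```

Under these assumptions each disk serves up to 5 concurrent tasks and, on
average, 6 replica write streams when all reducers are writing.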

Thanks in advance,
