hadoop-common-user mailing list archives

From "Bobby Dennett" <softw...@bobby.fastmail.us>
Subject Preventing/Limiting NotReplicatedYetException exceptions
Date Thu, 22 Jul 2010 00:02:37 GMT
Hi all,

We recently finished migrating from a modified v0.19.1 Apache Hadoop
cluster to a v0.20.1+169.68 Cloudera Hadoop cluster and now encounter
org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException
exceptions periodically, which end up affecting at least one of our
production processes. The exceptions we see are similar to what is shown
in the following NameNode log snippet and generally come from reduce
tasks:

2010-07-21 04:10:29,749 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=se,se    
  ip=/10.59.32.29 cmd=open       
src=/data/prod/imp/20100718/impress.20100718-18-8.0.imp dst=null       
perm=null
2010-07-21 04:10:29,871 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.addStoredBlock: blockMap updated: 10.59.32.75:8055 is added
to blk_2175464249485514619_6665818 size 67108864
2010-07-21 04:10:29,905 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.addStoredBlock: blockMap updated: 10.59.32.65:8055 is added
to blk_-9186105910613839757_6665818 size 67108864
2010-07-21 04:10:29,907 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.allocateBlock:
/user/kaduindexer-17143/us/201007210300/joined_deals_csv_1/_temporary/_attempt_201007161809_1158_r_000013_0/part-00013.
blk_-4046077678219842147_6665819
2010-07-21 04:10:29,941 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 16 on 8080, call
addBlock(/user/kaduindexer-17143/us/201007210300/joined_deals_csv_1/_temporary/_attempt_201007161809_1158_r_000033_0/part-00033,
DFSClient_attempt_201007161809_1158_r_000033_0, null) from
10.59.32.40:32986: error:
org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not
replicated
yet:/user/kaduindexer-17143/us/201007210300/joined_deals_csv_1/_temporary/_attempt_201007161809_1158_r_000033_0/part-00033
org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not
replicated
yet:/user/kaduindexer-17143/us/201007210300/joined_deals_csv_1/_temporary/_attempt_201007161809_1158_r_000033_0/part-00033
        at
        org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1268)
        at
        org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:469)
        at sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source)
        at
        sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:966)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:962)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:960)
2010-07-21 04:10:29,988 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.addStoredBlock: blockMap updated: 10.59.32.65:8055 is added
to blk_5254948147841831392_6665819 size 67108864
2010-07-21 04:10:30,108 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.addStoredBlock: blockMap updated: 10.59.32.36:8055 is added
to blk_-6649788991026991959_6665818 size 67108864
2010-07-21 04:10:30,156 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=se,se    
  ip=/10.59.32.53 cmd=open       
src=/data/prod/imp/20100718/impress.20100718-19-0.1.imp dst=null       
perm=null

We assume these exceptions are caused by cluster load, particularly I/O,
since our current infrastructure allows only one drive on each node to
be available to Hadoop, but we cannot always point to a specific cause.
For instance, the server metrics for the master node and the associated
datanode at the time of the error above did not show an alarmingly high
load on either server.

Note that we have tried increasing the values of the
dfs.datanode.handler.count and dfs.namenode.handler.count parameters, as
suggested in a previous thread, but this has not noticeably reduced the
number of exceptions.

Can anyone suggest how best to troubleshoot this issue, or suggest
possible fixes that could limit the occurrences of
NotReplicatedYetException?
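
As one possible starting point for troubleshooting, here is a minimal
Python sketch that counts the failed addBlock calls in a NameNode log
per minute and per client IP, so that bursts can be lined up against job
activity. It is only an illustration: the function name summarize and
the regular expression are assumptions based on the log format shown
above, and it assumes each log record sits on one physical line (unlike
the wrapped excerpt in this message).

```python
import re
from collections import Counter

# Matches the "IPC Server handler" record that reports a failed addBlock call.
# Assumption: the record is on a single line in the real NameNode log.
PATTERN = re.compile(
    r"^(?P<minute>\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2},\d{3} .*"
    r"call addBlock\((?P<path>[^,]+), .*\) from (?P<ip>[\d.]+):\d+: error: "
    r".*NotReplicatedYetException"
)

def summarize(lines):
    """Return (count per minute, count per client IP) for failed addBlock calls."""
    by_minute = Counter()
    by_client = Counter()
    for line in lines:
        match = PATTERN.match(line)
        if match:
            by_minute[match.group("minute")] += 1
            by_client[match.group("ip")] += 1
    return by_minute, by_client

# The failing record from the excerpt above, joined back into one line.
path = ("/user/kaduindexer-17143/us/201007210300/joined_deals_csv_1/"
        "_temporary/_attempt_201007161809_1158_r_000033_0/part-00033")
sample = [
    "2010-07-21 04:10:29,941 INFO org.apache.hadoop.ipc.Server: IPC Server "
    "handler 16 on 8080, call addBlock(" + path + ", "
    "DFSClient_attempt_201007161809_1158_r_000033_0, null) from "
    "10.59.32.40:32986: error: "
    "org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: "
    "Not replicated yet:" + path,
    # An unrelated record that must not be counted.
    "2010-07-21 04:10:29,871 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* "
    "NameSystem.addStoredBlock: blockMap updated: 10.59.32.75:8055 is added "
    "to blk_2175464249485514619_6665818 size 67108864",
]

by_minute, by_client = summarize(sample)
print(dict(by_minute))
print(dict(by_client))
```

To run it against a real log, pass the open file object to summarize
instead of the sample list.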

Please note that we did not encounter these exceptions when running
similar jobs on our v0.19.1 cluster.

In case it helps, below is some information about our cluster:
  Hadoop Version: 0.20.1+169.68
  Number of Nodes: 64
  Map Task Capacity: 192
  Reduce Task Capacity: 128
  Hadoop Heap Size: 6,000 MB (NN) / 2,000 MB (DN)
  Node Information: Dell 1950 (4 core, Xeon @ 2.33GHz), Ubuntu 8.04
  64-bit, 1 TB disk for Hadoop, 32 GB RAM
  Block Size: 128 MB
  Replication: 3
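
To put the I/O concern in numbers, the figures above can be turned into
a rough per-disk estimate. This is back-of-the-envelope arithmetic, not
a measurement: it assumes slots are spread evenly across nodes and, as a
worst case, that every reduce slot is writing output at the same time.
The variable names are made up for the illustration.

```python
# Cluster figures taken from the list above.
nodes = 64
map_slots = 192
reduce_slots = 128
disks_per_node = 1
replication = 3

# Assumption: task slots are distributed evenly across the nodes.
maps_per_node = map_slots // nodes
reduces_per_node = reduce_slots // nodes
tasks_per_disk = (maps_per_node + reduces_per_node) / disks_per_node

# Worst case: every reducer writes a block at once, and each block is
# stored `replication` times somewhere in the cluster.
replica_streams_per_disk = reduce_slots * replication / (nodes * disks_per_node)

print(maps_per_node, reduces_per_node, tasks_per_disk, replica_streams_per_disk)
```

Under those assumptions a single disk serves up to 5 concurrent tasks
and, on average, up to 6 incoming replica write streams.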

Thanks in advance,
-Bobby
