hadoop-common-user mailing list archives

From "Bobby Dennett" <softw...@bobby.fastmail.us>
Subject Re: Preventing/Limiting NotReplicatedYetException exceptions
Date Mon, 26 Jul 2010 17:32:32 GMT
Just following up again, as this issue is now a high priority for us:
it is affecting a critical process...

Can anyone provide some insight into what to look for in the logs to
troubleshoot this issue?

Up until the NotReplicatedYetException occurrences, we only see
"java.io.IOException: Could not complete write to file" errors
referring to files that correspond to killed tasks (e.g. task attempts
launched due to speculative execution).

Is there a way to troubleshoot the first case Alex mentions (related to
speculative execution)? Please note that we did not see these errors in
our previous v0.19.1 cluster (whose servers were added to our current
v0.20.1 cluster).
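
One low-risk experiment would be to temporarily disable speculative
execution for the affected job and see whether the exceptions stop. A
minimal sketch using the old (0.20) mapred API; the class name is just
a placeholder for our job setup code:

    // Sketch: turn off speculative execution for one job, to test whether
    // killed speculative attempts are what triggers these errors.
    import org.apache.hadoop.mapred.JobConf;

    public class NoSpeculationJobSetup {
      public static JobConf configure(JobConf conf) {
        // Stop the framework from launching backup attempts for slow tasks.
        conf.setMapSpeculativeExecution(false);
        conf.setReduceSpeculativeExecution(false);
        // Equivalent to setting mapred.map.tasks.speculative.execution and
        // mapred.reduce.tasks.speculative.execution to false for this job.
        return conf;
      }
    }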

Regarding the network, we haven't seen any problems, and we have
ensured network bonding is disabled, to avoid interfaces failing over
and failing back unnecessarily.

Lastly, would it make sense to upgrade to the latest release of
Cloudera's v0.20.1 Hadoop distribution?

Thanks in advance,
-Bobby


On Wed, 21 Jul 2010 17:31 -0700, "Alex Kozlov"
<alexvk@cloudera.com> wrote:

  Hi Bobby,
  It's hard to debug this without seeing the actual logs, but
  we've seen these errors in at least two cases:
  - The file is modified while a client is writing to it (for
  example, when speculative execution is not handled correctly
  and two task attempts write to the same file; see the sketch
  below)
  - Network problems (such as dropped frames)
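
  A minimal sketch of a collision-safe pattern for the first case,
  using the old (0.20) mapred API: each task attempt writes its side
  files under its own work directory, and only the winning attempt's
  files are promoted by the output committer. The file name and
  payload below are purely illustrative:

      // Each attempt gets a private work directory under
      // ${mapred.output.dir}/_temporary/_attempt_.../, so speculative
      // attempts never write to the same path.
      import java.io.IOException;
      import org.apache.hadoop.fs.FSDataOutputStream;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.mapred.FileOutputFormat;
      import org.apache.hadoop.mapred.JobConf;

      public class SideFileWriter {
        public static void writeSideFile(JobConf conf) throws IOException {
          // Per-attempt temporary output path managed by the framework.
          Path workDir = FileOutputFormat.getWorkOutputPath(conf);
          Path sideFile = new Path(workDir, "side-data");  // illustrative name
          FileSystem fs = sideFile.getFileSystem(conf);
          FSDataOutputStream out = fs.create(sideFile);
          try {
            out.writeUTF("example record");  // illustrative payload
          } finally {
            out.close();
          }
        }
      }
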
  Hope this helps to debug your specific issue,
  Alex K

On Wed, Jul 21, 2010 at 5:02 PM, Bobby Dennett
<software@bobby.fastmail.us> wrote:

  Hi all,
  We recently finished migrating from a modified v0.19.1 Apache Hadoop
  cluster to a v0.20.1+169.68 Cloudera Hadoop cluster and now
  periodically encounter
  org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException
  exceptions, which end up affecting at least one of our production
  processes. The exceptions we see are similar to what is shown in the
  following NameNode log snippet and generally come from reduce tasks:
  2010-07-21 04:10:29,749 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=se,se ip=/10.59.32.29 cmd=open src=/data/prod/imp/20100718/impress.20100718-18-8.0.imp dst=null perm=null
  2010-07-21 04:10:29,871 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.59.32.75:8055 is added to blk_2175464249485514619_6665818 size 67108864
  2010-07-21 04:10:29,905 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.59.32.65:8055 is added to blk_-9186105910613839757_6665818 size 67108864
  2010-07-21 04:10:29,907 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /user/kaduindexer-17143/us/201007210300/joined_deals_csv_1/_temporary/_attempt_201007161809_1158_r_000013_0/part-00013. blk_-4046077678219842147_6665819
  2010-07-21 04:10:29,941 INFO org.apache.hadoop.ipc.Server: IPC Server handler 16 on 8080, call addBlock(/user/kaduindexer-17143/us/201007210300/joined_deals_csv_1/_temporary/_attempt_201007161809_1158_r_000033_0/part-00033, DFSClient_attempt_201007161809_1158_r_000033_0, null) from 10.59.32.40:32986: error: org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not replicated yet:/user/kaduindexer-17143/us/201007210300/joined_deals_csv_1/_temporary/_attempt_201007161809_1158_r_000033_0/part-00033
  org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not replicated yet:/user/kaduindexer-17143/us/201007210300/joined_deals_csv_1/_temporary/_attempt_201007161809_1158_r_000033_0/part-00033
          at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1268)
          at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:469)
          at sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          at java.lang.reflect.Method.invoke(Method.java:597)
          at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
          at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:966)
          at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:962)
          at java.security.AccessController.doPrivileged(Native Method)
          at javax.security.auth.Subject.doAs(Subject.java:396)
          at org.apache.hadoop.ipc.Server$Handler.run(Server.java:960)
  2010-07-21 04:10:29,988 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.59.32.65:8055 is added to blk_5254948147841831392_6665819 size 67108864
  2010-07-21 04:10:30,108 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.59.32.36:8055 is added to blk_-6649788991026991959_6665818 size 67108864
  2010-07-21 04:10:30,156 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=se,se ip=/10.59.32.53 cmd=open src=/data/prod/imp/20100718/impress.20100718-19-0.1.imp dst=null perm=null
  The assumption is that these exceptions are due to cluster load
  (particularly I/O, as our current infrastructure only allows one
  drive on each node to be available to Hadoop), but we cannot always
  point to a specific cause. For instance, looking at server "metrics"
  for the master node and the associated datanode at the time of the
  error above, the load on each server did not appear to be alarmingly
  high.
  Note that we have tried increasing the values of the
  dfs.datanode.handler.count and dfs.namenode.handler.count parameters,
  as suggested in a previous thread, but this has not had much impact
  in reducing these exceptions.
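
  For reference, these settings go in hdfs-site.xml on the relevant
  daemons; a fragment with purely illustrative values (not a
  recommendation) might look like:

      <!-- hdfs-site.xml fragment; values are illustrative only -->
      <property>
        <name>dfs.namenode.handler.count</name>
        <!-- NameNode RPC handler threads (default 10 in 0.20) -->
        <value>40</value>
      </property>
      <property>
        <name>dfs.datanode.handler.count</name>
        <!-- DataNode RPC handler threads (default 3 in 0.20) -->
        <value>10</value>
      </property>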
  Can anyone suggest how best to troubleshoot this issue and/or suggest
  possible "fixes" that could limit the occurrences of the
  NotReplicatedYetException exception? Please note that we did not
  encounter these exceptions when running similar jobs on our v0.19.1
  cluster.
  In case it helps, below is some information about our cluster:
   Hadoop Version: 0.20.1+169.68
   Number of Nodes: 64
   Map Task Capacity: 192
   Reduce Task Capacity: 128
   Hadoop Heap Size: 6,000 MB (NN) / 2,000 MB (DN)
   Node Information: Dell 1950 (4-core Xeon @ 2.33 GHz), Ubuntu 8.04
   64-bit, 1 TB disk for Hadoop, 32 GB RAM
   Block Size: 128 MB
   Replication: 3
  Thanks in advance,
  -Bobby

