hadoop-mapreduce-user mailing list archives

From Keith Wiley <kwi...@keithwiley.com>
Subject could only be replicated to 0 nodes, instead of 1
Date Tue, 04 Sep 2012 16:41:26 GMT
I've been running up against the good old-fashioned "replicated to 0 nodes" gremlin quite a
bit recently.  My system (a set of processes interacting with hadoop, and of course hadoop
itself) runs for a while (a day or so) and then I get plagued with these errors.  This is
a very simple system, a single node running pseudo-distributed.  Obviously, the replication
factor is implicitly 1 and the datanode is the same machine as the namenode.  None of the
typical culprits seems to explain the situation and I'm not sure what to do.  I'm also not
sure how I've been getting around it so far.  I fiddle desperately for a few hours and things
start running again, but that's not really a solution.  I've tried stopping and restarting
hdfs, but that doesn't seem to improve things.

So, to go through the common suspects one by one, as quoted on http://wiki.apache.org/hadoop/CouldOnlyBeReplicatedTo:

• No DataNode instances being up and running. Action: look at the servers, see if the processes
are running.

I can interact with hdfs through the command line (doing directory listings for example).
Furthermore, I can see that the relevant java processes are all running (NameNode,
SecondaryNameNode, DataNode, JobTracker, TaskTracker).

• The DataNode instances cannot talk to the server, through networking or Hadoop configuration
problems. Action: look at the logs of one of the DataNodes.

Obviously irrelevant in a single-node scenario.  Anyway, like I said, I can perform basic
hdfs listings, I just can't upload new data.

• Your DataNode instances have no hard disk space in their configured data directories.
Action: look at the dfs.data.dir list in the node configurations, verify that at least one
of the directories exists, and is writeable by the user running the Hadoop processes. Then
look at the logs.

There's plenty of space, at least 50GB.
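
For reference, the wiki's checklist for dfs.data.dir (the directory exists, is writable by the user running Hadoop, and has free space) can be sketched in a few lines of Python. This is only an illustration: the function name check_data_dirs is hypothetical, and the temp directory below is a stand-in for whatever paths the dfs.data.dir property actually lists.

```python
import os
import shutil
import tempfile


def check_data_dirs(dirs):
    """Return a (path, exists, writable, free_bytes) tuple for each directory."""
    report = []
    for path in dirs:
        exists = os.path.isdir(path)
        # Creating files in a directory needs both write and execute permission.
        writable = exists and os.access(path, os.W_OK | os.X_OK)
        # Free space on the filesystem holding the directory (0 if it is missing).
        free = shutil.disk_usage(path).free if exists else 0
        report.append((path, exists, writable, free))
    return report


if __name__ == "__main__":
    # Stand-in path; substitute the directories listed in dfs.data.dir.
    for path, exists, writable, free in check_data_dirs([tempfile.gettempdir()]):
        print(path, "exists:", exists, "writable:", writable, "free bytes:", free)
```

The check only means something when run as the same user that runs the Hadoop processes, since writability depends on who is asking.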

• Your DataNode instances have run out of space. Look at the disk capacity via the Namenode
web pages. Delete old files. Compress under-used files. Buy more disks for existing servers
(if there is room), upgrade the existing servers to bigger drives, or add some more servers.

Nope, 50GB free, and I'm only uploading a few KB at a time, maybe a few MB.

• The reserved space for a DN (as set in dfs.datanode.du.reserved) is greater than the remaining
free space, so the DN thinks it has no free space.

I grepped all the files in the conf directory and couldn't find this parameter, so I don't
really know anything about it.  At any rate, it seems rather esoteric and I doubt it is related
to my problem.  Any thoughts on this?
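
The arithmetic behind that wiki bullet can be sketched as follows. This is a simplified model of the wiki's description, not Hadoop's exact accounting; the function name usable_bytes is hypothetical, and the default of 0 for the reserved amount is an assumption about what applies when dfs.datanode.du.reserved is not set in the configuration.

```python
GB = 1024 ** 3


def usable_bytes(free_bytes, reserved_bytes=0):
    """Space left for HDFS blocks after subtracting the reserved amount.

    Simplified model: never negative, so a reservation larger than the
    free space yields 0, i.e. the DataNode appears full.
    """
    return max(free_bytes - reserved_bytes, 0)


if __name__ == "__main__":
    # 50 GB free, nothing reserved: all of it is usable.
    print(usable_bytes(50 * GB))
    # 50 GB free, 60 GB reserved (hypothetical setting): nothing is usable.
    print(usable_bytes(50 * GB, 60 * GB))  # prints 0
```

Under this model, a node with 50GB free only looks full if the reserved value is set to 50GB or more, which would have to be configured explicitly somewhere.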

• You may also get this message due to permissions, e.g., if the JT cannot create jobtracker.info
on startup.

Meh, like I said, the system basically works...and then stops working.  The only explanation
that would really make sense in that context is running out of space...which isn't happening.
If this were a permission error, or a configuration error, or anything weird like that, then
the whole system would never get up and running in the first place.

Why would a properly running hadoop system start exhibiting this error without running out
of disk space?  THAT's the real question on the table here.

Any ideas?

Keith Wiley     kwiley@keithwiley.com     keithwiley.com    music.keithwiley.com

"Yet mark his perfect self-contentment, and hence learn his lesson, that to be
self-contented is to be vile and ignorant, and that to aspire is better than to
be blindly and impotently happy."
                                           --  Edwin A. Abbott, Flatland
