hadoop-common-dev mailing list archives

From "Alfonso Olias Sanz" <alfonso.olias.s...@gmail.com>
Subject [core] problems while copying files from local file system to dfs
Date Mon, 24 Mar 2008 14:35:12 GMT
Hi

I want to copy 1000 files (37 GB) of data to the DFS.  I have a setup
of 9-10 nodes; each one has between 5 and 15 GB of free space.

While copying the files from the local file system on nodeA, the node
fills up with data and the process stalls.

I have another free node with 80 GB of space. After adding that
datanode to the cluster, I ran the same copy process again:

hadoop dfs -copyFromLocal ...
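
For example, with illustrative paths (the source and target
directories below are placeholders, not my real ones):

hadoop dfs -copyFromLocal /local/gass/GASS.0011.63800-0011.63900.zip /user/alfonso/GASS/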

While these files were being copied to the DFS, I ran a Java
application to check where the data ends up (the replication level is
set to 2):

String [][] hostnames = dfs.getFileCacheHints(inFile, 0, 100L);
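
For context, a minimal self-contained sketch of that check could look
like the following (the class name and the command-line handling are
placeholders, not my exact program):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CacheHintCheck {
  public static void main(String[] args) throws Exception {
    // args[0] is the DFS path of the file to inspect, e.g.
    // /user/alfonso/GASS/GASS.0011.63800-0011.63900.zip (placeholder).
    FileSystem dfs = FileSystem.get(new Configuration());
    Path inFile = new Path(args[0]);

    // Ask the namenode which hosts hold the first 100 bytes of the
    // file; each inner array lists the datanodes holding one block.
    String[][] hostnames = dfs.getFileCacheHints(inFile, 0, 100L);

    System.out.println("File name = " + inFile.getName());
    System.out.print("File cache hints = ");
    for (String[] blockHosts : hostnames) {
      for (String host : blockHosts) {
        System.out.print("  " + host);
      }
    }
    System.out.println();
  }
}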

The output it prints is the following:

File name = GASS.0011.63800-0011.63900.zip
File cache hints =   gaiawl07.net4.lan gaiawl02.net4.lan
############################################
File name = GASS.0011.53100-0011.53200.zip
File cache hints =   gaiawl03.net4.lan gaiawl02.net4.lan
############################################
File name = GASS.0011.23800-0011.23900.zip
File cache hints =   gaiawl08.net4.lan gaiawl02.net4.lan
############################################
File name = GASS.0011.18800-0011.18900.zip
File cache hints =   gaiawl02.net4.lan gaiawl06.net4.lan
....

In this small sample gaiawl02.net4.lan appears for every file, and
the same is currently true for every copied file.  I launch the copy
process from that machine, which is also the one with 80 GB of free
space.  I did this because of the problem I mentioned earlier of
filling up a node and stalling the copy operation.

Shouldn't the data be dispersed across all the nodes? Otherwise, if
that data node crashes, only one replica of the data will remain in
the cluster.

During the "staging" phase I understand that that perticulary node
contains a local copy of the file being added to the HDFS. But once a
block is filled this doesn't mean that the block has to be also on
that node. Am I right?

Is it possible to spread the data among all the datanodes, so that no
single node holds one replica of every copied file?

thanks
