hadoop-common-dev mailing list archives

From lohit <lohit...@yahoo.com>
Subject Re: [core] problems while copying files from local file system to dfs
Date Mon, 24 Mar 2008 21:08:00 GMT
If the client you use for the copy is itself one of the datanodes, then the first copy would go to this datanode (the client)
and the second would go to another, random node in your cluster. This policy is designed to improve
write performance. On the other hand, if you would like the data to be distributed, then, as Ted
pointed out, run the copy from a node that is not one of the datanodes in your cluster. In that case, the first
copy would be placed on a random node in the cluster because your client is no longer a datanode.
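
For example (the paths below are just placeholders), you could run the copy from a machine that is not a datanode
and then verify where the blocks landed with fsck:

  bin/hadoop dfs -copyFromLocal /local/gass /user/alfonso/gass
  bin/hadoop fsck /user/alfonso/gass -files -blocks -locations

fsck prints, for every block, the datanodes holding its replicas, so you can confirm that no single node
ends up with a copy of every file.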

Thanks,
Lohit

----- Original Message ----
From: Ted Dunning <tdunning@veoh.com>
To: core-user@hadoop.apache.org; core-dev@hadoop.apache.org
Sent: Monday, March 24, 2008 7:40:06 AM
Subject: Re: [core] problems while copying files from local file system to dfs



Copy from a machine that is *not* running as a data node in order to get
better balancing.  Using distcp may also help because the nodes actually
doing the copying will be spread across the cluster.
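
For example (the paths and namenode address here are hypothetical, and this assumes the source directory is
visible from every node, e.g. over NFS):

  bin/hadoop distcp file:///shared/gass hdfs://namenode:9000/user/alfonso/gass

distcp runs as a map/reduce job, so the writes originate from many machines instead of a single client.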

You should probably be running a rebalancing script as well if your nodes
have differing sizes.
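
If you are on 0.16 or later, I believe there is also a balancer you can run directly, something like:

  bin/hadoop balancer -threshold 10

which moves blocks from over-utilized datanodes to under-utilized ones until every node is within the given
percentage of the cluster-wide average utilization.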


On 3/24/08 7:35 AM, "Alfonso Olias Sanz" <alfonso.olias.sanz@gmail.com>
wrote:

> Hi
> 
> I want to copy 1000 files (37GB) of data to the dfs.  I have a set up
> of 9-10 nodes, each one has between 5 to 15GB of free space.
> 
> While copying the files from the local file system on nodeA, the node
> gets full of data and the process stalls.
> 
> I have another free node with 80GB of free space. After adding the
> datanode to the cluster, I ran the same copy process again:
> 
> hadoop dfs  -copyFromLocal ...
> 
> While copying these files to the DFS, I ran a Java
> application to check where the data is located (the replication
> level is set to 2):
> 
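> // for each block in the byte range [0, 100), list the hostnames that hold a replica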
> String[][] hostnames = dfs.getFileCacheHints(inFile, 0, 100L);
> 
> The output I print is the following
> 
> File name = GASS.0011.63800-0011.63900.zip
> File cache hints =   gaiawl07.net4.lan gaiawl02.net4.lan
> ############################################
> File name = GASS.0011.53100-0011.53200.zip
> File cache hints =   gaiawl03.net4.lan gaiawl02.net4.lan
> ############################################
> File name = GASS.0011.23800-0011.23900.zip
> File cache hints =   gaiawl08.net4.lan gaiawl02.net4.lan
> ############################################
> File name = GASS.0011.18800-0011.18900.zip
> File cache hints =   gaiawl02.net4.lan gaiawl06.net4.lan
> ....
> 
> In this small sample gaiawl02.net4.lan appears for every file, and
> this is currently happening for every copied file.  I launch the
> copy process from that machine, which is also the one with 80GB of
> free space.  I did this because of the problem I pointed out previously,
> of filling up a node and stalling the copy operation.
> 
> Shouldn't the data be dispersed across all the nodes? Because if that
> data node crashes, only 1 replica of the data is going to exist in the
> cluster.
> 
> During the "staging" phase I understand that that particular node
> contains a local copy of the file being added to the HDFS. But once a
> block is filled, this doesn't mean that the block also has to be on
> that node. Am I right?
> 
> Is it possible to spread the data among all the data nodes, so that
> no single node keeps 1 replica of every copied file?
> 
> thanks





