hadoop-common-user mailing list archives

From "Alfonso Olias Sanz" <alfonso.olias.s...@gmail.com>
Subject Re: [core] problems while coping files from local file system to dfs
Date Mon, 24 Mar 2008 22:37:05 GMT
Yes, I did the test and it worked. I also ran the distcp command, the
parallel map/reduce copy.

It improves performance because the files are copied locally on that
node, so no network transmission is needed. But isn't that policy
weaker? If that node crashes (the worst case), you lose one level of
redundancy.
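
A minimal sketch of the placement policy as it is described in this
thread, assuming a replication level of 2. It is a toy simulation, not
Hadoop code: the class name, the method name and the node names are all
hypothetical, and the real namenode logic may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

public class PlacementSketch {
    // Chooses two distinct nodes for one block (needs at least two datanodes).
    // First replica: the writing client if it is itself a datanode,
    // otherwise a random datanode. Second replica: a different random datanode.
    static List<String> chooseReplicas(List<String> datanodes, String client, Random rng) {
        String first = datanodes.contains(client)
                ? client
                : datanodes.get(rng.nextInt(datanodes.size()));
        List<String> others = new ArrayList<>(datanodes);
        others.remove(first); // the second replica must not share a node with the first
        String second = others.get(rng.nextInt(others.size()));
        return Arrays.asList(first, second);
    }

    public static void main(String[] args) {
        // Hypothetical cluster; "laptop" stands for a machine outside the cluster.
        List<String> nodes = Arrays.asList("node1", "node2", "node3", "node4");
        Random rng = new Random();
        System.out.println("client is a datanode: " + chooseReplicas(nodes, "node2", rng));
        System.out.println("client is external:   " + chooseReplicas(nodes, "laptop", rng));
    }
}
```

Under these assumptions, a client that is a datanode always ends up
holding one replica of every block it writes, which matches what I saw.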

Would HDFS automatically start a replication process if such a problem
happens, in order to keep the replication level?
Thanks


On 24/03/2008, lohit <lohit_bv@yahoo.com> wrote:
> If the client you use to copy is one of the datanodes, then the first copy goes to this
> datanode (the client) and the second goes to another random node in your cluster. This policy
> is designed to improve write performance. On the other hand, if you would like the data to
> be distributed, as Ted pointed out, use a node which is not a datanode in your cluster.
> In this case, the first copy would be placed on a random node in the cluster because your
> client is no longer a datanode.
>
>  Thanks,
>
> Lohit
>
>
>  ----- Original Message ----
>  From: Ted Dunning <tdunning@veoh.com>
>  To: core-user@hadoop.apache.org; core-dev@hadoop.apache.org
>  Sent: Monday, March 24, 2008 7:40:06 AM
>  Subject: Re: [core] problems while coping files from local file system to dfs
>
>
>
>  Copy from a machine that is *not* running as a data node in order to get
>  better balancing.  Using distcp may also help because the nodes actually
>  doing the copying will be spread across the cluster.
>
>  You should probably be running a rebalancing script as well if your nodes
>  have differing sizes.
>
>
>  On 3/24/08 7:35 AM, "Alfonso Olias Sanz" <alfonso.olias.sanz@gmail.com>
>  wrote:
>
>  > Hi
>  >
>  > I want to copy 1000 files (37GB) of data to the dfs.  I have a setup
>  > of 9-10 nodes, each with between 5 and 15GB of free space.
>  >
>  > While copying the files from the local file system on nodeA, the node
>  > fills up with data and the process stalls.
>  >
>  > I have another free node with 80GB of free space. After adding that
>  > datanode to the cluster, I ran the same copy process again:
>  >
>  > hadoop dfs  -copyFromLocal ...
>  >
>  > During the copy of these files to the DFS, I ran a Java
>  > application in order to check where the data is located (replication
>  > level is set to 2):
>  >
>  > String [][] hostnames = dfs.getFileCacheHints(inFile, 0, 100L);
>  >
>  > The output I print is the following
>  >
>  > File name = GASS.0011.63800-0011.63900.zip
>  > File cache hints =   gaiawl07.net4.lan gaiawl02.net4.lan
>  > ############################################
>  > File name = GASS.0011.53100-0011.53200.zip
>  > File cache hints =   gaiawl03.net4.lan gaiawl02.net4.lan
>  > ############################################
>  > File name = GASS.0011.23800-0011.23900.zip
>  > File cache hints =   gaiawl08.net4.lan gaiawl02.net4.lan
>  > ############################################
>  > File name = GASS.0011.18800-0011.18900.zip
>  > File cache hints =   gaiawl02.net4.lan gaiawl06.net4.lan
>  > ....
>  >
>  > In this small sample gaiawl02.net4.lan appears for every file, and
>  > this is currently happening for every copied file.  I launched the
>  > copy process from that machine, which is also the one that has 80GB of
>  > free space.  I did this because of the problem I pointed out previously:
>  > a node filling up and stalling the copy operation.
>  >
>  > Shouldn't the data be dispersed across all the nodes? If that data
>  > node crashes, only 1 replica of the data is going to exist in the
>  > cluster.
>  >
>  > During the "staging" phase I understand that that particular node
>  > contains a local copy of the file being added to the HDFS. But once a
>  > block is filled, this doesn't mean that the block also has to be on
>  > that node. Am I right?
>  >
>  > Is it possible to spread the data among all the data nodes to avoid
>  > that a node keeps 1 replica of every copied file?
>  >
>  > thanks
>
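
The skew in the sample output quoted above can be made explicit by
counting how often each host appears. This is a small standalone sketch:
it does not call Hadoop, the class and method names are hypothetical, and
the rows are simply the four "File cache hints" lines copied from the
original message, one row per file.

```java
import java.util.Map;
import java.util.TreeMap;

public class ReplicaTally {
    // Counts how many times each host appears across all rows of hints.
    static Map<String, Integer> countHosts(String[][] hints) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String[] hostsForFile : hints) {
            for (String host : hostsForFile) {
                counts.merge(host, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // The four sample lines from the original message, one row per file.
        String[][] sample = {
            {"gaiawl07.net4.lan", "gaiawl02.net4.lan"},
            {"gaiawl03.net4.lan", "gaiawl02.net4.lan"},
            {"gaiawl08.net4.lan", "gaiawl02.net4.lan"},
            {"gaiawl02.net4.lan", "gaiawl06.net4.lan"},
        };
        // Prints one "host count" line per host, sorted by host name.
        countHosts(sample).forEach((host, n) -> System.out.println(host + " " + n));
    }
}
```

A host whose count equals the number of files holds a replica of every
file in the sample, which is the situation described in the question.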
