hadoop-hdfs-user mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: Even HDFS data distribution
Date Fri, 27 Nov 2009 22:01:49 GMT
Hi Igor,

If you can install hadoop (the client) on a node outside your cluster and
run the hadoop fs -copyFromLocal from that outside node, the blocks will be
spread evenly across your cluster. Of course the outside node needs network
access to all of the datanodes, so this may not work for you if you've got
firewall rules in place there.
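
For example, something along these lines from the outside node should do it
(the config path here is just a placeholder, and the log file name is taken
from your example; adjust both for your setup):

  # run on the client node outside the cluster; HADOOP_CONF_DIR should point
  # at a copy of the cluster's configuration so the client can find the namenode
  export HADOOP_CONF_DIR=/etc/hadoop/conf
  hadoop fs -copyFromLocal /mnt/accesslog-agregated.2009-10-04.log /logs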

The other option is to keep doing what you're doing, but be sure to
periodically run the DFS balancer. The balancer ensures that the DFS usage
is spread evenly across your datanodes.
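
For example (the threshold is the allowed spread in per-node disk usage, in
percent; 10 is just a reasonable starting point):

  # run from the namenode or any node with the hadoop scripts installed
  hadoop balancer -threshold 10
  # or start it as a background daemon:
  bin/start-balancer.sh -threshold 10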

Hope that helps
-Todd

On Fri, Nov 27, 2009 at 12:00 PM, Igor Katkov <ikatkov@gmail.com> wrote:

> Hi,
>
> What are the usual approaches/techniques for achieving even HDFS data
> distribution?
> I have a bunch of files (logs) outside of HDFS. If I copy them all to
> a node within the HDFS cluster and then do something like
>
> ./hadoop fs -copyFromLocal /mnt/accesslog-agregated.2009-10-04.log /logs
>
> it writes the first replica of each block locally and only then to some
> other node. If I do that 100 times, most of the data ends up sitting on
> the host I am doing these operations on.
>
> It would be nice to pick a host at random and store the very first block
> there.
> The only workaround I can see right away is to manually split these log
> files into as many sets as I have HDFS nodes, upload/scp each set to a
> different HDFS node, and then run ./hadoop fs -copyFromLocal on each.
> That is surely a lot of manual work, so I guess there must be a trick
> to make it happen with much less hassle.
>
> Ideas?
>
> P.S. I googled it, but did not find any relevant discussions.
>
