hadoop-hdfs-user mailing list archives

From Igor Katkov <ikat...@gmail.com>
Subject Even HDFS data distribution
Date Fri, 27 Nov 2009 20:00:05 GMT

What are the usual approaches/techniques to achieve even HDFS data distribution?
I have a bunch of files (logs) outside of HDFS. If I copy them all to
a node within HDFS and then do something like

./hadoop fs -copyFromLocal /mnt/accesslog-agregated.2009-10-04.log /logs

it writes the first replica of each block locally and the remaining
replicas to other nodes. If I do that 100 times, most of the data will
end up sitting on the host I am running these operations on.
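
(To illustrate what I mean: my understanding is that the first replica
lands locally only because the client happens to run on a DataNode;
the hostname below is made up.)

# running on datanode-01, which is itself a DataNode in the cluster
./hadoop fs -copyFromLocal /mnt/accesslog-agregated.2009-10-04.log /logs
# first replica of every block stays on datanode-01;
# only the second and third replicas go to other nodes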

It would be nice to pick a host at random and store the very first block there.
The only workaround I can see right away is to manually split these log
files into as many sets as I have HDFS nodes, upload/scp one set to each
node, and then run ./hadoop fs -copyFromLocal there (rough sketch below).
This is surely a lot of manual work, so I guess there must be a trick
to make it happen with much less hassle.
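
Roughly what I have in mind, as a sketch (hostnames, the chunk size,
and the hadoop install path are all made up):

# split the aggregated log into roughly equal pieces, one per DataNode
split -l 1000000 /mnt/accesslog-agregated.2009-10-04.log chunk.

# pair each chunk with a node and upload from that node,
# so each node writes its own first replicas locally
nodes="datanode-01 datanode-02 datanode-03"
set -- chunk.*
for node in $nodes; do
  scp "$1" $node:/tmp/
  ssh $node "/usr/local/hadoop/bin/hadoop fs -copyFromLocal /tmp/$1 /logs"
  shift
done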


P.S. I googled it, but did not find any relevant discussions.
