hadoop-common-user mailing list archives

From: Chris Fellows <chrisc_fell...@yahoo.com>
Subject: Re: multiple file -put in dfs
Date: Fri, 02 Nov 2007 15:53:32 GMT
Thanks for the response. I ended up using distcp, which I felt worked well and was quite
straightforward. But since the source machine was part of the cluster, I did end up with a
fairly high imbalance. Ted noted several ways of balancing the cluster using replication.
Are there any plans to introduce automatic balancing, so that during idle time the namenode
can balance out its nodes?
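
For reference, the copy itself was just one distcp run, roughly like the following; the
source path and namenode address here are placeholders, not the real ones:

    bin/hadoop distcp file:///data/incoming hdfs://namenode:9000/myfiles

One caveat: a file:// source only works if that path is readable from every node that runs
a map task, since distcp is itself a MapReduce job.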

>You also have to watch out if you start writing from a host in your cluster
>else you will wind up with odd imbalances in file storage.  In my case, the
>source of the data is actually outside of the cluster and I get pretty good
>balancing.

>If you do wind up with bad balancing, the best option I have seen is to
>increase the replication on individual files for 30-60 seconds and then
>decrease it again.  In order to get sufficient throughput for the
>rebalancing, I pipeline lots of these changes so that I have 10-100 files at
>a time with higher replication.  This does tend to substantially increase
>the number of files with excess replication, but that corrects itself pretty
>quickly.

----- Original Message ----
From: Ted Dunning <tdunning@veoh.com>
To: hadoop-user@lucene.apache.org
Sent: Wednesday, October 31, 2007 5:48:54 PM
Subject: Re: multiple file -put in dfs



This only handles the problem of putting lots of files.  It doesn't deal
with putting files in parallel (at once).

This is a ticklish problem since even on a relatively small cluster, dfs can
take data in faster than most source storage can read it out.  That means
that you can swamp things pretty easily.

When I have files on a single source machine, I just spawn multiple -put's
on sub-directories until I have sufficiently saturated the read speed of the
source.  If all of the cluster members have access to a universal file
system, then you can use the (undocumented) pdist command, but I don't like
that as much.
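
In shell terms it comes down to something like this; the local paths and the
concurrency cap are only an example:

    # one -put per sub-directory of the (made-up) source tree, a few at a time
    for dir in /data/source/*/ ; do
        bin/hadoop dfs -put "$dir" /myfiles/"$(basename "$dir")" &
        # cap the number of concurrent puts so the source disks aren't thrashed
        while [ "$(jobs -r | wc -l)" -ge 4 ]; do sleep 1; done
    done
    wait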

You also have to watch out if you start writing from a host in your cluster
else you will wind up with odd imbalances in file storage.  In my case, the
source of the data is actually outside of the cluster and I get pretty good
balancing.

If you do wind up with bad balancing, the best option I have seen is to
increase the replication on individual files for 30-60 seconds and then
decrease it again.  In order to get sufficient throughput for the
rebalancing, I pipeline lots of these changes so that I have 10-100 files at
a time with higher replication.  This does tend to substantially increase
the number of files with excess replication, but that corrects itself pretty
quickly.
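
The mechanics are just hadoop dfs -setrep; a rough sketch, with the file batch
and the replication factors made up:

    # temporarily raise replication on a batch of files, give the namenode
    # time to place the extra copies, then drop back to the normal factor
    FILES="/myfiles/dir1/file1 /myfiles/dir1/file2"   # placeholder batch
    for f in $FILES; do bin/hadoop dfs -setrep 6 "$f"; done
    sleep 60
    for f in $FILES; do bin/hadoop dfs -setrep 3 "$f"; done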


On 10/31/07 1:53 PM, "Aaron Kimball" <ak@cs.washington.edu> wrote:

> hadoop dfs -put will take a directory. If it won't work recursively,
> then you can probably bang out a bash script that will handle it using
> find(1) and xargs(1).
> 
> -- Aaron
> 
> Chris Fellows wrote:
>> Hello!
>> 
>> Quick simple question, hopefully someone out there could answer.
>> 
>> Does the hadoop dfs support putting multiple files at once?
>> 
>> The documentation says -put only works on one file. What's the best way to
>> import multiple files in multiple directories (i.e. dir1/file1 dir1/file2
>> dir2/file1 dir2/file2 etc)?
>> 
>> End goal would be to do something like:
>> 
>>     bin/hadoop dfs -put /dir*/file* /myfiles
>> 
>> And a follow-up: bin/hadoop dfs -lsr /myfiles
>> would list:
>> 
>> /myfiles/dir1/file1
>> /myfiles/dir1/file2
>> /myfiles/dir2/file1
>> /myfiles/dir2/file2
>> 
>> Thanks again for any input!!!
>> 
>> - chris
>> 
>> 
>>   
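
For the dir*/file* layout in the original question, the find/xargs route Aaron
describes would look roughly like this; the local path is a placeholder and it
assumes GNU xargs:

    # mirror dir1/file1, dir2/file2, ... into /myfiles, keeping the
    # directory part of each name
    cd /local/data
    for d in dir*; do bin/hadoop dfs -mkdir /myfiles/"$d"; done
    find dir* -type f -print | xargs -I{} bin/hadoop dfs -put {} /myfiles/{}

After that, bin/hadoop dfs -lsr /myfiles lists /myfiles/dir1/file1 and so on,
as asked for in the question.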




