hadoop-common-user mailing list archives

From "Joydeep Sen Sarma" <jssa...@facebook.com>
Subject RE: DFS Block Allocation
Date Fri, 21 Dec 2007 04:30:19 GMT
I presume you meant that the act of 'mounting' itself is not bad - but letting the entire cluster
start reading from a hapless filer is :-)
I have actually found it very useful to upload files through map-reduce. We have periodic jobs
that are in effect tailing NFS files and copying data to HDFS. Because of random job placement,
the data is uniformly distributed. And because we run periodically, we usually don't need more
than a task or two to copy in parallel.
The nice thing is that if we do ever fall behind (network glitches, filer overload, whatever)
- the code automatically increases the number of readers to catch up (with certain bounds
on the number of concurrent readers). (Something I would have a lot more trouble doing outside
of map-reduce.)
The low-hanging fruit we can contribute back are improvements to distcp (wildcards, parallel
transfer of large text files) - but the larger setup is interesting (almost like a self-adjusting
parallel rsync), though it probably needs more generalization for wider use.
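The "self-adjusting" part - scaling the number of readers with the backlog, within bounds - might look roughly like the sketch below. All names, the cap of 4 readers, and the sizing rule are illustrative assumptions, not Facebook's actual code:

```python
import math

def readers_needed(backlog_bytes, bytes_per_task, max_readers=4):
    """How many parallel copy tasks to run for a given backlog.

    In steady state one tailing task suffices; when the copy falls
    behind (network glitch, filer overload), the count grows with
    the backlog but is capped so the NFS filer is never swamped.
    The cap of 4 and the sizing rule are illustrative only.
    """
    if backlog_bytes <= 0:
        return 1  # caught up: a single tailing task is enough
    return min(max_readers, max(1, math.ceil(backlog_bytes / bytes_per_task)))
```

For example, with a 3 GB backlog and 1 GB per task this would schedule three readers; a 100 GB backlog would still be held to the cap of four.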


From: Ted Dunning [mailto:tdunning@veoh.com]
Sent: Thu 12/20/2007 7:12 PM
To: hadoop-user@lucene.apache.org
Subject: Re: DFS Block Allocation

Distcp is a map-reduce program where the maps read the files.  This means
that all of your tasknodes have to be able to read the files in question.

Many times it is easier to have a writer push the files to the cluster,
especially if you are reading data from a conventional Unix file system.  It
would be a VERY bad idea to mount an NFS file system on an entire cluster.

On 12/20/07 7:06 PM, "Rui Shi" <shearershot@yahoo.com> wrote:

> Hi,
> I am confused a bit. What is the difference if I use "hadoop distcp" to upload
> files? I assume "hadoop distcp" uses multiple trackers to upload files in
> parallel.
> Thanks,
> Rui
> ----- Original Message ----
> From: Ted Dunning <tdunning@veoh.com>
> To: hadoop-user@lucene.apache.org
> Sent: Thursday, December 20, 2007 6:01:50 PM
> Subject: Re: DFS Block Allocation
> On 12/20/07 5:52 PM, "C G" <parallelguy@yahoo.com> wrote:
>> Ted, when you say "copy in the distro" do you need to include the
>> configuration files from the running grid? You don't need to actually
>> start HDFS on this node, do you?
> You are correct.  You only need the config files (and the hadoop script
> helps make things easier).
>> If I'm following this approach correctly, I would want to have an "xfer
>> server" whose job it is to essentially run dfs -copyFromLocal on all
>> inbound-to-HDFS data. Once I'm certain that my data has copied
>> correctly, I can delete the local files on the xfer server.
> Yes.
>> This is great news, as my current system wastes a lot of time copying
>> data from data acquisition servers to the master node. If I can copy
>> directly from my acquisition servers then I am a happy guy....
> You are a happy guy.
> If your acquisition systems can see all of your datanodes.
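The xfer-server workflow quoted above (copy in, verify, then delete the local file) could be sketched like this. The subcommand names follow the "hadoop dfs" CLI discussed in the thread; the -ls existence check, the function, and the paths are my own illustration, not code from the thread:

```python
import os
import subprocess

def upload_and_verify(local_path, hdfs_path, dry_run=False):
    """Push one file from the xfer server into HDFS, check that it
    arrived, then reclaim the local copy.

    Illustrative sketch only: the -ls step is a cheap existence
    check, not a full integrity verification.
    """
    cmds = [
        ["hadoop", "dfs", "-copyFromLocal", local_path, hdfs_path],
        ["hadoop", "dfs", "-ls", hdfs_path],  # fails if the copy did not land
    ]
    if dry_run:
        return cmds  # let callers inspect the commands without a cluster
    for cmd in cmds:
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    os.remove(local_path)  # only delete the local file once HDFS has it
```

With dry_run=True the function just returns the command lists, which makes the sequencing easy to inspect (or test) without a running grid.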
