hadoop-common-user mailing list archives

From "Phillip Wu" <...@helio.com>
Subject RE: Solving the "hang" problem in dfs -copyToLocal/-cat...
Date Wed, 27 Feb 2008 22:56:57 GMT
Very helpful information.

Is there any way to put files into DFS remotely, e.g. via an HTTP POST?
Or do I have to keep using copyFromLocalFile?



mobile . 626.234.7515 . yim . heliophillip
-----Original Message-----
From: C G [mailto:parallelguy@yahoo.com] 
Sent: Wednesday, February 27, 2008 2:46 PM
To: core-user@hadoop.apache.org
Subject: RE: Solving the "hang" problem in dfs -copyToLocal/-cat...

I haven't looked at the source code to see how -cat is implemented, but
I was pretty surprised at the results as well.  When I sat down to do
this experiment I figured I was wasting my time... surprisingly, I was not.
  C G

Joydeep Sen Sarma <jssarma@facebook.com> wrote:
  This is amazing...

Wouldn't dfs -cat use the same dfs client codepath that an actual
map-reduce program would? (If so, should it also start using an http
client instead, at least for the non-local case?)

Or maybe it already does?

-----Original Message-----
From: Ted Dunning [mailto:tdunning@veoh.com] 
Sent: Wednesday, February 27, 2008 12:10 PM
To: core-user@hadoop.apache.org
Subject: Re: Solving the "hang" problem in dfs -copyToLocal/-cat...

Have you tried using http to fetch the file instead?


This will get redirected to one of the datanodes to handle and should be
pretty fast. It would be interesting to find out if this alternative
is subject to the same hangs that you are seeing.

On 2/27/08 12:05 PM, "C G" wrote:

> Hi All:
> The following write-up is offered to help out anybody else who has
> performance problems and "hangs" while using dfs -copyToLocal/-cat.
> One of the things that has been causing big problems for us has been
> using the dfs commands -copyToLocal and -cat to move data from HDFS to
> a local file system. We do this in order to populate a data warehouse
> that is HDFS-unaware.
> The "pattern" I've been using is:
> rm -f loadfile.dat
> fileList=`bin/hadoop dfs -ls /foo | grep part | awk '{print $1}'`
> for x in `echo ${fileList}`
> do
> bin/hadoop dfs -cat ${x} >> loadfile.dat
> done
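[Editor's note: the fileList filter in the loop above can be exercised on its own. Here is a minimal, self-contained sketch using made-up `dfs -ls`-style listing output (the paths and extra columns are illustrative stand-ins, not real Hadoop output), showing that the grep/awk pair keeps only the part-file paths:]

```shell
# Fake `hadoop dfs -ls`-style listing; the paths and extra columns
# are illustrative stand-ins, not real Hadoop output.
ls_output='/foo/part-00000 <r 3> 1048576
/foo/part-00001 <r 3> 1048576
/foo/_logs <r 1> 0'

# Same filter as in the loop above: keep lines mentioning "part",
# then print the first (path) column.
echo "$ls_output" | grep part | awk '{print $1}'
```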
> This pattern repeats several times, ultimately cat-ing 353 files into
> several load files. This process is extremely slow, often taking
> minutes to transfer 142M of data. More frustrating is that the system
> "pauses" during cat operations. There is no I/O activity, no CPU
> activity, nothing written to the log files on any node. Things just
> stop. I changed the pattern to use -copyToLocal instead of -cat and
> had the same results. We observe this "pause" behavior without respect
> for where the -copyToLocal or -cat originates - I've tried running
> directly on the grid, and also on the DB server which is not part of
> the grid proper. I've tried different releases of Hadoop, including
> 0.16.0, and all exhibit this behavior.
> I decided to try a different approach and use the HTTP interface to
> the namenode to transfer the data:
>
> rm -f loadfile.dat
> fileList=`bin/hadoop dfs -ls /foo | grep part | awk '{print $1}'`
> for x in `echo ${fileList}`
> do
>   wget -q http://mynamenodeserver:50070/data${x}
> done
> There is a trivial step to merge the individual part files into one
> file preparatory for loading data.
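[Editor's note: that merge step can be sketched in isolation. This is a minimal, self-contained illustration with stand-in part files (the filenames and contents are made up, not the author's actual data); the part files concatenate in lexical order into the single load file:]

```shell
# Work in a scratch directory with stand-in part files.
workdir=$(mktemp -d)
cd "$workdir"
printf 'row1\n' > part-00000   # placeholders for files fetched via wget
printf 'row2\n' > part-00001

# Concatenate in lexical order into the single load file.
cat part-* > loadfile.dat
cat loadfile.dat
```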
> I ran this experiment across 10,850 files containing an aggregate
> total of 4.6G of data. It ran in under 2 hours, which while not great
> is better than the 18 hours it previously took -copyToLocal/-cat to
> run. I found it surprising that this solution works better than
> -copyToLocal/-cat.
>
> Hope this helps...
> C G
> ---------------------------------
> Looking for last minute shopping deals? Find them fast with Yahoo!

