hadoop-common-user mailing list archives

From Ted Dunning <tdunn...@veoh.com>
Subject Re: Solving the "hang" problem in dfs -copyToLocal/-cat...
Date Wed, 27 Feb 2008 20:10:22 GMT

Oops.  Should have read the rest of your posting.  Sorry about the noise.


On 2/27/08 12:05 PM, "C G" <parallelguy@yahoo.com> wrote:

> Hi All:
>    
>   The following write-up is offered to help out anybody else who has seen
> performance problems and "hangs" while using dfs -copyToLocal/-cat.
>    
>   The performance problem that has hurt us most has been using the dfs
> commands -copyToLocal and -cat to move data from HDFS to a local file system.
> We do this in order to populate a data warehouse that is HDFS-unaware.
>    
>   The "pattern" I've been using is:
>    
>   rm -f loadfile.dat
>   fileList=`bin/hadoop dfs -ls /foo | grep part | awk '{print $1}'`
>   for x in `echo ${fileList}`
>   do
>      bin/hadoop dfs -cat ${x} >> loadfile.dat
>   done
>    
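> (As an aside, the same concatenation can often be expressed as a single
> dfs -getmerge call, if the Hadoop release in use provides it.  A minimal
> sketch, assuming /foo holds only the part files to be merged:)
>    
>   # Merge every file under /foo into one local file in one shot,
>   # instead of cat-ing the parts into it one at a time.
>   bin/hadoop dfs -getmerge /foo loadfile.dat
>    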
>   This pattern repeats several times, ultimately cat-ing 353 files into
> several load files.  This process is extremely slow, often taking 20-30
> minutes to transfer 142 MB of data.  More frustrating is that the system
> simply "pauses" during cat operations: there is no I/O activity, no CPU
> activity, and nothing written to the log files on any node.  Things just
> stop.  I changed the pattern to use -copyToLocal instead of -cat and saw the
> same results.  We observe this "pause" behavior regardless of where the
> -copyToLocal or -cat originates: I've tried running directly on the grid, and
> also directly on the DB server, which is not part of the grid proper.  I've
> tried many different releases of Hadoop, including 0.16.0, and all exhibit
> this problem.
>    
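> (If you hit the same pause, one quick way to see where the client is stuck
> is to thread-dump the dfs client JVM while it is hung.  A minimal sketch,
> assuming a single hung dfs shell client on the box:)
>    
>   # SIGQUIT does not kill a HotSpot JVM; it makes the JVM print a stack
>   # trace of every thread to its stdout, which shows what the client is
>   # blocked on.  The dfs commands run as org.apache.hadoop.fs.FsShell.
>   pid=`ps -ef | grep org.apache.hadoop.fs.FsShell | grep -v grep | awk '{print $2}'`
>   kill -QUIT ${pid}
>    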
>   I decided to try a different approach and use the HTTP interface to the
> namenode to transfer the data:
>    
>   rm -f loadfile.dat
>   fileList=`bin/hadoop dfs -ls /foo | grep part | awk '{print $1}'`
>   for x in `echo ${fileList}`
>   do
>    wget -q http://mynamenodeserver:50070/data${x}
>   done
>    
>   There is a trivial step to merge the individual part files into one file
> in preparation for loading the data.
>    
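> (That merge can also be folded into the download loop itself by streaming
> each part to stdout with wget's -O - option and appending it to the load
> file.  A minimal sketch, reusing the same file list:)
>    
>   rm -f loadfile.dat
>   fileList=`bin/hadoop dfs -ls /foo | grep part | awk '{print $1}'`
>   for x in `echo ${fileList}`
>   do
>      # -O - writes each fetched part to stdout, so it can be appended
>      # straight onto the load file with no per-part files left behind.
>      wget -q -O - http://mynamenodeserver:50070/data${x} >> loadfile.dat
>   done
>    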
>   I ran this experiment across 10,850 files containing an aggregate total of
> 4.6 GB of data.  It ran in under 2 hours, which, while not great, is
> significantly better than the 18 hours the -copyToLocal/-cat approach
> previously took.
>    
>   I found it surprising that this solution works better than
> -copyToLocal/-cat.
>    
>   Hope this helps...
>   C G
>    
> 

