hadoop-common-user mailing list archives

From C G <parallel...@yahoo.com>
Subject Solving the "hang" problem in dfs -copyToLocal/-cat...
Date Wed, 27 Feb 2008 20:05:57 GMT
Hi All:
   
  The following write-up is offered to help anyone else who has seen performance problems
and "hangs" while using dfs -copyToLocal/-cat.
   
  One performance problem that has caused us a great deal of trouble is using the dfs
commands -copyToLocal and -cat to move data from HDFS to a local file system.  We do this
in order to populate a data warehouse that is HDFS-unaware.
   
  The "pattern" I've been using is:
   
  rm -f loadfile.dat
  fileList=`bin/hadoop dfs -ls /foo | grep part | awk '{print $1}'`
  for x in ${fileList}
  do
     bin/hadoop dfs -cat ${x} >> loadfile.dat
  done
   
  This pattern repeats several times, ultimately cat'ing 353 files into several load files.
 The process is extremely slow, often taking 20-30 minutes to transfer 142M of data.  More
frustrating is that the system simply "pauses" during cat operations: there is no I/O activity,
no CPU activity, and nothing written to the log files on any node.  Things just stop.  I
changed the pattern to use -copyToLocal instead of -cat and saw the same results.  We observe
this "pause" behavior regardless of where the -copyToLocal or -cat originates - I've tried
running directly on the grid, and also on the DB server, which is not part of the grid
proper.  I've tried many releases of Hadoop, including 0.16.0, and all exhibit this problem.
   
  I decided to try a different approach and use the HTTP interface to the namenode to transfer
the data:
   
  rm -f loadfile.dat
  fileList=`bin/hadoop dfs -ls /foo | grep part | awk '{print $1}'`
  for x in ${fileList}
  do
     wget -q http://mynamenodeserver:50070/data${x}
  done
   
  There is a trivial final step that merges the individual part files into one file before
loading the data.
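  The merge itself can be plain concatenation - a sketch, assuming the wget loop above left
the part files (part-00000, part-00001, ...) in the current directory; adjust the glob if
your file names differ:

```shell
# Concatenate the downloaded part files, in sorted order, into one load file.
rm -f loadfile.dat
for f in `ls part-* | sort`
do
   cat ${f} >> loadfile.dat
done
```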
   
  I ran this experiment across 10,850 files containing an aggregate of 4.6G of data.  It
ran in under 2 hours, which, while not great, is significantly better than the 18 hours
that -copyToLocal/-cat previously took.
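  For perspective, here is the rough throughput implied by those numbers - a back-of-the-envelope
calculation only, treating "20-30 minutes" as 25 minutes and "under 2 hours" as a full 2 hours:

```shell
# Approximate transfer rates from the figures quoted above.
awk 'BEGIN {
  cat_rate  = 142 / (25 * 60);            # dfs -cat: 142 MB in ~25 min
  http_rate = (4.6 * 1024) / (2 * 3600);  # HTTP: 4.6 GB in ~2 h
  printf "dfs -cat: %.2f MB/s  HTTP: %.2f MB/s  speedup: %.1fx\n",
         cat_rate, http_rate, http_rate / cat_rate
}'
# prints: dfs -cat: 0.09 MB/s  HTTP: 0.65 MB/s  speedup: 6.9x
```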
   
  I found it surprising that this solution works better than -copyToLocal/-cat. 
   
  Hope this helps...
  C G
   

       