hadoop-common-user mailing list archives

From "Joydeep Sen Sarma" <jssa...@facebook.com>
Subject Re: Compression using Hadoop...
Date Fri, 31 Aug 2007 19:58:50 GMT
One thing I did to speed up copy/put was to write a simple
map-reduce job that copies files in parallel from an input directory (in
our case the input directory is NFS-mounted on all task nodes). It
gives us a huge speed boost.

It's trivial to roll your own, but I would be happy to share as well.
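
For readers who want a starting point, here is a minimal sketch of the
parallel-copy idea. It is not the map-reduce job described above: it swaps in
a thread pool on a single machine, with one copy task per file. The names
`parallel_copy` and `hdfs_put` are hypothetical, and the sketch assumes a
`hadoop` binary on PATH that accepts `fs -copyFromLocal <src> <dst>`.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def hdfs_put(local_path, hdfs_dir):
    """Copy one local file into HDFS by shelling out to the hadoop CLI.

    Assumption: `hadoop fs -copyFromLocal <src> <dst>` works on this
    install; adjust the command for your Hadoop version.
    """
    subprocess.run(
        ["hadoop", "fs", "-copyFromLocal", str(local_path), hdfs_dir],
        check=True,  # raise if the copy fails
    )
    return str(local_path)


def parallel_copy(input_dir, hdfs_dir, workers=8, copy_one=hdfs_put):
    """Copy every regular file in input_dir concurrently.

    copy_one is injectable so the fan-out logic can be exercised
    without a running cluster.
    """
    files = sorted(p for p in Path(input_dir).iterdir() if p.is_file())
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so results line up with `files`.
        return list(pool.map(lambda p: copy_one(p, hdfs_dir), files))
```

A call might look like `parallel_copy("/mnt/nfs/input", "/data/input")`, where
both paths are placeholders. A real map-reduce version would instead hand each
map task a list of file names to copy, so the copies run on the task nodes.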


-----Original Message-----
From: C G [mailto:parallelguy@yahoo.com] 
Sent: Friday, August 31, 2007 11:21 AM
To: hadoop-user@lucene.apache.org
Subject: RE: Compression using Hadoop...

My input is typical row-based stuff across which are run a large stack
of aggregations/rollups.  After reading earlier posts on this thread, I
modified my loader to split the input up into 1M row partitions
(literally gunzip -cd input.gz | split...).  I then ran an experiment
using 50M rows (i.e. 50 gz files loaded into HDFS) on an 8-node cluster.
Ted, from what you are saying I should be using at least 80 files given
the cluster size, and I should modify the loader to be aware of the
number of nodes and split accordingly. Do you concur?
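
The splitting step described above (decompress, cut into fixed-size row
partitions, recompress each partition) could be sketched roughly as follows.
This is an illustration under assumptions, not the loader actually used: the
function name `split_gzip_by_rows` and the `part-NNNNN.gz` naming are
hypothetical, and rows are assumed to be newline-terminated.

```python
import gzip
from pathlib import Path


def split_gzip_by_rows(src, out_dir, rows_per_part=1_000_000, prefix="part"):
    """Split one gzip file into gzip parts of at most rows_per_part rows."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    parts = []
    out = None
    try:
        with gzip.open(src, "rb") as reader:
            for i, line in enumerate(reader):
                if i % rows_per_part == 0:
                    # Start a new part every rows_per_part rows.
                    if out is not None:
                        out.close()
                    path = out_dir / f"{prefix}-{len(parts):05d}.gz"
                    out = gzip.open(path, "wb")
                    parts.append(path)
                out.write(line)
    finally:
        # Close the last part even if reading fails midway.
        if out is not None:
            out.close()
    return parts
```

Each output file is an independent gzip stream, so each one can be handed to
its own map task even though no single file is splittable.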
   
  Load time to HDFS may be the next challenge.  My HDFS configuration on
8 nodes uses a replication factor of 3.  Sequentially copying my data to
HDFS using -copyFromLocal took 23 minutes to move 266M in individual
files of 5.7M each.  Does anybody find this result surprising?  Note
that this is on EC2, where there is no such thing as rack-level or
switch-level locality.  Should I expect dramatically better performance
on real iron?  Once I get this prototyping/education under my belt, my
plan is to deploy a 64 node grid of 4 way machines with a terabyte of
local storage on each node.
   
  Thanks for the discussion...the Hadoop community is very helpful!
   
  C G 
    

Ted Dunning <tdunning@veoh.com> wrote:
  
They will only be a non-issue if you have enough of them to get the
parallelism you want. If you have number of gzip files > 10*number of
task nodes you should be fine.
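
As a rough sketch of how a loader might apply this rule of thumb, the helper
below turns "more than 10 files per task node" into a rows-per-file setting.
The function names and the default of 10 are illustrative assumptions drawn
from the rule as stated here, not part of any Hadoop API.

```python
def min_input_files(task_nodes, files_per_node=10):
    """Smallest file count strictly greater than files_per_node * task_nodes."""
    return files_per_node * task_nodes + 1


def rows_per_file(total_rows, task_nodes, files_per_node=10):
    """Rows per partition so that total_rows yields at least min_input_files files."""
    target = min_input_files(task_nodes, files_per_node)
    # Floor division keeps each part small enough to reach the target count.
    return max(1, total_rows // target)


def file_count(total_rows, rows_each):
    """Number of files produced when splitting total_rows into rows_each chunks."""
    return -(-total_rows // rows_each)  # integer ceiling division
```

For the 8-node, 50M-row experiment discussed in this thread, the threshold is
10 * 8 = 80 files, so `rows_per_file(50_000_000, 8)` gives a partition size
that produces more than 80 files, compared with the 50 files of 1M rows each
used in the experiment.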


-----Original Message-----
From: jason.gessner@gmail.com on behalf of jason gessner
Sent: Fri 8/31/2007 9:38 AM
To: hadoop-user@lucene.apache.org
Subject: Re: Compression using Hadoop...

ted, will the gzip files be a non-issue as far as splitting goes if
they are under the default block size?

C G, glad i could help a little.

-jason

On 8/31/07, C G wrote:
> Thanks Ted and Jason for your comments. Ted, your comments about gzip
> not being splittable were very timely...I'm watching my 8 node cluster
> saturate one node (with one gz file) and was wondering why. Thanks for
> the "answer in advance" :-).
>
> Ted Dunning wrote:
> With gzipped files, you do face the problem that your parallelism in
> the map phase is pretty much limited to the number of files you have
> (because gzip'ed files aren't splittable). This is often not a problem
> since most people can arrange to have dozens to hundreds of input
> files easier than they can arrange to have dozens to hundreds of CPU
> cores working on their data.


       