hadoop-common-user mailing list archives

From C G <parallel...@yahoo.com>
Subject RE: Compression using Hadoop...
Date Fri, 31 Aug 2007 18:21:26 GMT
My input is typical row-based data over which we run a large stack of aggregations/rollups.
 After reading earlier posts on this thread, I modified my loader to split the input into
1M-row partitions (literally gunzip -cd input.gz | split...).  I then ran an experiment using
50M rows (i.e., 50 gz files loaded into HDFS) on an 8-node cluster. Ted, from what you are saying,
I should be using at least 80 files given the cluster size, and I should modify the loader
to be aware of the number of nodes and split accordingly. Do you concur?
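The split-and-load step described above can be sketched as follows. This is a minimal, runnable sketch: the file names are placeholders, and the tiny sample data and 4-line chunk size stand in for the real input and the 1M-row partitions from the post.

```shell
#!/bin/sh
# Stand-in for the real input: 10 gzipped rows (in practice, the full row dump).
seq 1 10 | gzip > input.gz

# Decompress and split into fixed-size chunks named part-aa, part-ab, ...
# The post used -l 1000000 (1M rows per chunk); 4 keeps this sketch small.
gunzip -cd input.gz | split -l 4 - part-

# Recompress each chunk: since gzip files aren't splittable, each small
# .gz file then corresponds to at most one map task's input.
for f in part-*; do
  gzip "$f"
done
ls part-*.gz
```

Each resulting .gz file can then be loaded into HDFS independently, so map parallelism scales with the number of chunks rather than being stuck at one.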
  Load time to HDFS may be the next challenge.  My HDFS configuration on 8 nodes uses a replication
factor of 3.  Sequentially copying my data to HDFS using -copyFromLocal took 23 minutes to
move 266M in individual files of 5.7M each.  Does anybody find this result surprising?  Note
that this is on EC2, where there is no such thing as rack-level or switch-level locality.
 Should I expect dramatically better performance on real iron?  Once I get this prototyping/education
under my belt, my plan is to deploy a 64-node grid of 4-way machines with a terabyte of local
storage on each node.
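One way to attack the sequential-copy bottleneck is to run several client uploads in parallel. The sketch below is an assumption, not the poster's method: it uses xargs -P to fan out copies, and a local cp into dest/ as a stand-in for the HDFS client so the sketch runs anywhere. In practice you would replace the cp with your actual client invocation, e.g. something like bin/hadoop dfs -copyFromLocal.

```shell
#!/bin/sh
# Placeholder input chunks (stand-ins for the 5.7M files in the post).
mkdir -p src dest
for i in 1 2 3 4 5 6 7 8; do
  seq 1 100 > "src/chunk-$i"
done

# -P 4: run up to four copy processes at once (tune to client/network
# capacity). 'cp {} dest/' is a local stand-in -- substitute the real
# HDFS upload command here.
ls src/* | xargs -P 4 -I{} cp {} dest/

ls dest | wc -l
```

Since each file is an independent upload, running several clients concurrently can hide per-file round-trip latency, though replication traffic may still be the limiting factor.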
  Thanks for the discussion...the Hadoop community is very helpful!
  C G 

Ted Dunning <tdunning@veoh.com> wrote:
They will only be a non-issue if you have enough of them to get the parallelism you want.
If you have (number of gzip files) > 10 * (number of task nodes), you should be fine.
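For the 8-node cluster discussed earlier in the thread, that rule of thumb works out as:

```shell
#!/bin/sh
# Rule of thumb from above: want (number of gzip files) > 10 * (task nodes).
NODES=8                        # cluster size from the experiment above
MIN_FILES=$((10 * NODES))
echo "$MIN_FILES"              # minimum file count for decent parallelism
```

which matches the "at least 80 files" figure mentioned above.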

-----Original Message-----
From: jason.gessner@gmail.com on behalf of jason gessner
Sent: Fri 8/31/2007 9:38 AM
To: hadoop-user@lucene.apache.org
Subject: Re: Compression using Hadoop...

ted, will the gzip files be a non-issue as far as splitting goes if
they are under the default block size?

C G, glad i could help a little.


On 8/31/07, C G wrote:
> Thanks Ted and Jason for your comments. Ted, your comments about gzip not being splittable
> were very timely...I'm watching my 8-node cluster saturate one node (with one gz file) and
> was wondering why. Thanks for the "answer in advance" :-).
> Ted Dunning wrote:
> With gzipped files, you do face the problem that your parallelism in the map
> phase is pretty much limited to the number of files you have (because
> gzipped files aren't splittable). This is often not a problem, since most
> people can arrange to have dozens to hundreds of input files more easily
> than they can arrange to have dozens to hundreds of CPU cores working on
> their data.
