hadoop-common-dev mailing list archives

From "Segel, Mike" <mse...@navteq.com>
Subject RE: Hadoop Compression - Current Status
Date Wed, 14 Jul 2010 15:26:14 GMT
Sorry for the delay in responding back...

Yes, that's kind of my point. 

You gain some efficiency; however, you currently pay for it by losing your parallelism,
which is what really gives you more bang for your buck.

I'm not sure what I can say about stuff going on at my current client, but I can say the following...

We're storing records in HBase using a SHA-1 hash as the record key. So we're getting good
distribution across the cloud when the tables get large.
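
To illustrate why hashed keys distribute well, here is a minimal sketch. It is not the client's actual schema: the `record-N` ids, the `row_key` and `region_for` helpers, and the 16 fixed, equal-width key ranges are all assumptions made for the example (real HBase regions split by size, not into fixed ranges).

```python
import hashlib
from collections import Counter

def row_key(record_id: str) -> str:
    """Hex SHA-1 digest of a record id, used as the row key (hypothetical scheme)."""
    return hashlib.sha1(record_id.encode("utf-8")).hexdigest()

def region_for(key: str, num_regions: int = 16) -> int:
    """Map a hex key to one of num_regions equal-width key ranges (a simplification)."""
    return int(key[:4], 16) * num_regions // 0x10000

# Sequential ids used directly as keys would sort next to each other and
# pile into one key range; their SHA-1 digests scatter across all ranges.
counts = Counter(region_for(row_key(f"record-{i}")) for i in range(100_000))

for region in sorted(counts):
    print(region, counts[region])
```

Each of the 16 ranges should end up with roughly the same share of the 100K keys, which is the "good distribution" described above.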

So suppose we're running a job that accesses 100K records.
If the table contains only those 100K records, it is spread over fewer regions, so we get
fewer splits.
If the table contains 15 million rows and we still want to process only those 100K records,
we'll get more splits and better utilization of the cloud.

Granted, this is HBase and not strictly Hadoop, but the point remains the same: you become
more efficient through parallelism, and when you restrict your ability to run m/r tasks in
parallel, your overall run time suffers.
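
The effect of the split count on run time can be sketched with a rough back-of-the-envelope model. Everything here is an assumption for illustration: `estimated_wall_clock` is a hypothetical helper, the work is assumed to divide evenly across splits, and task start-up cost, data skew and the reduce phase are ignored.

```python
import math

def estimated_wall_clock(num_splits: int, task_slots: int, work_seconds: float) -> float:
    """Rough map-phase time when work_seconds of work is divided evenly over
    num_splits tasks and at most task_slots tasks run at the same time."""
    per_task = work_seconds / num_splits          # each task gets an equal share
    waves = math.ceil(num_splits / task_slots)    # rounds needed to run all tasks
    return waves * per_task

# One hour of total map work on a cluster with 40 map slots (made-up numbers).
for splits in (1, 4, 40, 80):
    print(splits, estimated_wall_clock(splits, 40, 3600.0))
```

Under these assumptions a single unsplittable input takes the full 3600 seconds no matter how many slots are free, while 40 splits finish in 90 seconds. Going beyond the slot count (80 splits) does not help further in this model.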

So until you get MAPREDUCE-491 or the hadoop-lzo input formats, I think Stephen's assertion
is incorrect.

This is a bit of a nit, since Stephen's main concern seems to be a 'poisoned GPL', but his
comment about performance is incorrect.

It seems your performance is going to be better if you avoid anything that restricts your
number of m/r tasks.

-Mike


-----Original Message-----
From: patrickangeles@gmail.com [mailto:patrickangeles@gmail.com] On Behalf Of Patrick Angeles
Sent: Monday, July 12, 2010 2:13 PM
To: common-dev@hadoop.apache.org
Subject: Re: Hadoop Compression - Current Status

Also, fwiw, the use of codecs and the use of SequenceFiles are somewhat orthogonal.
You'll still have to compress the SequenceFile with a codec, be it gzip, bz2 or
lzo. SequenceFiles do get you splittability, which you otherwise won't get with
plain Gzip (until we get MAPREDUCE-491) or with LZO (without the hadoop-lzo
InputFormats).
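
The difference between one compressed stream and independently compressed blocks can be shown without Hadoop. The following is only a sketch of the idea, not the SequenceFile format: the record contents, `BLOCK` size, and the `readable_from` and `read_block` helpers are made up for the example, and the blocks live in a Python list, whereas a real SequenceFile writes sync markers so a reader can find the next block boundary.

```python
import gzip
import zlib

records = [f"record-{i:06d}\n".encode() for i in range(10_000)]

# Case 1: one gzip stream over the whole file.
whole = gzip.compress(b"".join(records))

def readable_from(data: bytes, offset: int) -> bool:
    """Can a reader that starts at `offset` decompress anything?"""
    try:
        gzip.decompress(data[offset:])
        return True
    except (OSError, EOFError, zlib.error):
        # Mid-stream there is no gzip header and no decoder state to resume from.
        return False

# Case 2: records grouped into blocks, each block compressed on its own.
BLOCK = 1_000  # records per block, arbitrary
blocks = [gzip.compress(b"".join(records[i:i + BLOCK]))
          for i in range(0, len(records), BLOCK)]

def read_block(n: int) -> list:
    """Decode block n without touching any other block."""
    return gzip.decompress(blocks[n]).splitlines(keepends=True)

print(readable_from(whole, 0))                # reading from the start works
print(readable_from(whole, len(whole) // 2))  # starting mid-stream does not
print(read_block(7)[0])                       # any block decodes on its own
```

A mapper handed the second half of the single stream has nothing it can decode, so the file goes to one mapper. With block compression, each mapper can be given its own run of blocks.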

cheers,

- Patrick

On Mon, Jul 12, 2010 at 2:42 PM, Segel, Mike <msegel@navteq.com> wrote:

> How can you say zip files are 'best codecs' to use?
>
> Call me silly but I seem to recall that if you're using a zip'd file for
> input you can't really use a file splitter?
> (Going from memory, which isn't the best thing to do...)
>
> -Mike
>
>
> -----Original Message-----
> From: Stephen Watt [mailto:swatt@us.ibm.com]
> Sent: Monday, July 12, 2010 1:28 PM
> To: common-dev@hadoop.apache.org
> Subject: Hadoop Compression - Current Status
>
> Please let me know if any of these assertions are incorrect. I'm going to be
> adding any feedback to the Hadoop Wiki. It seems well documented that the
> LZO Codec is the most performant codec (
> http://blog.oskarsson.nu/2009/03/hadoop-feat-lzo-save-disk-space-and.html)
> but it is GPL infected and thus it is separately maintained here -
> http://github.com/kevinweil/hadoop-lzo.
>
> With regard to performance, if you are not using SequenceFiles,
> Gzip is the next best codec to use, followed by bzip2. Hadoop has
> been able to process bzip2 and gzip input for a while now, but it could
> never split the files, i.e. it assigned one mapper per file. There are
> now 2 new features:
> - Splitting bzip2 files available in 0.21.0 -
> https://issues.apache.org/jira/browse/HADOOP-4012
> - Splitting gzip files (in progress but patch available) -
> https://issues.apache.org/jira/browse/MAPREDUCE-491
>
> 1) It appears most folks are using LZO. Given that it is GPL, are you not
> worried about it virally infecting your project?
> 2) Is anyone using the new bzip2 or gzip file split compatible readers?
> How do you like them? General feedback?
>
> Kind regards
> Steve Watt
>
>
> The information contained in this communication may be CONFIDENTIAL and is
> intended only for the use of the recipient(s) named above.  If you are not
> the intended recipient, you are hereby notified that any dissemination,
> distribution, or copying of this communication, or any of its contents, is
> strictly prohibited.  If you have received this communication in error,
> please notify the sender and delete/destroy the original message and any
> copy of it from your computer or paper files.
>

