hadoop-common-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: has bzip2 compression been deprecated?
Date Mon, 09 Jan 2012 15:34:11 GMT
Bzip2 is pretty slow. You probably do not want to use it, even though it supports file splits
(a feature not available in the stable 0.20.x/1.x line, but available in 0.22+).

To answer your question though, bzip2 was removed from that document because it isn't a native
library (it's pure Java). I think bzip2 was added earlier due to an oversight, as even 0.20
did not have a native bzip2 library. This change in the docs does not mean that BZip2 is
deprecated; it is still fully supported and available in trunk as well. See
https://issues.apache.org/jira/browse/HADOOP-6292 for the doc update changes that led to this.

A better approach would be to use either of the following:

(a) Hadoop sequence files with any compression codec of your choice (best would be lzo, gz,
maybe even snappy). This file format is built for HDFS and MR and is splittable regardless of
the codec used. Another choice would be Avro DataFiles from the Apache Avro project.
(b) LZO codecs for Hadoop, via https://github.com/toddlipcon/hadoop-lzo (and hadoop-lzo-packager
for packages). You need to run an indexing step on each .lzo file before it becomes splittable,
but it works great with this extra step added.
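
To illustrate why splittability matters for the options above, here is a minimal sketch in plain
Java. It uses only java.util.zip, not the Hadoop codecs or the real SequenceFile/LZO formats, and
the names SplitDemo, gzip, gunzipOrNull and records are made up for this example. It shows that
a single compressed stream generally cannot be decoded from an arbitrary offset, while
independently compressed blocks can each be decoded on their own, which is roughly the idea
behind block-oriented container formats and indexed files.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class SplitDemo {

    /** Compresses raw bytes into one self-contained gzip stream. */
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Tries to decompress starting at the given offset; returns null if decoding fails. */
    static byte[] gunzipOrNull(byte[] data, int offset) {
        ByteArrayInputStream in = new ByteArrayInputStream(data, offset, data.length - offset);
        try (GZIPInputStream gz = new GZIPInputStream(in)) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = gz.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            // Bad header, corrupt deflate data, or truncated input.
            return null;
        }
    }

    /** Builds "record-<i>\n" lines for i in [from, to). */
    static byte[] records(int from, int to) {
        StringBuilder sb = new StringBuilder();
        for (int i = from; i < to; i++) {
            sb.append("record-").append(i).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // One big stream: readable from the start, but not from the middle.
        byte[] whole = gzip(records(0, 1000));
        System.out.println("from start:  " + (gunzipOrNull(whole, 0) != null));
        System.out.println("from middle: " + (gunzipOrNull(whole, whole.length / 2) != null));

        // Independently compressed blocks: any block can be read without the others.
        List<byte[]> blocks = new ArrayList<>();
        for (int start = 0; start < 1000; start += 250) {
            blocks.add(gzip(records(start, start + 250)));
        }
        String third = new String(gunzipOrNull(blocks.get(2), 0), StandardCharsets.UTF_8);
        System.out.println("third block begins: " + third.substring(0, third.indexOf('\n')));
    }
}
```

In a real file the reader also needs a way to find the block boundaries, which is what sync
markers or a separate index provide; this sketch sidesteps that by keeping the blocks in a list.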

On 09-Jan-2012, at 7:17 PM, Tony Burton wrote:

> Hi,
> 
> I'm trying to work out which compression algorithm I should be using in my MapReduce
> jobs. It seems to me that the best solution is a compromise between speed, efficiency and
> splittability. The only compression algorithm to handle file splits (according to Hadoop:
> The Definitive Guide 2nd edition p78 etc) is bzip2, at the expense of compression speed.
> 
> However, I see from the documentation at http://hadoop.apache.org/common/docs/current/native_libraries.html
> that the bzip2 library is no longer mentioned, and hasn't been since version 0.20.0, see
> http://hadoop.apache.org/common/docs/r0.20.0/native_libraries.html
> - however the bzip2 Codec is still in the API at
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/compress/BZip2Codec.html.
> 
> Has bzip2 support been removed from Hadoop, or will it be removed soon?
> 
> Thanks,
> 
> Tony

