hadoop-common-user mailing list archives

From Bejoy Ks <bejoy.had...@gmail.com>
Subject Re: has bzip2 compression been deprecated?
Date Mon, 09 Jan 2012 17:33:07 GMT
Hi Tony,
       Adding on to Harsh's comments: if you want the generated sequence
files to be used by a Hive table, define your Hive table as

CREATE EXTERNAL TABLE tableName(col1 INT, col2 STRING)
...
...
....
STORED AS SEQUENCEFILE;
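
For reference, a filled-in version of such a statement might look like the sketch below. It is only an illustration under stated assumptions: the table name job_output, the tab delimiter and the HDFS path /user/tony/job-output are hypothetical placeholders, and it assumes the job wrote each record as a delimited Text value. As far as I know, Hive ignores the SequenceFile key and parses only the value using the table's row format.

```sql
-- Hypothetical names, delimiter and path; adjust them to the job's real output.
CREATE EXTERNAL TABLE job_output (col1 INT, col2 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS SEQUENCEFILE
LOCATION '/user/tony/job-output';

-- Once the table is defined, the job output can be queried like any other table.
SELECT col2, COUNT(*) FROM job_output GROUP BY col2;
```

Because the table is EXTERNAL, dropping it should remove only the Hive metadata and leave the job's files in place.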


Regards
Bejoy.K.S

On Mon, Jan 9, 2012 at 10:32 PM, alo.alt <wget.null@googlemail.com> wrote:

> Tony,
>
> snappy is also available:
> http://code.google.com/p/hadoop-snappy/
>
> best,
>  Alex
>
> --
> Alexander Lorenz
> http://mapredit.blogspot.com
>
> On Jan 9, 2012, at 8:49 AM, Harsh J wrote:
>
> > Tony,
> >
> > * Yeah, SequenceFiles aren't human-readable, but "fs -text" can read them
> > out (instead of a plain "fs -cat"). But if you are gonna export your files
> > into a system you do not have much control over, probably best to have the
> > resultant files not be in SequenceFile/Avro-DataFile format.
> > * Intermediate (M-to-R) files use a custom IFile format these days,
> > which is built purely for that purpose.
> > * Hive can use SequenceFiles very well. There is also documented info on
> > this in Hive's wiki pages (check the DDL pages, IIRC).
> >
> > On 09-Jan-2012, at 9:44 PM, Tony Burton wrote:
> >
> >> Thanks for the quick reply and the clarification about the
> >> documentation.
> >>
> >> Regarding sequence files: am I right in thinking that they're a good
> >> choice for intermediate steps in chained MR jobs, or for file transfer
> >> between the Map and the Reduce phases of a job, but that they shouldn't
> >> be used for human-readable files at the end of one or more MapReduce
> >> jobs? How about if the only use of a job's output is analysis via Hive:
> >> can Hive create tables from sequence files?
> >>
> >> Tony
> >>
> >>
> >>
> >> -----Original Message-----
> >> From: Harsh J [mailto:harsh@cloudera.com]
> >> Sent: 09 January 2012 15:34
> >> To: common-user@hadoop.apache.org
> >> Subject: Re: has bzip2 compression been deprecated?
> >>
> >> Bzip2 is pretty slow. You probably do not want to use it, even if it
> >> does file splits (a feature not available in the stable line of
> >> 0.20.x/1.x, but available in 0.22+).
> >>
> >> To answer your question though, bzip2 was removed from that document
> >> because it isn't a native library (it's pure Java). I think bzip2 was
> >> added earlier due to an oversight, as even 0.20 did not have a native
> >> bzip2 library. This change in docs does not mean that BZip2 is
> >> deprecated -- it is still fully supported and available in the trunk as
> >> well. See https://issues.apache.org/jira/browse/HADOOP-6292 for the doc
> >> update changes that led to this.
> >>
> >> The best way would be to use either:
> >>
> >> (a) Hadoop sequence files with any compression codec of choice (best
> >> would be lzo, gz, maybe even snappy). This file format is built for HDFS
> >> and MR and is splittable. Another choice would be Avro DataFiles from
> >> the Apache Avro project.
> >> (b) LZO codecs for Hadoop, via https://github.com/toddlipcon/hadoop-lzo
> >> (and hadoop-lzo-packager for packages). This requires you to run
> >> indexing operations before the .lzo can be made splittable, but works
> >> great with this extra step added.
> >>
> >> On 09-Jan-2012, at 7:17 PM, Tony Burton wrote:
> >>
> >>> Hi,
> >>>
> >>> I'm trying to work out which compression algorithm I should be using
> >>> in my MapReduce jobs.  It seems to me that the best solution is a
> >>> compromise between speed, efficiency and splittability. The only
> >>> compression algorithm to handle file splits (according to Hadoop: The
> >>> Definitive Guide 2nd edition p78 etc) is bzip2, at the expense of
> >>> compression speed.
> >>>
> >>> However, I see from the documentation at
> >>> http://hadoop.apache.org/common/docs/current/native_libraries.html that
> >>> the bzip2 library is no longer mentioned, and hasn't been since version
> >>> 0.20.0, see
> >>> http://hadoop.apache.org/common/docs/r0.20.0/native_libraries.html -
> >>> however the bzip2 Codec is still in the API at
> >>> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/compress/BZip2Codec.html
> >>> .
> >>>
> >>> Has bzip2 support been removed from Hadoop, or will it be removed soon?
> >>>
> >>> Thanks,
> >>>
> >>> Tony
> >>>
> >>>
> >>>
> >>> **********************************************************************
> >>> This email and any attachments are confidential, protected by
> >>> copyright and may be legally privileged.  If you are not the intended
> >>> recipient, then the dissemination or copying of this email is prohibited.
> >>> If you have received this in error, please notify the sender by replying by
> >>> email and then delete the email completely from your system.  Neither
> >>> Sporting Index nor the sender accepts responsibility for any virus, or any
> >>> other defect which might affect any computer or IT system into which the
> >>> email is received and/or opened.  It is the responsibility of the recipient
> >>> to scan the email and no responsibility is accepted for any loss or damage
> >>> arising in any way from receipt or use of this email.  Sporting Index Ltd
> >>> is a company registered in England and Wales with company number 2636842,
> >>> whose registered office is at Brookfield House, Green Lane, Ivinghoe,
> >>> Leighton Buzzard, LU7 9ES.  Sporting Index Ltd is authorised and regulated
> >>> by the UK Financial Services Authority (reg. no. 150404). Any financial
> >>> promotion contained herein has been issued
> >>> and approved by Sporting Index Ltd.
> >>>
> >>> Outbound email has been scanned for viruses and SPAM
> >>
> >
>
>
