hadoop-common-user mailing list archives

From "Michael E. Driscoll" <m.e.drisc...@gmail.com>
Subject Re: Cloudera 18.3 splits bz2 inputs
Date Tue, 17 Nov 2009 16:23:27 GMT
Kevin Weil, of Twitter, has done some work extending LZO compression to work
with Hadoop streaming.  See

  http://github.com/kevinweil/hadoop-lzo

MD

On Tue, Nov 17, 2009 at 8:08 AM, Todd Lipcon <todd@cloudera.com> wrote:

> On Tue, Nov 17, 2009 at 7:52 AM, Edward Capriolo <edlinuxguru@gmail.com> wrote:
>
> >
> > Todd,
> >
> > I think this is very important. From the grid on page 78 of "Hadoop: The
> > Definitive Guide", it appears that bzip2 and zip are the only
> > formats that are splittable. As a result, bzip2 would be my format of
> > choice to compress my data. In particular, I would like to use bzip2 on
> > my Hive tables. I cannot speak to how I/O-intensive bzip2 is; however, I
> > know you can lower the compression level to trade off between
> > compression and performance.
> >
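For a rough feel of the bzip2 trade-off mentioned above, here is a minimal Python sketch. It assumes the "threshold" refers to the bzip2 level (1 to 9, which selects a 100 kB to 900 kB block size); the sample data and the `measure` helper are made up for illustration, and timings will differ on real data and hardware.

```python
import bz2
import time

# Hypothetical sample data: repetitive log-like lines, roughly 2.5 MB.
data = b"".join(
    b"2009-11-17 16:23:%02d INFO request id=%d\n" % (i % 60, i)
    for i in range(60000)
)

def measure(level):
    """Compress `data` at the given bzip2 level (1-9); return (size, seconds)."""
    start = time.perf_counter()
    compressed = bz2.compress(data, compresslevel=level)
    elapsed = time.perf_counter() - start
    # Round-trip check so numbers are only reported for valid output.
    assert bz2.decompress(compressed) == data
    return len(compressed), elapsed

results = {level: measure(level) for level in (1, 5, 9)}
for level, (size, seconds) in results.items():
    print(f"level {level}: {size} bytes in {seconds:.3f}s")
```
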
> > What other options are out there?
> >
> >
> The other options are currently:
>
> 1) Splittable LZO
>
> You need to add in some external libraries here since LZO is GPL-licensed
> and thus can't be distributed with Hadoop. I've made some scripts which you
> can use to generate packages compatible with Cloudera's distro here:
> http://github.com/toddlipcon/hadoop-lzo-packager
>
> The scripts are pretty new but there are people running the LZO code in
> production with a lot of success.
>
> Also, to make LZO splittable you have to run an indexing process across
> your data one time. I believe the README in the LZO hadoop library source
> tree explains this.
>
> 2) Use a SequenceFile container
>
> If you use SequenceFile for your data, you can turn on block compression
> and retain splittability with any codec. The downside of course is that
> you've gotta use some process to get it into this format, but once it's
> there you avoid this issue completely.
>
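To illustrate the idea behind option 1 (a one-time indexing pass that makes a block-compressed file splittable), here is a toy Python sketch. It uses zlib from the standard library rather than LZO, and the length-prefixed layout, `write_blocks`, `build_index`, and `read_block` are all hypothetical; this is not the hadoop-lzo file or index format.

```python
import zlib

def write_blocks(records, per_block=3):
    """Compress records in independent blocks, each prefixed by a 4-byte length."""
    out = bytearray()
    for i in range(0, len(records), per_block):
        payload = zlib.compress(b"\n".join(records[i:i + per_block]))
        out += len(payload).to_bytes(4, "big") + payload
    return bytes(out)

def build_index(data):
    """One pass over the compressed file: record the offset where each block starts."""
    index, pos = [], 0
    while pos < len(data):
        index.append(pos)
        length = int.from_bytes(data[pos:pos + 4], "big")
        pos += 4 + length
    return index

def read_block(data, offset):
    """Decompress only the block starting at `offset`, ignoring the rest of the file."""
    length = int.from_bytes(data[offset:offset + 4], "big")
    payload = data[offset + 4:offset + 4 + length]
    return zlib.decompress(payload).split(b"\n")

records = [b"row-%d" % i for i in range(10)]
data = write_blocks(records)
index = build_index(data)
# A split can begin at any indexed offset without reading earlier blocks.
print(read_block(data, index[2]))  # [b'row-6', b'row-7', b'row-8']
```

The point of the index is that a reader handed an arbitrary byte range can jump straight to the first block boundary inside it.
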
>
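Option 2 works without a separate index because the container itself marks block boundaries. The following Python sketch shows that general idea with a sync marker in front of each independently compressed block; the marker value, layout, and function names are assumptions for illustration and do not reproduce the actual SequenceFile on-disk format.

```python
import hashlib
import zlib

# Hypothetical 16-byte marker written before every block.
SYNC = hashlib.md5(b"example-file").digest()

def write_container(records, per_block=2):
    """Group records into blocks; write marker + 4-byte length + compressed block."""
    out = bytearray()
    for i in range(0, len(records), per_block):
        payload = zlib.compress(b"\n".join(records[i:i + per_block]))
        out += SYNC + len(payload).to_bytes(4, "big") + payload
    return bytes(out)

def read_split(data, start, end):
    """Yield records of every block whose marker begins in [start, end)."""
    pos = data.find(SYNC, start)  # scan forward to the next block boundary
    while pos != -1 and pos < end:
        header = pos + len(SYNC)
        length = int.from_bytes(data[header:header + 4], "big")
        payload = data[header + 4:header + 4 + length]
        yield from zlib.decompress(payload).split(b"\n")
        pos = data.find(SYNC, header + 4 + length)

records = [b"rec-%d" % i for i in range(6)]
data = write_container(records)
mid = len(data) // 2  # arbitrary split point, not aligned to a block
first = list(read_split(data, 0, mid))
second = list(read_split(data, mid, len(data)))
print(first + second == records)  # True
```

Because each block is compressed on its own, any codec can be used inside it, which matches the "any codec" point above. A real format also has to deal with the marker bytes appearing inside a payload by chance, which this sketch ignores.
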
> -Todd
>


-- 
p: 415.860.4347
b: www.dataspora.com/blog
t: www.twitter.com/dataspora
