hadoop-common-user mailing list archives

From Edward Capriolo <edlinuxg...@gmail.com>
Subject Re: Cloudera 18.3 splits bz2 inputs
Date Tue, 17 Nov 2009 16:33:09 GMT
@Todd

1) Splittable LZO --- 2) Use a SequenceFile container

Between the licensing, patching, and indexing, this seems very
challenging. It is also not clear how these options fit into my Hive
usage. After I generate a Hive table using LZO, what process runs the
indexing?

Take a look at this thread:

http://www.mail-archive.com/common-user@hadoop.apache.org/msg00337.html

I really like this! Easy! I set some Hive parameters, and whamo! Compression!

I think support for LZO is really great, but it seems like bz2 works
almost out of the box, sometimes. In my current setup, no compression
option works out of the box. I would rather have a slow BZ2 option
than no option.
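
For anyone weighing option 2 quoted below, here is a minimal Python sketch
of the general idea behind block compression in a container: records are
grouped into blocks, each block is compressed on its own, and a reader can
decompress any one block without reading the others. The layout (a 4-byte
length prefix per block plus a list of offsets) and the names write_blocks
and read_block are assumptions made up for illustration. This is not
Hadoop's SequenceFile format, which marks block boundaries differently.

```python
import bz2
import struct


def write_blocks(records, block_size=3):
    """Pack records into independently compressed blocks.

    Returns (data, offsets): the container bytes and the byte offset at
    which each block starts. Hypothetical layout, not SequenceFile.
    """
    out = bytearray()
    offsets = []
    for i in range(0, len(records), block_size):
        # Join one group of records and compress it as a standalone unit.
        raw = "\n".join(records[i:i + block_size]).encode("utf-8")
        packed = bz2.compress(raw)
        offsets.append(len(out))
        out += struct.pack(">I", len(packed))  # 4-byte big-endian length
        out += packed
    return bytes(out), offsets


def read_block(data, offset):
    """Decompress only the block that starts at the given offset."""
    (length,) = struct.unpack_from(">I", data, offset)
    start = offset + 4
    raw = bz2.decompress(data[start:start + length])
    return raw.decode("utf-8").split("\n")
```

Because every block is a complete compressed stream, a worker handed only
one offset can process its block alone, and swapping bz2 for another codec
would not change that property. The sketch assumes records contain no
newline characters.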

Edward



On Tue, Nov 17, 2009 at 11:23 AM, Michael E. Driscoll
<m.e.driscoll@gmail.com> wrote:
> Kevin Weil, of Twitter, has done some work extending LZO compression to work
> with Hadoop streaming.  See
>
>  http://github.com/kevinweil/hadoop-lzo
>
> MD
>
> On Tue, Nov 17, 2009 at 8:08 AM, Todd Lipcon <todd@cloudera.com> wrote:
>
>> On Tue, Nov 17, 2009 at 7:52 AM, Edward Capriolo <edlinuxguru@gmail.com> wrote:
>>
>> >
>> > Todd,
>> >
>> > I think this is very important. From the grid on page 78 of "Hadoop:
>> > The Definitive Guide", it appears that bzip2 and zip are the only
>> > formats that are splittable. As a result bzip2 would be my format of
>> > choice to compress my data. In particular I would like to use bzip2 on
>> > my Hive tables. I cannot speak to how IO intensive BZ2 is; however, I
>> > know you can lower the compression threshold to trade off between
>> > compression and performance.
>> >
>> > What other options are out there?
>> >
>> >
>> The other options are currently:
>>
>> 1) Splittable LZO
>>
>> You need to add in some external libraries here since LZO is LGPL-licensed
>> and thus can't be distributed with Hadoop. I've made some scripts which you
>> can use to generate packages compatible with Cloudera's distro here:
>> http://github.com/toddlipcon/hadoop-lzo-packager
>>
>> The scripts are pretty new but there are people running the LZO code in
>> production with a lot of success.
>>
>> Also, to make LZO splittable you have to run an indexing process across
>> your data one time. I believe the README in the LZO hadoop library source
>> tree explains this.
>>
>> 2) Use a SequenceFile container
>>
>> If you use SequenceFile for your data, you can turn on block compression
>> and retain splittability with any codec. The downside of course is that
>> you've gotta use some process to get it into this format, but once it's
>> there you avoid this issue completely.
>>
>>
>> -Todd
>>
>
>
> --
> p: 415.860.4347
> b: www.dataspora.com/blog
> t: www.twitter.com/dataspora
>
