hadoop-common-user mailing list archives

From Edward Capriolo <edlinuxg...@gmail.com>
Subject Re: Cloudera 18.3 splits bz2 inputs
Date Tue, 17 Nov 2009 15:52:01 GMT
On Tue, Nov 17, 2009 at 3:36 AM, Usman Waheed <usmanw@opera.com> wrote:
> Thanks Todd,
> -Usman
>>
>> Hi Usman/Mike,
>>
>> This feature is slated for 0.21 (not 0.20.1)
>>
>> We have not backported it into Cloudera's release of 0.20.1, though we'll
>> certainly consider doing so if there appears to be demand for it in the
>> community. Anecdotally we've seen that not too many people are using bzip2
>> since the CPU overhead is high enough that it's not worth the space
>> savings.
>>
>> -Todd
>>
>> On Sat, Nov 14, 2009 at 10:30 AM, Mike Kendall <mkendall@justin.tv> wrote:
>>
>>
>>>
>>> it's gonna be in 20.1...  :(
>>>
>>> On Sat, Nov 14, 2009 at 12:34 AM, Usman Waheed <usmanw@opera.com> wrote:
>>>
>>>
>>>>
>>>> Hi,
>>>>
>>>> I was under the impression that Cloudera's 18.3 can split bz2 input logs
>>>> during the map phase, is that not so?
>>>> As of now i see each bz2 file being processed in one entire map task in
>>>> my running jobs.
>>>> Maybe i am missing something here.
>>>>
>>>> Thanks,
>>>> Usman
>>>>
>>>> --
>>>> Using Opera's revolutionary e-mail client: http://www.opera.com/mail/
>>>>
>>>>
>>
>>
>
>

Todd,

I think this is very important. From the grid in "Hadoop: The
Definitive Guide" (p. 78), it appears that bzip2 and zip are the only
formats that are splittable. As a result, bzip2 would be my format of
choice to compress my data. In particular, I would like to use bzip2 on
my Hive tables. I cannot speak to how IO intensive bzip2 is; however, I
know you can lower the compression threshold to trade off between
compression and performance.
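
To make the trade-off concrete, here is a rough sketch of how one might measure it. It assumes the "compression threshold" refers to bzip2's block-size level (1 to 9), and it uses Python's standard bz2 module on synthetic log lines rather than Hadoop itself. The helper names and the sample data are made up for illustration, so treat the numbers it prints as indicative only.

```python
import bz2
import time


def make_sample_log(n_lines: int) -> bytes:
    """Build a synthetic, repetitive access log (hypothetical data)."""
    return b"".join(
        b"2009-11-17 15:52:01 GET /index.html?id=%d 200\n" % i
        for i in range(n_lines)
    )


def compare_levels(data: bytes, levels=(1, 5, 9)):
    """Return {level: (compressed_size_bytes, seconds)} for each bzip2 level."""
    results = {}
    for level in levels:
        start = time.perf_counter()
        compressed = bz2.compress(data, compresslevel=level)
        elapsed = time.perf_counter() - start
        # Sanity check: the data must survive a round trip.
        assert bz2.decompress(compressed) == data
        results[level] = (len(compressed), elapsed)
    return results


if __name__ == "__main__":
    sample = make_sample_log(200_000)
    print(f"original: {len(sample)} bytes")
    for level, (size, secs) in compare_levels(sample).items():
        print(f"level {level}: {size} bytes in {secs:.3f}s")
```

Running this against a sample of your own logs, instead of the synthetic lines, should give a better idea of whether a lower level is worth it for your data.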

What other options are out there?

Edward
