hadoop-common-user mailing list archives

From Tim Broberg <Tim.Brob...@exar.com>
Subject RE: Input split for a streaming job!
Date Fri, 11 Nov 2011 18:53:29 GMT
Or you could use the LZO patch and get *fast* splittable compression that doesn't depend on
the bz2 generalized splittability scheme:

http://www.cloudera.com/blog/2009/06/parallel-lzo-splittable-compression-for-hadoop/
http://www.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/
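The splittable-LZO approach in those posts relies on a small side index of compressed-block start offsets, so splits can be aligned to block boundaries. A conceptual sketch of such an index — the layout assumed here (consecutive big-endian 64-bit offsets, as in hadoop-lzo's index files) is an assumption for illustration, not verified against a specific hadoop-lzo version:

```python
import struct

def write_index(offsets):
    # Pack each compressed-block start offset as a big-endian 64-bit int,
    # mimicking the side-index idea that splittable LZO uses.
    return b"".join(struct.pack(">q", off) for off in offsets)

def read_index(data):
    # Recover the list of block offsets; a split can start at any of them.
    return [struct.unpack(">q", data[i:i + 8])[0]
            for i in range(0, len(data), 8)]

idx = write_index([0, 262144, 524288])
print(read_index(idx))  # [0, 262144, 524288]
```

With such an index, a reader can seek to any recorded offset and decompress from there, which is what makes per-block splits possible without the bzip2-style scan for block markers.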

    - Tim.
________________________________________
From: bejoy.hadoop@gmail.com [bejoy.hadoop@gmail.com]
Sent: Friday, November 11, 2011 10:44 AM
To: common-user@hadoop.apache.org; Raj V; Tim Broberg
Subject: Re: Input split for a streaming job!

Hi Raj
       AFAIK 0.21 is an unstable release, and I doubt anyone would recommend it for production.
You can play around with it, but a better approach would be patching your CDH3u1 with the
required patches for splittable bzip2. Just make sure that your new patch doesn't break
anything else.

Regards
Bejoy K S

-----Original Message-----
From: Raj V <rajvish@yahoo.com>
Date: Fri, 11 Nov 2011 10:34:18
To: Tim Broberg<Tim.Broberg@exar.com>; common-user@hadoop.apache.org<common-user@hadoop.apache.org>
Reply-To: common-user@hadoop.apache.org
Subject: Re: Input split for a streaming job!

Tim

I am using CDH3 U1 (0.20.2+923), which does not have the patch.

I will try and use 0.21

Raj



>________________________________
>From: Tim Broberg <Tim.Broberg@exar.com>
>To: "common-user@hadoop.apache.org" <common-user@hadoop.apache.org>; Raj V <rajvish@yahoo.com>; Joey Echeverria <joey@cloudera.com>
>Sent: Friday, November 11, 2011 10:25 AM
>Subject: RE: Input split for a streaming job!
>
>
>
>What version of hadoop are you using?
>
>We just stumbled on the Jira item for BZIP2 splitting, and it appears to have been added in 0.21.
>
>When I diff 0.20.205 vs trunk, I see:
>
>    < public class BZip2Codec implements
>    <     org.apache.hadoop.io.compress.CompressionCodec {
>    ---
>    > @InterfaceAudience.Public
>    > @InterfaceStability.Evolving
>    > public class BZip2Codec implements SplittableCompressionCodec {
>So, it appears you need at least 0.21 to play with splittability in BZIP2.
>
>     - Tim.
>
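The version gate Tim identifies can be checked mechanically. A minimal sketch — the version strings below are examples, and the `+patchlevel` handling for vendor builds like CDH's is an assumption for illustration:

```python
def version_tuple(v):
    # "0.20.205" -> (0, 20, 205); drop any "+patchlevel" suffix
    # such as the one in CDH build strings like "0.20.2+923".
    core = v.split("+")[0]
    return tuple(int(p) for p in core.split("."))

def supports_splittable_bzip2(version):
    # SplittableCompressionCodec (and splittable BZip2Codec) landed in 0.21.
    return version_tuple(version) >= (0, 21)

print(supports_splittable_bzip2("0.20.205"))    # False
print(supports_splittable_bzip2("0.20.2+923"))  # False
print(supports_splittable_bzip2("0.21.0"))      # True
```

Tuple comparison handles the multi-part version numbers correctly, where naive string comparison of "0.20.205" vs "0.21" would not.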
>________________________________________
>From: Raj V [rajvish@yahoo.com]
>Sent: Friday, November 11, 2011 9:18 AM
>To: Joey Echeverria
>Cc: common-user@hadoop.apache.org
>Subject: Re: Input split for a streaming job!
>
>Joey,Anirudh, Bejoy
>
>I am using the TextInputFormat class (org.apache.hadoop.mapred.TextInputFormat).
>
>And the input files were created with a 32MB block size, and the files are bzip2.
>
>So all things point to my input files being splittable.
>
>I will continue poking around.
>
>- best regards
>
>Raj
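The splittability rules being worked out in this thread can be summarized in a small lookup. A sketch only — it assumes Hadoop 0.21+ semantics for bzip2 and an indexed LZO setup, which is exactly what the thread shows is version-dependent:

```python
# Whether Hadoop can split a file across multiple mappers, keyed by
# extension. Assumptions: Hadoop >= 0.21 for bzip2, hadoop-lzo with a
# prebuilt index for .lzo; gzip is never splittable.
SPLITTABLE = {
    ".gz":  False,  # gzip: one mapper per file, always
    ".bz2": True,   # splittable, but only on Hadoop >= 0.21
    ".lzo": True,   # splittable only after indexing with hadoop-lzo
}

def is_splittable(filename):
    for ext, ok in SPLITTABLE.items():
        if filename.endswith(ext):
            return ok
    return True  # uncompressed files split on block boundaries

print(is_splittable("part-00000.gz"))   # False
print(is_splittable("part-00000.bz2"))  # True
```

This is the crux of Raj's problem: the `.bz2` row is only true on 0.21+, so on CDH3B3/U1 the effective answer for his files is still False.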
>
>
>
>>________________________________
>>From: Joey Echeverria <joey@cloudera.com>
>>To: Raj V <rajvish@yahoo.com>
>>Sent: Friday, November 11, 2011 2:56 AM
>>Subject: Re: Input split for a streaming job!
>>
>>U1 should be able to split the bzip2 files. What input format are you using?
>>
>>-Joey
>>
>>On Thu, Nov 10, 2011 at 9:06 PM, Raj V <rajvish@yahoo.com> wrote:
>>> Sorry to bother you offline.
>>> From the release notes for CDH3U1
>>> ( http://archive.cloudera.com/cdh/3/hadoop-0.20.2+923.97.releasenotes.html)
>>> I understand that splitting of bzip2 files is available.
>>> But returning to my old problem, I still see 73 mappers. Did I misunderstand
>>> something?
>>> If necessary, I can re-post the mail to the group.
>>>
>>> ________________________________
>>> From: Joey Echeverria <joey@cloudera.com>
>>> To: rajvish@yahoo.com
>>> Sent: Thursday, November 10, 2011 3:11 PM
>>> Subject: Re: Input split for a streaming job!
>>>
>>> No problem. Out of curiosity, why are you still using B3?
>>>
>>> -Joey
>>>
>>> On Thu, Nov 10, 2011 at 6:07 PM, Raj V <rajvish@yahoo.com> wrote:
>>>> Joey
>>>> I think I know the answer. I am using CDH3B3 (0.20.2+737), and this does
>>>> not seem to support bzip2 splitting. I should have looked before shooting
>>>> off the email :-(
>>>> To answer your second question, I created a completely new set of input
>>>> files with dfs.block.size=32MB and used this as the input data
>>>> Raj
>>>>
>>>>
>>>> ________________________________
>>>> From: Joey Echeverria <joey@cloudera.com>
>>>> To: cdh-user@cloudera.org
>>>> Sent: Thursday, November 10, 2011 3:02 PM
>>>> Subject: Re: Input split for a streaming job!
>>>>
>>>> It depends on the version of hadoop that you're using. Also, when you
>>>> changed the block size, did you do it on the actual files, or just the
>>>> default for new files?
>>>>
>>>> -Joey
>>>>
>>>> On Thu, Nov 10, 2011 at 5:52 PM, Raj V <rajvish@yahoo.com> wrote:
>>>>> Hi Joey,
>>>>> I always thought bzip was splittable.
>>>>> Raj
>>>>>
>>>>> ________________________________
>>>>> From: Joey Echeverria <joey@cloudera.com>
>>>>> To: cdh-user@cloudera.org
>>>>> Sent: Thursday, November 10, 2011 2:43 PM
>>>>> Subject: Re: Input split for a streaming job!
>>>>>
>>>>> Gzip and bzip2 compressed files aren't splittable, so you'll always
>>>>> get one mapper per file.
>>>>>
>>>>> -Joey
>>>>>
>>>>> On Thu, Nov 10, 2011 at 5:40 PM, Raj V <rajvish@yahoo.com> wrote:
>>>>>> All
>>>>>> I assumed that the input splits for a streaming job would follow the same
>>>>>> logic as a Java map-reduce job, but I seem to be wrong.
>>>>>> I started out with 73 gzipped files that vary between 23MB and 255MB in
>>>>>> size. My default block size was 128MB. 8 of the 73 files are larger than
>>>>>> 128MB.
>>>>>> When I ran my streaming job, it ran, as expected, 73 mappers (no
>>>>>> reducers for this job).
>>>>>> Since I have 128 nodes in my cluster, I thought I would use more systems
>>>>>> in the cluster by increasing the number of mappers. I changed all the
>>>>>> gzip files into bzip2 files. I expected the number of mappers to increase
>>>>>> to 81. The mappers remained at 73.
>>>>>> I tried a second experiment: I changed my dfs.block.size to 32MB. That
>>>>>> should have increased my mappers to about ~250. It remains steadfast at
>>>>>> 73.
>>>>>> Is my understanding wrong? With a smaller block size and bzipped files,
>>>>>> should I not get more mappers?
>>>>>> Raj
>>>>>>
>>>>>>
>>>>>>
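The mapper counts Raj expects follow from simple split arithmetic: one mapper per file when the codec is not splittable, roughly one per block when it is. A rough sketch — the file sizes below are illustrative, chosen only to match the "73 files, 8 of them over 128MB" shape of the thread, not Raj's actual data:

```python
import math

def expected_mappers(file_sizes_mb, block_mb, splittable):
    """One mapper per split: the whole file if the format is not
    splittable, otherwise roughly one split per block."""
    if not splittable:
        return len(file_sizes_mb)
    return sum(max(1, math.ceil(s / block_mb)) for s in file_sizes_mb)

# Illustrative data: 73 files, 8 of them larger than 128 MB.
sizes = [23] * 65 + [255] * 8

print(expected_mappers(sizes, 128, splittable=False))  # 73
print(expected_mappers(sizes, 128, splittable=True))   # 81 (65 + 8*2)
print(expected_mappers(sizes, 32, splittable=True))    # 129 (65 + 8*8)
```

Note also Joey's later point: lowering the dfs.block.size default only affects newly written files, so the calculation with the smaller block size applies only if the input files were actually rewritten at that block size.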
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Joseph Echeverria
>>>>> Cloudera, Inc.
>>>>> 443.305.9434
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Joseph Echeverria
>>>> Cloudera, Inc.
>>>> 443.305.9434
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Joseph Echeverria
>>> Cloudera, Inc.
>>> 443.305.9434
>>>
>>>
>>>
>>
>>
>>
>>--
>>Joseph Echeverria
>>Cloudera, Inc.
>>443.305.9434
>>
>>
>>
>>________________________________
> The information and any attached documents contained in this message
>may be confidential and/or legally privileged. The message is
>intended solely for the addressee(s). If you are not the intended
>recipient, you are hereby notified that any use, dissemination, or
>reproduction is strictly prohibited and may be unlawful. If you are
>not the intended recipient, please contact the sender immediately by
>return e-mail and destroy all copies of the original message.
>
>
>

