hadoop-common-user mailing list archives

From bejoy.had...@gmail.com
Subject Re: Input split for a streaming job!
Date Fri, 11 Nov 2011 18:44:17 GMT
Hi Raj
       AFAIK 0.21 is an unstable release, and I doubt anyone would recommend it for production.
You can play around with it, but a better approach would be to patch your CDH3u1 with the
patches required for splittable BZip2. Just make sure the new patches don't break anything
else.
 
Regards
Bejoy K S

-----Original Message-----
From: Raj V <rajvish@yahoo.com>
Date: Fri, 11 Nov 2011 10:34:18 
To: Tim Broberg<Tim.Broberg@exar.com>; common-user@hadoop.apache.org<common-user@hadoop.apache.org>
Reply-To: common-user@hadoop.apache.org
Subject: Re: Input split for a streaming job!

Tim

I am using CDH3 U1 (0.20.2+923), which does not have the patch.

I will try 0.21.

Raj



>________________________________
>From: Tim Broberg <Tim.Broberg@exar.com>
>To: "common-user@hadoop.apache.org" <common-user@hadoop.apache.org>; Raj V <rajvish@yahoo.com>; Joey Echeverria <joey@cloudera.com>
>Sent: Friday, November 11, 2011 10:25 AM
>Subject: RE: Input split for a streaming job!
>
>
> 
>What version of hadoop are you using?
> 
>We just stumbled on the Jira item for BZIP2 splitting, and it appears to have been added in 0.21.
> 
>When I diff 0.20.205 vs trunk, I see
>< public class BZip2Codec implements
><     org.apache.hadoop.io.compress.CompressionCodec {
>---
>> @InterfaceAudience.Public
>> @InterfaceStability.Evolving
>> public class BZip2Codec implements SplittableCompressionCodec {
>So, it appears you need at least 0.21 to play with splittability in BZIP2. 
> 
>     - Tim.
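
As background on why a bzip2 codec can be made splittable at all, here is a minimal Python sketch (standard library only, not Hadoop code). It assumes the well-known bzip2 layout in which every compressed block starts with the 48-bit marker 0x314159265359, not necessarily on a byte boundary. The helper name count_block_markers and the sample data are hypothetical, chosen only for illustration.

```python
import bz2
import random

# 48-bit marker that begins each bzip2 compressed block (the BCD digits of pi).
BLOCK_MAGIC = format(0x314159265359, "048b")


def count_block_markers(compressed):
    """Count bit-level occurrences of the bzip2 block-start marker."""
    # Render the whole stream as a bit string, keeping leading zero bits.
    bits = bin(int.from_bytes(compressed, "big"))[2:].zfill(len(compressed) * 8)
    return bits.count(BLOCK_MAGIC)


# Incompressible sample data, large enough to span several blocks at
# compresslevel=1 (roughly 100 kB of input per block).
rng = random.Random(0)
data = bytes(rng.getrandbits(8) for _ in range(250_000))
compressed = bz2.compress(data, compresslevel=1)

print(count_block_markers(compressed))
```

A reader that starts in the middle of a file can scan forward to the next such marker and begin decompressing there, which is presumably what a SplittableCompressionCodec implementation builds on. Gzip has no comparable per-block marker.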
>
>________________________________________
>From: Raj V [rajvish@yahoo.com]
>Sent: Friday, November 11, 2011 9:18 AM
>To: Joey Echeverria
>Cc: common-user@hadoop.apache.org
>Subject: Re: Input split for a streaming job!
>
>Joey, Anirudh, Bejoy
>
>I am using the TextInputFormat class (org.apache.hadoop.mapred.TextInputFormat).
>
>The input files were created with a 32 MB block size, and the files are bzip2-compressed.
>
>So everything points to my input files being splittable.
>
>I will continue poking around.
>
>- best regards
>
>Raj
>
>
>
>>________________________________
>>From: Joey Echeverria <joey@cloudera.com>
>>To: Raj V <rajvish@yahoo.com>
>>Sent: Friday, November 11, 2011 2:56 AM
>>Subject: Re: Input split for a streaming job!
>>
>>U1 should be able to split the bzip2 files. What input format are you using?
>>
>>-Joey
>>
>>On Thu, Nov 10, 2011 at 9:06 PM, Raj V <rajvish@yahoo.com> wrote:
>>> Sorry to bother you offline.
>>> From the release notes for CDH3U1
>>> ( http://archive.cloudera.com/cdh/3/hadoop-0.20.2+923.97.releasenotes.html)
>>> I understand that split of the bzip files was available.
>>> But returning to my old problem I still see 73 mappers. Did I misunderstand
>>> something?
>>> If necessary, I can re-post the mail to the group.
>>>
>>> ________________________________
>>> From: Joey Echeverria <joey@cloudera.com>
>>> To: rajvish@yahoo.com
>>> Sent: Thursday, November 10, 2011 3:11 PM
>>> Subject: Re: Input split for a streaming job!
>>>
>>> No problem. Out of curiosity, why are you still using B3?
>>>
>>> -Joey
>>>
>>> On Thu, Nov 10, 2011 at 6:07 PM, Raj V <rajvish@yahoo.com> wrote:
>>>> Joey
>>>> I think I know the answer. I am using CDH3B3 (0.20.2+737) and this does not
>>>> seem to support bzip splitting. I should have looked before shooting off the
>>>> email :-(
>>>> To answer your second question, I created a completely new set of input
>>>> files with dfs.block.size=32MB and used this as the input data.
>>>> Raj
>>>>
>>>>
>>>> ________________________________
>>>> From: Joey Echeverria <joey@cloudera.com>
>>>> To: cdh-user@cloudera.org
>>>> Sent: Thursday, November 10, 2011 3:02 PM
>>>> Subject: Re: Input split for a streaming job!
>>>>
>>>> It depends on the version of hadoop that you're using. Also, when you
>>>> changed the block size, did you do it on the actual files, or just the
>>>> default for new files?
>>>>
>>>> -Joey
>>>>
>>>> On Thu, Nov 10, 2011 at 5:52 PM, Raj V <rajvish@yahoo.com> wrote:
>>>>> Hi Joey,
>>>>> I always thought bzip was splittable.
>>>>> Raj
>>>>>
>>>>> ________________________________
>>>>> From: Joey Echeverria <joey@cloudera.com>
>>>>> To: cdh-user@cloudera.org
>>>>> Sent: Thursday, November 10, 2011 2:43 PM
>>>>> Subject: Re: Input split for a streaming job!
>>>>>
>>>>> Gzip and bzip2 compressed files aren't splittable, so you'll always
>>>>> get one mapper per file.
>>>>>
>>>>> -Joey
>>>>>
>>>>> On Thu, Nov 10, 2011 at 5:40 PM, Raj V <rajvish@yahoo.com> wrote:
>>>>>> All
>>>>>> I assumed that the input splits for a streaming job would follow the same
>>>>>> logic as a Java MapReduce job, but I seem to be wrong.
>>>>>> I started out with 73 gzipped files that vary between 23 MB and 255 MB in
>>>>>> size. My default block size was 128 MB. 8 of the 73 files are larger than
>>>>>> 128 MB.
>>>>>> When I ran my streaming job, it ran, as expected, 73 mappers (no reducers
>>>>>> for this job).
>>>>>> Since I have 128 nodes in my cluster, I thought I would use more systems in
>>>>>> the cluster by increasing the number of mappers. I changed all the gzip
>>>>>> files into bzip2 files. I expected the number of mappers to increase to 81.
>>>>>> The mappers remained at 73.
>>>>>> I tried a second experiment: I changed my dfs.block.size to 32 MB. That
>>>>>> should have increased my mappers to about 250. It remains steadfast at 73.
>>>>>> Is my understanding wrong? With a smaller block size and bzipped files,
>>>>>> should I not get more mappers?
>>>>>> Raj
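
The expectation above (73 mappers, then 81, then a few hundred) can be reproduced with a rough Python sketch. It assumes one split per block-sized chunk for splittable input and one split per file otherwise. It ignores details of the real split calculation, such as minimum split size and slack on the last chunk. The file sizes are hypothetical stand-ins, since the thread only gives the range (23 MB to 255 MB) and says that 8 of the 73 files exceed 128 MB.

```python
import math

MB = 1024 * 1024


def estimate_mappers(file_sizes, block_size, splittable):
    """Rough mapper count: one per file if the input cannot be split,
    otherwise one per block-sized chunk of each file."""
    if not splittable:
        return len(file_sizes)
    return sum(max(1, math.ceil(size / block_size)) for size in file_sizes)


# Hypothetical sizes: 65 files below 128 MB and 8 files above it.
sizes = [100 * MB] * 65 + [200 * MB] * 8

# gzip, or bzip2 on a version without splitting support: one mapper per file.
print(estimate_mappers(sizes, 128 * MB, splittable=False))  # 73

# Splittable bzip2 with 128 MB blocks: the 8 large files yield 2 splits each.
print(estimate_mappers(sizes, 128 * MB, splittable=True))   # 81

# Splittable bzip2 with 32 MB blocks: the count depends on the real sizes.
print(estimate_mappers(sizes, 32 * MB, splittable=True))
```

If the job still reports 73 mappers after the conversion to bzip2, the first branch is the one in effect, meaning the codec in use is not being treated as splittable.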
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Joseph Echeverria
>>>>> Cloudera, Inc.
>>>>> 443.305.9434
>>>>>
>>>>>
>>>>>