hadoop-common-user mailing list archives

From Rohit <ro...@hortonworks.com>
Subject Re: BZip2 Splittable?
Date Fri, 24 Feb 2012 19:59:32 GMT
Hi Daniel,  

Because your MapReduce jobs will not split bzip2 files, each bzip2 file will be processed in its
entirety by a single Map task. So if your job takes multiple bzip2 text files as input, you'll
have as many Map tasks as you have files, and those tasks can run in parallel.
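
To make the difference concrete, here is a rough sketch of how the number of Map tasks could be estimated. It is a simplified model, not Hadoop's actual split logic: the function name `estimate_map_tasks` is hypothetical, the 64 MB block size is an assumption (check your own cluster's setting), and the file sizes are made up for illustration.

```python
import math

def estimate_map_tasks(file_sizes, splittable, block_size=64 * 1024 * 1024):
    """Rough count of map tasks for a list of input file sizes in bytes.

    Simplified model: block_size is an assumed value, and real split
    computation has extra rules that are ignored here.
    """
    if not splittable:
        # An unsplittable file is read start to finish by a single map task.
        return len(file_sizes)
    # A splittable file yields roughly one map task per block-sized chunk.
    return sum(max(1, math.ceil(size / block_size)) for size in file_sizes)

# Ten hypothetical 200 MB input files.
sizes = [200 * 1024 * 1024] * 10
print(estimate_map_tasks(sizes, splittable=False))  # 10
print(estimate_map_tasks(sizes, splittable=True))   # 40
```

With unsplittable input, the only way to get more parallelism is to have more (and smaller) input files.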

The Map tasks will be run by your TaskTrackers. Usually the cluster is set up with the DataNode
and TaskTracker processes running on the same machines, so with 6 data nodes you have
6 TaskTrackers.
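
As a back-of-the-envelope illustration of how those Map tasks spread over the cluster, the sketch below counts how many rounds ("waves") of tasks are needed. The function name `map_waves` is hypothetical, and the figure of 2 map slots per TaskTracker is an assumption; the real value comes from your cluster configuration (`mapred.tasktracker.map.tasks.maximum` in 0.20).

```python
import math

def map_waves(num_map_tasks, num_tasktrackers, map_slots_per_tracker=2):
    """Number of rounds needed to run all map tasks.

    map_slots_per_tracker is an assumed value; read the real one from
    the cluster configuration.
    """
    capacity = num_tasktrackers * map_slots_per_tracker
    return math.ceil(num_map_tasks / capacity)

# 30 bzip2 files (one Map task each) on 6 TaskTrackers with 2 slots each.
print(map_waves(30, 6))  # 3
```

So the work is shared across all the nodes, as long as there are enough input files to keep every slot busy.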

Hope that answers your question.


Rohit Bakhshi



www.hortonworks.com (http://www.hortonworks.com/)



On Friday, February 24, 2012 at 7:59 AM, Daniel Baptista wrote:  
> Hi Rohit, thanks for the response, this is pretty much as I expected and hopefully adds weight to my other thoughts...
>  
> Could this mean that all my datanodes are being sent all of the data, or that only one datanode is executing the job?
>  
> Thanks again, Dan.
>  
> -----Original Message-----
> From: Rohit Bakhshi [mailto:rohit@hortonworks.com]  
> Sent: 24 February 2012 15:54
> To: common-user@hadoop.apache.org (mailto:common-user@hadoop.apache.org)
> Subject: Re: BZip2 Splittable?
>  
> Daniel,  
>  
> I just noticed your Hadoop version - 0.20.2.
>  
> The JIRA fix below is for Hadoop 0.21.0, which is a different version, so bzip2 splitting may not be supported on your version of Hadoop.
>  
> --  
> Rohit Bakhshi
> www.hortonworks.com (http://www.hortonworks.com/)
>  
>  
>  
>  
> On Friday, February 24, 2012 at 7:49 AM, Rohit Bakhshi wrote:
>  
> > Hi Daniel,  
> >  
> > The bzip2 compression codec allows for splittable files.
> >  
> > According to this Hadoop JIRA improvement, splitting of bzip2 compressed files in Hadoop jobs is supported:
> > https://issues.apache.org/jira/browse/HADOOP-4012
> >  
> > --  
> > Rohit Bakhshi
> > www.hortonworks.com (http://www.hortonworks.com/)
> >  
> >  
> >  
> >  
> > On Friday, February 24, 2012 at 7:43 AM, Daniel Baptista wrote:
> >  
> > > Hi All,
> > >  
> > > I have a cluster of 6 datanodes, all running Hadoop version 0.20.2, r911707, that takes a series of bzip2 compressed text files as input.
> > >  
> > > I have read conflicting articles regarding whether or not Hadoop can split these bzip2 files. Can anyone give me a definite answer?
> > >  
> > > Thanks in advance, Dan.
>  
>  

