Date: Thu, 6 Dec 2012 10:57:23 +0100
Subject: Re: M/R, Strange behavior with multiple Gzip files
From: x6i4uybz labs
To: user@hadoop.apache.org
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=20cf30334b5f2960b604d02c20d6

--20cf30334b5f2960b604d02c20d6
Content-Type: text/plain; charset=ISO-8859-1

Sorry,

I wrote an M/R job to process several gz files (about 2000). I have a cluster with 80 map slots.
The JT instantiates one map per gz file (not splittable, which is OK).

The first 80 maps spawn. But after the "initializing" state, it seems that only one map is running. When this map finishes, another one starts (not 80 maps in parallel), and another is assigned to the empty slot.

I've also noticed that the first maps last more than one hour, while the last maps take 50 seconds.
Each gz file is between 10 MB and 100 MB.

I don't understand this behavior.
I will launch the job again to see if I get the same issue.

Thanks, gpo

On Wed, Dec 5, 2012 at 6:33 PM, Harsh J <harsh@cloudera.com> wrote:
> Your problem isn't clear in your description - can you please
> rephrase/redefine in terms of what you are expecting vs. what you are
> observing.
>
> Also note that Gzip files are not splittable by nature of their codec
> algorithm, and hence a TextInputFormat over plain/regular Gzip files
> would end up spawning and/or processing one whole Gzip file via one
> mapper, instead of multiple mappers per file.
>
> On Wed, Dec 5, 2012 at 9:32 PM, x6i4uybz labs <x6i4uyzbz.labs@gmail.com> wrote:
> > Hi everybody,
> >
> > I have an M/R job which does a bulk import to HBase.
> > I have to process many gzip files (2800 x ~ 100mb)
> >
> > I don't understand why my job instantiates 80 maps but runs each map
> > sequentially, as if there were only one big gz file.
> >
> > Is there a problem in my driver? Or maybe I am missing something.
> > I use "FileInputFormat.addInputPath(job, new Path(args[0]))" where args[0]
> > is a directory.
> >
> > Can you help me, please?
> >
> > Thanks, Guillaume
>
> --
> Harsh J

--20cf30334b5f2960b604d02c20d6
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
--20cf30334b5f2960b604d02c20d6--
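
Harsh's point that Gzip files are not splittable can be checked without a cluster. The following is only a minimal sketch using `java.util.zip` from the JDK, not Hadoop code and not the job discussed in this thread; the class `GzipSplitDemo` and its methods `sampleGzip` and `canDecodeFrom` are hypothetical names introduced for illustration. It compresses some sample text in memory and then tries to start decoding from the middle of the compressed bytes, which is roughly what a second mapper would have to do if the file were split.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class; not part of Hadoop or of the original job.
class GzipSplitDemo {

    // Build some sample text (an assumed stand-in for real input) and gzip it in memory.
    static byte[] sampleGzip(int lines) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines; i++) {
            sb.append("row-").append(i).append("\tsome value\n");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(sb.toString().getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Returns true if a gzip reader can decode the stream starting at 'offset'.
    static boolean canDecodeFrom(byte[] compressed, int offset) {
        ByteArrayInputStream slice =
                new ByteArrayInputStream(compressed, offset, compressed.length - offset);
        try (GZIPInputStream in = new GZIPInputStream(slice)) {
            byte[] buf = new byte[4096];
            while (in.read(buf) != -1) {
                // Drain the stream; we only care whether decoding succeeds.
            }
            return true;
        } catch (IOException e) {
            // A missing gzip header or corrupt deflate data ends up here.
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] compressed = sampleGzip(10_000);
        System.out.println("decode from start:  " + canDecodeFrom(compressed, 0));
        System.out.println("decode from middle: "
                + canDecodeFrom(compressed, compressed.length / 2));
    }
}
```

Decoding from offset 0 works, while decoding from the middle is rejected because the reader finds no gzip header there. This is why one mapper per gz file is expected, so about 2000 files would mean about 2000 map tasks, or roughly 25 waves on 80 slots if all slots were busy. The sketch does not explain why only one map appears to run at a time.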