From: x6i4uybz labs
To: user@hadoop.apache.org
Date: Thu, 6 Dec 2012 15:40:51 +0100
Subject: Re: M/R, Strange behavior with multiple Gzip files

Hello,

The job isn't running in local mode. In fact, I think I just have a problem
with the map task progress reporting. The counters of each map task are OK
during the job execution, whereas the progress of each map task stays at 0%.

On Thu, Dec 6, 2012 at 1:34 PM, Jean-Marc Spaggiari wrote:
> Hi,
>
> Have you configured mapred-site.xml to tell where the job tracker
> is? If not, your job is running on the local job tracker, running the
> tasks one by one.
>
> JM
>
> PS: I faced the same issue a few weeks ago and got the exact same
> behaviour. This (above) solved the issue.
>
> 2012/12/6, x6i4uybz labs:
> > Sorry,
> >
> > I wrote an M/R job to process several gz files (about 2000). I have an
> > 80-map-slot cluster.
> > The JT instantiates one map per gz file (not splittable, that's OK).
> >
> > The first 80 maps spawn. But after the "initializing" state, it seems
> > there is only one map running. When this map finishes, another one
> > starts (not 80 maps in parallel) and another is assigned to the empty
> > slot.
> >
> > I've also noticed that the first maps last more than one hour and the
> > last maps 50 seconds.
> > Each gz file is between 10 MB and 100 MB.
> >
> > I don't understand the behavior.
> > I will launch the job again to see if I have the same issue.
> >
> > thanks, gpo
> >
> > On Wed, Dec 5, 2012 at 6:33 PM, Harsh J wrote:
> >
> >> Your problem isn't clear in your description - can you please
> >> rephrase/redefine in terms of what you are expecting vs. what you are
> >> observing.
> >>
> >> Also note that Gzip files are not splittable by nature of their codec
> >> algorithm, and hence a TextInputFormat over plain/regular Gzip files
> >> would end up spawning and/or processing one whole Gzip file via one
> >> mapper, instead of multiple mappers per file.
> >>
> >> On Wed, Dec 5, 2012 at 9:32 PM, x6i4uybz labs wrote:
> >> > Hi everybody,
> >> >
> >> > I have an M/R job which does a bulk import to HBase.
> >> > I have to process many gzip files (2800 x ~100 MB).
> >> >
> >> > I don't understand why my job instantiates 80 maps but runs each map
> >> > sequentially, as if there were only one big gz file.
> >> >
> >> > Is there a problem in my driver? Or maybe I'm missing something.
> >> > I use "FileInputFormat.addInputPath(job, new Path(args[0]))" where
> >> > args[0] is a directory.
> >> >
> >> > Can you help me, please?
> >> >
> >> > Thanks, Guillaume
> >>
> >> --
> >> Harsh J

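As a side note to Harsh J's remark in this thread that Gzip files are not splittable: the point can be illustrated with a small standalone sketch. It uses only java.util.zip from the JDK, not Hadoop, and it is not taken from the thread. The class and method names (GzipSplitDemo, gzip, canDecodeFrom) are hypothetical and chosen for illustration only. The sketch compresses some text and then tries to decode it twice, once from byte 0 and once from the middle of the compressed bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class: not part of Hadoop or of the original thread.
public class GzipSplitDemo {

    // Compress a string into a single gzip stream held in memory.
    public static byte[] gzip(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Try to decode the gzip bytes starting at the given offset.
    // Returns false if the decoder rejects the data (missing gzip header,
    // corrupt deflate data, truncated stream, and so on).
    public static boolean canDecodeFrom(byte[] gz, int offset) {
        try (GZIPInputStream in = new GZIPInputStream(
                new ByteArrayInputStream(gz, offset, gz.length - offset))) {
            byte[] chunk = new byte[4096];
            while (in.read(chunk) != -1) {
                // Drain the stream; only success or failure matters here.
            }
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            text.append("record ").append(i).append('\n');
        }
        byte[] gz = gzip(text.toString());

        System.out.println("decode from start:  " + canDecodeFrom(gz, 0));
        System.out.println("decode from middle: " + canDecodeFrom(gz, gz.length / 2));
    }
}
```

Decoding from the start should succeed, while decoding from the middle is expected to fail because a reader that starts there sees neither the gzip header nor the preceding compressed state. That is the reason a plain .gz input is read as a whole by one consumer rather than being cut into independent pieces.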