hadoop-user mailing list archives

From x6i4uybz labs <x6i4uyzbz.l...@gmail.com>
Subject Re: M/R, Strange behavior with multiple Gzip files
Date Thu, 06 Dec 2012 16:53:05 GMT
If 0% -> 100% jumps are common, then my job is running normally.
That's fine for me. Thanks for your answers.



On Thu, Dec 6, 2012 at 5:39 PM, Harsh J <harsh@cloudera.com> wrote:

> Ok, I can't tell about the performance of your map process, but it is
> sometimes common to see 0% -> 100% jumps in progressbars when working
> over compressed data - as the progress (in terms of data records
> processed overall) can't be perfectly determined. It might even be a
> bug recently fixed.
>
> If your counters are updating fast enough over the minute, then I'd
> assume all is well. The local job runner concerns come from the
> statements of yours that only one map seems to be running at one time,
> but perhaps that's not the case anymore?
>
> On Thu, Dec 6, 2012 at 9:55 PM, x6i4uybz labs <x6i4uyzbz.labs@gmail.com>
> wrote:
> > Thanks for your answers.
> >
> > I don't have the whole solution yet, but I know:
> >   - the job is not running on a local TT
> >   - the map process is very slow
> >   - and the progress bar is not working properly
> >
> > So, the map tasks are running in parallel (hadoop works :)) but I don't
> > understand why the progress of each map task stays at 0.
> >
> >
> >
> >
> >
> >
> > On Thu, Dec 6, 2012 at 3:48 PM, Harsh J <harsh@cloudera.com> wrote:
> >>
> >> I tend to agree with Jean-Marc's observation. If your job client logs
> >> a "LocalJobRunner" at any point, then that is most definitely your
> >> problem.
> >>
> >> Otherwise, if you feel you are facing a scheduling problem, then it
> >> may most likely be your scheduler configuration. For example,
> >> FairScheduler has a <maxMaps/> attribute over its pools that you can
> >> set to control maximum parallel use of slots for jobs using that pool,
> >> etc..
> >>
> >> On Thu, Dec 6, 2012 at 8:10 PM, x6i4uybz labs <x6i4uyzbz.labs@gmail.com>
> >> wrote:
> >> > Hello,
> >> >
> >> > The job isn't running in local mode. In fact, I think I just have a
> >> > problem with the map task progress.
> >> > The counters of each map task are OK during the job execution, whereas
> >> > the progress of each map task stays at 0%.
> >> >
> >> >
> >> >
> >> > On Thu, Dec 6, 2012 at 1:34 PM, Jean-Marc Spaggiari
> >> > <jean-marc@spaggiari.org> wrote:
> >> >>
> >> >> Hi,
> >> >>
> >> >> Have you configured mapred-site.xml to tell where the job tracker
> >> >> is? If not, your job is running with the local job runner, running
> >> >> the tasks one by one.
> >> >>
> >> >> JM
> >> >>
> >> >> PS: I faced the same issue a few weeks ago and got the exact same
> >> >> behaviour. This (above) solved the issue.
> >> >>
> >> >> 2012/12/6, x6i4uybz labs <x6i4uyzbz.labs@gmail.com>:
> >> >> > Sorry,
> >> >> >
> >> >> > I wrote an M/R job to process several gz files (about 2000). I have
> >> >> > an 80-map-slot cluster.
> >> >> > JT instantiates one map per gz file (not splittable, it's OK).
> >> >> >
> >> >> > The first 80 maps spawn. But after the "initializing" state, it seems
> >> >> > there is only one map running. When this map finishes, another one
> >> >> > starts (not 80 maps in parallel) and another is assigned to the
> >> >> > empty slot.
> >> >> >
> >> >> > I've also noticed that the first maps last more than one hour and
> >> >> > the last maps 50 seconds.
> >> >> > Each gz file is between 10 MB and 100 MB.
> >> >> >
> >> >> > I don't understand the behavior.
> >> >> > I will launch the job again to see if I have the same issue.
> >> >> >
> >> >> > thanks, gpo
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> > On Wed, Dec 5, 2012 at 6:33 PM, Harsh J <harsh@cloudera.com> wrote:
> >> >> >
> >> >> >> Your problem isn't clear in your description - can you please
> >> >> >> rephrase/redefine in terms of what you are expecting vs. what
> >> >> >> you are observing.
> >> >> >>
> >> >> >> Also note that Gzip files are not splittable by nature of their
> >> >> >> codec algorithm, and hence a TextInputFormat over plain/regular
> >> >> >> Gzip files would end up spawning and/or processing one whole
> >> >> >> Gzip file via one mapper, instead of multiple mappers per file.
> >> >> >>
> >> >> >> On Wed, Dec 5, 2012 at 9:32 PM, x6i4uybz labs
> >> >> >> <x6i4uyzbz.labs@gmail.com>
> >> >> >> wrote:
> >> >> >> > Hi everybody,
> >> >> >> >
> >> >> >> > I have an M/R job which does a bulk import to HBase.
> >> >> >> > I have to process many gzip files (2800 x ~100 MB).
> >> >> >> >
> >> >> >> > I don't understand why my job instantiates 80 maps but runs
> >> >> >> > each map sequentially, as if there were only one big gz file.
> >> >> >> >
> >> >> >> > Is there a problem in my driver? Or maybe I'm missing something.
> >> >> >> > I use "FileInputFormat.addInputPath(job, new Path(args[0]))"
> >> >> >> > where args[0] is a directory.
> >> >> >> >
> >> >> >> > Can you help me, please ?
> >> >> >> >
> >> >> >> > Thanks, Guillaume
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> --
> >> >> >> Harsh J
> >> >> >>
> >> >> >
> >> >
> >> >
> >>
> >>
> >>
> >> --
> >> Harsh J
> >
> >
>
>
>
> --
> Harsh J
>
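
The point made in the thread about 0% -> 100% jumps over compressed data can be illustrated with a small, self-contained Java sketch. This is not Hadoop's record reader. It assumes, purely for illustration, that progress is estimated as compressed bytes consumed divided by the compressed file size. Because the decompressor pulls compressed input in buffered chunks, many records are emitted between two changes of that estimate, so the reported value moves in coarse steps instead of tracking records. All names here (GzipProgressSketch, CountingInputStream, gzipLines, progressPerRecord) are hypothetical and use only the JDK.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipProgressSketch {

    /** Counts the compressed bytes pulled from the underlying stream. */
    static final class CountingInputStream extends FilterInputStream {
        long count = 0;

        CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            int b = super.read();
            if (b >= 0) {
                count++;
            }
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            if (n > 0) {
                count += n;
            }
            return n;
        }
    }

    /** Builds an in-memory gzip "file" holding the given number of text records. */
    static byte[] gzipLines(int lines) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            for (int i = 0; i < lines; i++) {
                gz.write(("record " + i + "\n").getBytes(StandardCharsets.UTF_8));
            }
        }
        return bos.toByteArray();
    }

    /**
     * Reads the records back and, after each one, notes the progress estimate
     * (compressed bytes consumed so far / total compressed size).
     */
    static List<Double> progressPerRecord(byte[] compressed) throws IOException {
        CountingInputStream counter =
                new CountingInputStream(new ByteArrayInputStream(compressed));
        List<Double> progress = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new GZIPInputStream(counter), StandardCharsets.UTF_8))) {
            while (reader.readLine() != null) {
                progress.add((double) counter.count / compressed.length);
            }
        }
        return progress;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = gzipLines(1000);
        List<Double> progress = progressPerRecord(data);
        // Far fewer distinct progress values than records: the estimate moves in steps.
        System.out.printf("records=%d distinctProgressValues=%d%n",
                progress.size(), new HashSet<>(progress).size());
    }
}
```

In this sketch the counters (records read) advance on every record while the byte-based estimate changes only occasionally, which matches the advice in the thread to judge the job by its counters rather than by the progress bar.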
