hadoop-common-user mailing list archives

From "Juan P." <gordoslo...@gmail.com>
Subject Re: Cluster Tuning
Date Fri, 08 Jul 2011 14:41:16 GMT
Hey guys,
Thanks all of you for your help.

Joey,
I tweaked my MapReduce to serialize/deserialize only essential values and
added a combiner, and that helped a lot. Previously I had a domain object
being passed between Mapper and Reducer when I only needed a single value.
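In rough terms the change looked like this (a simplified sketch, not my
exact code; the class names and the log-line layout in the mapper are made
up for illustration):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Emits only the single value the reducer needs (bytes per host),
// instead of serializing a whole domain object per record.
class HostBytesMapper
    extends Mapper<LongWritable, Text, Text, LongWritable> {
  private final Text host = new Text();
  private final LongWritable bytes = new LongWritable();

  @Override
  protected void map(LongWritable key, Text line, Context context)
      throws IOException, InterruptedException {
    // Hypothetical layout: whitespace-separated, host in field 0,
    // bytes received in field 1.
    String[] fields = line.toString().split("\\s+");
    if (fields.length < 2) return; // skip malformed lines
    host.set(fields[0]);
    bytes.set(Long.parseLong(fields[1]));
    context.write(host, bytes);
  }
}

// Summing is associative, so the same class works as combiner and
// reducer; the combiner pre-aggregates on the map side and shrinks
// what gets shuffled to the reducers.
class SumReducer
    extends Reducer<Text, LongWritable, Text, LongWritable> {
  private final LongWritable total = new LongWritable();

  @Override
  protected void reduce(Text host, Iterable<LongWritable> values,
      Context context) throws IOException, InterruptedException {
    long sum = 0;
    for (LongWritable v : values) {
      sum += v.get();
    }
    total.set(sum);
    context.write(host, total);
  }
}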

Esteban,
I think you underestimate the constraints of my cluster. Running multiple
tasks per JVM really kills me in terms of memory. Not to mention that with
a single core there's not much to gain in terms of parallelism (other than
perhaps while a process is waiting on an I/O operation). Still, I gave it a
shot, but even though I kept changing the config I always ended up with a
Java heap space error.
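For the record, this is roughly what I had in mapred-site.xml while
experimenting (the -Xmx value is just one of several I tried):

<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>-1</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx200m</value>
</property>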

Is it just me, or is performance tuning mostly a per-job task? I mean, in
the end it will depend on the data you are processing (structure, size,
whether it's in one file or many, etc.). If my jobs have different sets of
data, in different formats and organized in different file structures, do
you recommend moving some of the configuration to Java code?
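To make the question concrete: instead of cluster-wide mapred-site.xml
entries, I mean setting things per job in the driver, along these lines
(a sketch only; HostBytesDriver and the property values are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HostBytesDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Per-job overrides instead of cluster-wide settings:
    conf.setInt("mapred.job.reuse.jvm.num.tasks", 1); // no JVM reuse here
    conf.set("mapred.child.java.opts", "-Xmx200m");   // cap the task heap

    Job job = new Job(conf, "host-bytes");
    job.setJarByClass(HostBytesDriver.class);
    job.setMapperClass(HostBytesMapper.class);   // from the earlier sketch
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}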

Thanks!
Pony

On Thu, Jul 7, 2011 at 7:25 PM, Ceriasmex <ceriasmex@gmail.com> wrote:

> Are you the Esteban I know?
>
>
>
> On 07/07/2011, at 15:53, Esteban Gutierrez <esteban@cloudera.com>
> wrote:
>
> > Hi Pony,
> >
> > There is a good chance that your boxes are doing some heavy swapping, and
> > that is a killer for Hadoop.  Have you tried
> > mapred.job.reuse.jvm.num.tasks=-1 and limiting the heap on those boxes as
> > much as possible?
> >
> > Cheers,
> > Esteban.
> >
> > --
> > Get Hadoop!  http://www.cloudera.com/downloads/
> >
> >
> >
> > On Thu, Jul 7, 2011 at 1:29 PM, Juan P. <gordoslocos@gmail.com> wrote:
> >
> >> Hi guys!
> >>
> >> I'd like some help fine-tuning my cluster. I currently have 20 boxes,
> >> exactly alike: single-core machines with 600MB of RAM. No chance of
> >> upgrading the hardware.
> >>
> >> My cluster is made out of 1 NameNode/JobTracker box and 19
> >> DataNode/TaskTracker boxes.
> >>
> >> All my config is default, except I've set the following in my
> >> mapred-site.xml in an effort to avoid choking my boxes:
> >> <property>
> >>   <name>mapred.tasktracker.map.tasks.maximum</name>
> >>   <value>1</value>
> >> </property>
> >>
> >> I'm running a MapReduce job which reads a proxy server log file (2GB),
> >> maps hosts to each record, and then in the reduce task accumulates the
> >> number of bytes received from each host.
> >>
> >> Currently it's producing about 65000 keys.
> >>
> >> The whole job takes forever to complete, especially the reduce part. I've
> >> tried different tuning configs but I can't bring it down under 20 mins.
> >>
> >> Any ideas?
> >>
> >> Thanks for your help!
> >> Pony
> >>
>
