hadoop-common-user mailing list archives

From Bharath Mundlapudi <bharathw...@yahoo.com>
Subject Re: Cluster Tuning
Date Fri, 08 Jul 2011 19:21:19 GMT
Slow start is an important parameter and definitely impacts job runtime. My experience in the
past has been that setting this parameter too low or too high can hurt job latencies. If you
are always running the same job, it's easy to pick the right value, but if your cluster is
multi-tenant, getting it right requires benchmarking different workloads concurrently.

But your case is interesting: you are running on a single core (how many disks per node?). So
setting it toward the higher end of the spectrum, as Joey suggested, makes sense.


-Bharath





________________________________
From: Joey Echeverria <joey@cloudera.com>
To: common-user@hadoop.apache.org
Sent: Friday, July 8, 2011 9:14 AM
Subject: Re: Cluster Tuning

Set mapred.reduce.slowstart.completed.maps to a number close to 1.0.
1.0 means the maps have to completely finish before the reduce starts
copying any data. I often run jobs with this set to .90-.95.
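
For reference, a sketch of how Joey's suggestion might look in mapred-site.xml (the 0.95 value is one of the figures he mentions, not a universal recommendation):

```xml
<!-- Reducers start fetching map output only after 95% of maps complete. -->
<property>
  <name>mapred.reduce.slowstart.completed.maps</name>
  <value>0.95</value>
</property>
```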

-Joey

On Fri, Jul 8, 2011 at 11:25 AM, Juan P. <gordoslocos@gmail.com> wrote:
> Here's another thought. I realized that the reduce operation in my
> map/reduce jobs is a flash, but it goes really slowly until the
> mappers end. Is there a way to configure the cluster to make the reduce wait
> for the map operations to complete? Especially considering my hardware
> constraints.
>
> Thanks!
> Pony
>
> On Fri, Jul 8, 2011 at 11:41 AM, Juan P. <gordoslocos@gmail.com> wrote:
>
>> Hey guys,
>> Thanks all of you for your help.
>>
>> Joey,
>> I tweaked my MapReduce to serialize/deserialize only essential values and
>> added a combiner, and that helped a lot. Previously I had a domain object
>> being passed between Mapper and Reducer when I only needed a
>> single value.
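
A minimal illustration of the combiner idea Pony describes (hostnames and byte counts are hypothetical, and the local aggregation is shown in plain Java rather than against the Hadoop API): pre-summing (host, bytes) pairs inside each map task means the shuffle carries one partial sum per host per mapper instead of one record per log line.

```java
import java.util.HashMap;
import java.util.Map;

public class CombinerSketch {
    // Local aggregation, as a combiner would do on one map task's output.
    static Map<String, Long> combine(String[][] records) {
        Map<String, Long> partial = new HashMap<>();
        for (String[] rec : records) {
            // rec[0] = host, rec[1] = bytes for one log line.
            partial.merge(rec[0], Long.parseLong(rec[1]), Long::sum);
        }
        return partial;
    }

    public static void main(String[] args) {
        String[][] mapOutput = {
            {"hosta.example.com", "100"},
            {"hostb.example.com", "50"},
            {"hosta.example.com", "200"},
        };
        // Three records collapse to two partial sums before the shuffle.
        Map<String, Long> partial = combine(mapOutput);
        System.out.println(partial.size() + " hosts after combining");
    }
}
```

Because the reduce here is a pure sum, the same class can serve as both combiner and reducer, which is why adding it is nearly free.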
>>
>> Esteban,
>> I think you underestimate the constraints of my cluster. Running multiple
>> tasks per JVM really kills me in terms of memory. Not to mention that with
>> a single core there's not much to gain in terms of parallelism (other
>> than perhaps while a process is waiting on an I/O operation). Still, I gave
>> it a shot, but even though I kept changing the config I always ended up with a
>> Java heap space error.
>>
>> Is it me, or is performance tuning mostly a per-job task? I mean it will, in
>> the end, depend on the data you are processing (structure, size, whether
>> it's in one file or many, etc.). If my jobs have different sets of data,
>> which are in different formats and organized in different file structures,
>> do you guys recommend moving some of the configuration to Java code?
>>
>> Thanks!
>> Pony
>>
>> On Thu, Jul 7, 2011 at 7:25 PM, Ceriasmex <ceriasmex@gmail.com> wrote:
>>
>>> Are you the Esteban I know?
>>>
>>>
>>>
>>> On 07/07/2011, at 15:53, Esteban Gutierrez <esteban@cloudera.com>
>>> wrote:
>>>
>>> > Hi Pony,
>>> >
>>> > There is a good chance that your boxes are doing some heavy swapping, and
>>> > that is a killer for Hadoop.  Have you tried
>>> > mapred.job.reuse.jvm.num.tasks=-1 and limiting the heap on those
>>> > boxes as much as possible?
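>>> >

A sketch of the settings Esteban suggests, as they might appear in mapred-site.xml (the 200m heap cap is an assumed value for 600MB nodes, not a figure from this thread):

```xml
<!-- Reuse one JVM for an unlimited number of tasks per job (-1 = no limit). -->
<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>-1</value>
</property>
<!-- Cap each child task's heap; 200m is a hypothetical value for 600MB boxes. -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx200m</value>
</property>
```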
>>> >
>>> > Cheers,
>>> > Esteban.
>>> >
>>> > --
>>> > Get Hadoop!  http://www.cloudera.com/downloads/
>>> >
>>> >
>>> >
>>> > On Thu, Jul 7, 2011 at 1:29 PM, Juan P. <gordoslocos@gmail.com> wrote:
>>> >
>>> >> Hi guys!
>>> >>
>>> >> I'd like some help fine tuning my cluster. I currently have 20 boxes
>>> >> exactly
>>> >> alike. Single core machines with 600MB of RAM. No chance of upgrading
>>> the
>>> >> hardware.
>>> >>
>>> >> My cluster is made out of 1 NameNode/JobTracker box and 19
>>> >> DataNode/TaskTracker boxes.
>>> >>
>>> >> All my config is default except I've set the following in my
>>> >> mapred-site.xml
>>> >> in an effort to try to prevent choking my boxes.
>>> >> <property>
>>> >>   <name>mapred.tasktracker.map.tasks.maximum</name>
>>> >>   <value>1</value>
>>> >> </property>
>>> >>
>>> >> I'm running a MapReduce job which reads a proxy server log file (2GB),
>>> >> maps hosts to each record, and then in the reduce task accumulates the
>>> >> amount of bytes received from each host.
>>> >>
>>> >> Currently it's producing about 65000 keys.
>>> >>
>>> >> The whole job takes forever to complete, especially the reduce part. I've
>>> >> tried different tuning configs but I can't bring it down under 20 mins.
>>> >>
>>> >> Any ideas?
>>> >>
>>> >> Thanks for your help!
>>> >> Pony
>>> >>
>>>
>>
>>
>



-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434