hadoop-common-user mailing list archives

From Sandy <snickerdoodl...@gmail.com>
Subject Re: wordcount getting slower with more mappers and reducers?
Date Thu, 05 Mar 2009 17:22:39 GMT
I specified a directory containing my 428MB file split into 8 files. Same
results.
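
In case anyone wants to reproduce the setup, here is a minimal sketch of the
split-and-upload step. The file and directory names (part-, input8) are
placeholders, not necessarily what I used:

  # split sample.txt into 8 roughly equal, line-aligned pieces
  split -l $(( ($(wc -l < sample.txt) / 8) + 1 )) sample.txt part-
  # upload the pieces into a DFS input directory
  bin/hadoop dfs -mkdir input8
  bin/hadoop dfs -put part-* input8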

I should summarize my hadoop-site.xml file:

mapred.tasktracker.tasks.maximum = 4
mapred.line.input.format.linespermap = 1
mapred.task.timeout = 0
mapred.min.split.size = 1
mapred.child.java.opts = -Xmx20000M
io.sort.factor = 200
io.sort.mb = 100
fs.inmemory.size.mb = 200
mapred.inmem.merge.threshold = 1000
dfs.replication = 1
mapred.reduce.parallel.copies = 5
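
For reference, each of these is an ordinary <property> element in
hadoop-site.xml; here is a sketch of the first one, with the rest following
the same pattern:

  <configuration>
    <property>
      <name>mapred.tasktracker.tasks.maximum</name>
      <value>4</value>
    </property>
    <!-- ...the remaining properties are set the same way... -->
  </configuration>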

I know the mapred.child.java.opts value is a little ridiculous, but I was
just playing around to see what could possibly make things faster. For some
reason, that did.

Nick, I'm going to try larger files and get back to you.

-SM

On Thu, Mar 5, 2009 at 10:37 AM, Nick Cen <cenyongh@gmail.com> wrote:

> Try splitting your sample.txt into multiple files and try it again. For a
> text input format, the number of map tasks is determined by the input size
> and the number of files.
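> (If I remember correctly, FileInputFormat computes the split size as
> roughly max(mapred.min.split.size, min(totalSize / numMapTasks, blockSize)),
> so the -m value you pass is only a hint; the actual number of map tasks
> is the number of splits.)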
>
>
> 2009/3/6 Sandy <snickerdoodle08@gmail.com>
>
> > I used three different sample.txt files and was able to replicate the
> > problem. The first was 1.5MB, the second 66MB, and the last 428MB. I get
> > the same behavior regardless of the input file size: the running time of
> > wordcount increases with the number of mappers and reducers specified. If
> > the input file is the problem, how big does it have to be before the
> > problem disappears entirely?
> >
> > If pseudo-distributed mode is the issue, what mode should I be running
> > on my machine, given its specs? Once again, it is a SINGLE Mac Pro with
> > 16GB of RAM, 4 1TB hard disks, and 2 quad-core processors.
> >
> > I'm not sure if it's HADOOP-2771, since the sort/merge (shuffle) is what
> > seems to be taking the longest:
> > 2 M/R ==> map: 18 sec, shuffle: 15 sec, reduce: 9 sec
> > 4 M/R ==> map: 19 sec, shuffle: 37 sec, reduce: 2 sec
> > 8 M/R ==> map: 21 sec, shuffle: 1 min 10 sec, reduce: 1 sec
> >
> > To make sure it's not the combiner, I removed it and reran everything,
> > and got the same bottom line: with increasing maps and reducers, running
> > time goes up, with the majority of the time apparently spent in
> > sort/merge.
> >
> > Another thing we noticed is that the CPUs seem to be very active during
> > the map phase, but once the map phase reaches 100% and only the reduce
> > appears to be running, the CPUs all become idle. Furthermore, regardless
> > of the number of mappers I specify, all the CPUs become very active while
> > a job is running. Why is this so? If I specify 2 mappers and 2 reducers,
> > shouldn't just 2 or 4 CPUs be active? Why are all 8 active?
> >
> > Since I can reproduce this using Hadoop's standard wordcount example, I
> > was hoping someone else could tell me whether they can reproduce it too.
> > When you increase the number of mappers and reducers on your systems,
> > does the running time of wordcount go up as well?
> >
> > Thanks for the help! I'm looking forward to your responses.
> >
> > -SM
> >
> > On Thu, Mar 5, 2009 at 2:57 AM, Amareshwari Sriramadasu <amarsri@yahoo-inc.com> wrote:
> >
> > > Are you hitting HADOOP-2771?
> > > -Amareshwari
> > >
> > > Sandy wrote:
> > >
> > >> Hello all,
> > >>
> > >> For the sake of benchmarking, I ran the standard Hadoop wordcount
> > >> example on an input file using 2, 4, and 8 mappers and reducers for
> > >> my job. In other words, I do:
> > >>
> > >> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 2 -r 2 sample.txt output
> > >> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 4 -r 4 sample.txt output2
> > >> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 8 -r 8 sample.txt output3
> > >>
> > >> Strangely enough, this increase in mappers and reducers results in
> > >> slower running times!
> > >> - on 2 mappers and reducers it ran for 40 seconds
> > >> - on 4 mappers and reducers it ran for 60 seconds
> > >> - on 8 mappers and reducers it ran for 90 seconds!
> > >>
> > >> Please note that the "sample.txt" file is identical in each of these
> > >> runs.
> > >>
> > >> I have the following questions:
> > >> - Shouldn't wordcount get -faster- with additional mappers and
> > >> reducers, instead of slower?
> > >> - If it does get faster for other people, why does it become slower
> > >> for me? I am running Hadoop in pseudo-distributed mode on a single
> > >> 64-bit Mac Pro with 2 quad-core processors, 16 GB of RAM, and 4 1TB
> > >> HDs.
> > >>
> > >> I would greatly appreciate it if someone could explain this behavior
> > >> to me and tell me if I'm running this wrong. How can I change my
> > >> settings (if at all) to get wordcount running faster when I increase
> > >> the number of maps and reduces?
> > >>
> > >> Thanks,
> > >> -SM
>
> --
> http://daily.appspot.com/food/
>
