hadoop-common-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Map performance with custom binary format
Date Tue, 28 Jul 2009 18:13:12 GMT
Scott's advice is sage.  There is bound to be a smoking gun here.  Run
Ganglia and look at the machine and Hadoop stats.

It looks like you have done a very nice job to this point, and if the maps
are pulling data from local disk, then I would expect performance to be
comparable to your single-machine numbers.  If, for some (as yet)
inexplicable reason, the data is not local to the maps, then performance
would be limited.  You should be able to see on the job status page how many
maps were local to the data.

Another issue might be that too many maps are being started, so that each
split is small.  It would be worth logging the split sizes.
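
For example, one log line in the record reader's constructor makes the split
sizes visible (a minimal sketch against the 0.18-era mapred API; the class
name and constructor signature are stand-ins for your custom reader):

  import java.io.IOException;
  import org.apache.commons.logging.Log;
  import org.apache.commons.logging.LogFactory;
  import org.apache.hadoop.mapred.FileSplit;
  import org.apache.hadoop.mapred.JobConf;

  public class ProtobufRecordReader {
    private static final Log LOG =
        LogFactory.getLog(ProtobufRecordReader.class);

    public ProtobufRecordReader(JobConf conf, FileSplit split)
        throws IOException {
      // If most splits come in far below the 128MB block size, per-task
      // startup overhead will dominate the measured throughput.
      LOG.info("split=" + split.getPath() + " start=" + split.getStart()
          + " length=" + split.getLength());
      // ... open split.getPath() and seek to split.getStart() as usual ...
    }
  }

If that log shows hundreds of small splits, the per-task overhead is likely
your smoking gun.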

On Tue, Jul 28, 2009 at 9:58 AM, Scott Carey <scott@richrelevance.com> wrote:

> Well, the first thing to do in any performance bottleneck investigation is
> to look at the machine hardware resource usage.
>
> During your test, what is the CPU use and disk usage?  What about network
> utilization?
> top, vmstat, iostat, and some network usage monitoring would be useful.  It
> could be many things causing your lack of scalability, but without actually
> monitoring your machines to see if there is an obvious bottleneck, it's just
> random guessing and hunches.
>
>
>
> On 7/28/09 8:18 AM, "william kinney" <william.kinney@gmail.com> wrote:
>
> > Hi,
> >
> > Thanks in advance for the help!
> >
> > I have a performance question relating to how well I can expect Hadoop
> > to scale. I'm running Cloudera's 0.18.3-10.
> >
> > I have a custom binary format, which is just Google Protocol Buffer
> > (protobuf) serialized data:
> >
> >   669 files, ~30 GB total size (ranging from 10 MB to 100 MB each).
> >   128 MB block size.
> >   10 Hadoop nodes.
> >
> > I tested the InputFormat and RecordReader for my input format, and they
> > showed about 56 MB/s (single thread, no Hadoop, test file passed in via
> > a FileInputStream instead of an FSDataInputStream) on hardware similar
> > to what I have in my cluster.
> > I also then tested some simple map logic along with the above, and got
> > around 54 MB/s. I believe the difference can be accounted for by the
> > parsing of the protobuf data into Java objects.
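> >
> > (For reference, that standalone test is essentially the shape below. A
> > minimal sketch: it assumes the records are length-delimited protobufs,
> > and "MyRecord" is a stand-in for my generated message class.)
> >
> >   import java.io.BufferedInputStream;
> >   import java.io.FileInputStream;
> >   import java.io.InputStream;
> >
> >   public class ParseThroughputTest {
> >     public static void main(String[] args) throws Exception {
> >       // Read straight from the local filesystem, as in the test.
> >       InputStream in =
> >           new BufferedInputStream(new FileInputStream(args[0]), 65536);
> >       long bytes = 0, records = 0;
> >       long start = System.currentTimeMillis();
> >       MyRecord rec;  // hypothetical generated protobuf message class
> >       while ((rec = MyRecord.parseDelimitedFrom(in)) != null) {
> >         records++;
> >         bytes += rec.getSerializedSize();  // ignores varint length prefixes
> >       }
> >       in.close();
> >       double secs = (System.currentTimeMillis() - start) / 1000.0;
> >       System.out.printf("%d records, %.1f MB/s%n",
> >           records, bytes / 1048576.0 / secs);
> >     }
> >   }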
> >
> > Anyway, when I put this logic into a job that has
> >   - no reduce (.setNumReduceTasks(0);)
> >   - no emit
> >   - just protobuf parsing calls (like above)
> >
> > I get a finish time of 10 min 25 sec, which is about 106.24 MB/s.
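> >
> > (In driver form, that job is essentially the sketch below, against the
> > 0.18 mapred API; ProtobufInputFormat and ParseOnlyMapper are stand-ins
> > for my custom classes.)
> >
> >   import org.apache.hadoop.fs.Path;
> >   import org.apache.hadoop.mapred.FileInputFormat;
> >   import org.apache.hadoop.mapred.JobClient;
> >   import org.apache.hadoop.mapred.JobConf;
> >   import org.apache.hadoop.mapred.lib.NullOutputFormat;
> >
> >   public class ParseOnlyJob {
> >     public static void main(String[] args) throws Exception {
> >       JobConf conf = new JobConf(ParseOnlyJob.class);
> >       conf.setJobName("protobuf-parse-only");
> >       conf.setNumReduceTasks(0);                     // no reduce phase
> >       conf.setMapperClass(ParseOnlyMapper.class);    // parses, never emits
> >       conf.setInputFormat(ProtobufInputFormat.class);
> >       conf.setOutputFormat(NullOutputFormat.class);  // nothing is written
> >       FileInputFormat.setInputPaths(conf, new Path(args[0]));
> >       JobClient.runJob(conf);
> >     }
> >   }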
> >
> > So my question: why is the rate only about 2x what I see in the
> > single-threaded, non-Hadoop test? Shouldn't it be roughly:
> >   54 MB/s x 10 (number of nodes) - a small Hadoop overhead?
> >
> > Is there any area of my configuration I should look into for tuning?
> >
> > Is there any way I could get more accurate performance monitoring of my
> > job?
> >
> > On a side note, I tried the same job after combining the files into
> > about 11 files (still 30 GB in size), and actually saw a decrease in
> > performance (~90 MB/s).
> >
> > Any help is appreciated. Thanks!
> >
> > Will
> >
> > some hadoop-site.xml values:
> >   dfs.replication                       3
> >   io.file.buffer.size                   65536
> >   dfs.datanode.handler.count            3
> >   mapred.tasktracker.map.tasks.maximum  6
> >   dfs.namenode.handler.count            5
> >
>
>


-- 
Ted Dunning, CTO
DeepDyve
