hadoop-common-user mailing list archives

From Scott Carey <sc...@richrelevance.com>
Subject Re: Map performance with custom binary format
Date Tue, 28 Jul 2009 16:58:03 GMT
Well, the first thing to do in any performance bottleneck investigation is
to look at the machine hardware resource usage.

During your test, what is the CPU use and disk usage?  What about network
I/O?  Top, vmstat, iostat, and some network usage monitoring would be
useful.  Many things could be causing your lack of scalability, but without
actually monitoring your machines to see if there is an obvious bottleneck,
it's just random guessing and hunches.
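For example, a crude sampling loop over /proc on each worker node while the
job runs (a sketch, assuming Linux; sysstat's vmstat/iostat report the same
counters with nicer formatting, and the interval, sample count, and device
name pattern below are illustrative, not from this thread):

```shell
#!/bin/sh
# Sample CPU, disk, and network counters a few times during a job run.
# Reads /proc directly so it works even where sysstat is not installed.

INTERVAL=2   # seconds between samples (illustrative)
COUNT=2      # number of samples (illustrative)

for i in $(seq 1 "$COUNT"); do
  date
  # CPU: aggregate jiffies (user nice system idle iowait irq softirq ...)
  grep '^cpu ' /proc/stat
  # Disk: per-device sector counters (device pattern is illustrative;
  # adjust for vda/nvme/etc. as needed)
  awk '$3 ~ /^sd[a-z]$/ {print $3, "sectors_read:", $6, "sectors_written:", $10}' \
      /proc/diskstats
  # Network: cumulative bytes in/out per interface
  awk -F'[: ]+' 'NR > 2 {print $2, "rx_bytes:", $3, "tx_bytes:", $11}' \
      /proc/net/dev
  sleep "$INTERVAL"
done
```

Run it on each node and compare two samples: if the CPU is pegged, the job
is compute-bound; if the disk or network counters are climbing fast while
CPU is idle, the bottleneck is I/O.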

On 7/28/09 8:18 AM, "william kinney" <william.kinney@gmail.com> wrote:

> Hi,
> Thanks in advance for the help!
> I have a performance question relating to how fast I can expect Hadoop
> to scale. Running Cloudera's 0.18.3-10.
> I have a custom binary format, which is just Google Protocol Buffers
> (protobuf) serialized data:
>   669 files, ~30GB total size (ranging 10MB to 100MB each).
>   128MB block size.
>   10 Hadoop Nodes.
> I tested the InputFormat and RecordReader for my format, and the test
> showed about 56MB/s throughput (single thread, no Hadoop, test file
> passed in via FileInputFormat instead of FSDataInputStream) on
> hardware similar to what I have in my cluster.
> I also then tested some simple Map logic along with the above, and got
> around 54MB/s. I believe the difference can be accounted for by parsing
> the protobuf data into Java objects.
> Anyways, when I put this logic into a job that has
>   - no reduce (.setNumReduceTasks(0);)
>   - no emit
>   - just protobuf parsing calls (like above)
> I get a finish time of 10 min, 25 sec, which is about 106.24 MB/s.
> So my question: why is the rate only 2x what I see in a single-thread,
> non-Hadoop test? Shouldn't it be:
>   54MB/s x 10 (num nodes) - small Hadoop overhead ?
> Is there any area of my configuration I should look into for tuning?
> Is there any way I could get more accurate performance monitoring of my job?
> On a side note, I tried the same job after combining the files into
> about 11 files (still 30GB in size), and actually saw a decrease in
> performance (~90MB/s).
> Any help is appreciated. Thanks!
> Will
> some hadoop-site.xml values:
> dfs.replication  3
> io.file.buffer.size   65536
> dfs.datanode.handler.count  3
> mapred.tasktracker.map.tasks.maximum  6
> dfs.namenode.handler.count  5
