hadoop-common-dev mailing list archives

From "Ravi Gummadi (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-2774) Add counters to show number of key/values that have been sorted and merged in the maps and reduces
Date Mon, 24 Nov 2008 16:43:44 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Ravi Gummadi updated HADOOP-2774:

    Attachment: HADOOP-2774.patch

Thanks Chris for the comments.

(1) Would it work if you used a smaller io.sort.mb and calibrated the size of your data to
trigger a fixed number of spills? In the current version, spills should be triggered based
on the number of records, which is a property the test isn't controlling strictly.

Yes, it would work. I reduced the size of the input files by a factor of 100 and set io.sort.mb=1.

(2) Why run the combiner? Isn't each word coming out of each map unique? 

I wanted to exercise the code path where the combiner is called in both the map and reduce
phases, so I kept the combiner. Words are repeated (a fixed number of times) in the input files.

(3) It might be necessary to set mapred.child.java.opts explicitly to make sure the memory
limit stays fixed, even for different client configurations. Does it not work with
mapred.job.shuffle.buffer.percent = 0?

OK. I am now setting mapred.child.java.opts explicitly, and I set mapred.job.shuffle.buffer.percent=0.

(4) The test cannot create its scratch directory in the working dir. It should use the test.build.data
property as the root for its temporary data. It should also clean up when the test completes.
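
As an illustration of point (4), here is a minimal sketch of a scratch directory rooted at the test.build.data system property and removed when the test finishes. It is not taken from the attached patch: the class name ScratchDirSketch, both method names, and the fallback to java.io.tmpdir are assumptions made for this example.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Hypothetical helper, not part of HADOOP-2774.patch.
public final class ScratchDirSketch {
    private ScratchDirSketch() {}

    // Creates <test.build.data>/<name>, falling back to the JVM temp dir
    // (the fallback is an assumption of this sketch).
    public static Path createScratchDir(String name) throws IOException {
        String root = System.getProperty("test.build.data",
                                         System.getProperty("java.io.tmpdir"));
        Path dir = new File(root, name).toPath();
        return Files.createDirectories(dir);
    }

    // Deletes the directory and everything under it; children are removed
    // before their parents because the walk is sorted in reverse order.
    public static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        try (Stream<Path> paths = Files.walk(dir)) {
            paths.sorted(Comparator.reverseOrder())
                 .forEach(p -> p.toFile().delete());
        }
    }
}
```

A test would typically call createScratchDir() during setup and deleteRecursively() in a finally block or teardown method, so the data is removed even when an assertion fails.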


(5) testCounters looks like a unit test and only emits log messages. It seems unnecessary and
less readable than putting the asserts inline with the unit test.

OK. I renamed the method to validateCounters. Since testSpillCounter() runs two jobs (one
with 3 input files and another with 4 input files), validateCounters() is called twice, so I
did not inline it. I hope it is more readable now.

Attached the patch with the above changes. Please review and provide your comments.

> Add counters to show number of key/values that have been sorted and merged in the maps and reduces
> --------------------------------------------------------------------------------------------------
>                 Key: HADOOP-2774
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2774
>             Project: Hadoop Core
>          Issue Type: Bug
>            Reporter: Owen O'Malley
>            Assignee: Ravi Gummadi
>             Fix For: 0.20.0
>         Attachments: HADOOP-2774.patch, HADOOP-2774.patch, HADOOP-2774.patch, HADOOP-2774.patch
> For each *pass* of the sort and merge, I would like a count of the number of records.
> So for example, if the map output 100 records and they were sorted once, the counter would
> be 100. If it spilled twice and was merged together, it would be 200. Clearly in a multi-level
> merge, it may not be a multiple of the number of map output records. This would let the users
> easily see if they have values like io.sort.mb or io.sort.factor set too low.
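
The arithmetic in the description can be sketched with a small model. This is a simplified illustration and not Hadoop's actual merge code: it assumes every record is counted once when it is first sorted and spilled, and once more each time it passes through a merge, with at most `factor` segments merged per pass (playing the role of io.sort.factor). The class and method names are hypothetical.

```java
import java.util.PriorityQueue;

// Hypothetical model of the requested counter; not Hadoop's Merger.
public final class SortPassCounterSketch {
    private SortPassCounterSketch() {}

    /**
     * @param spillSizes number of records in each initial spill
     * @param factor     maximum number of segments merged in one pass
     * @return total records written across the spills and all merge passes
     */
    public static long recordsProcessed(long[] spillSizes, int factor) {
        if (factor < 2) {
            throw new IllegalArgumentException("factor must be >= 2");
        }
        PriorityQueue<Long> segments = new PriorityQueue<>();
        long total = 0;
        for (long size : spillSizes) {
            segments.add(size);
            total += size;            // initial sort-and-spill pass
        }
        // Merge the smallest segments first until one segment remains.
        while (segments.size() > 1) {
            int n = Math.min(factor, segments.size());
            long merged = 0;
            for (int i = 0; i < n; i++) {
                merged += segments.poll();
            }
            total += merged;          // each merged record is counted again
            segments.add(merged);
        }
        return total;
    }
}
```

Under this model, a single spill of 100 records gives 100, and two spills of 50 records merged once give 200, matching the description. Spills of 40, 30, and 30 records with a factor of 2 give 260, which is not a multiple of 100 and shows the multi-level case.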

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
