hadoop-common-dev mailing list archives

From "Chris Douglas (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-2774) Add counters to show number of key/values that have been sorted and merged in the maps and reduces
Date Thu, 20 Nov 2008 08:58:44 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649304#action_12649304 ]

Chris Douglas commented on HADOOP-2774:
---------------------------------------

(1) IFile is a package-private class, no? If it's not visible outside of the mapred package,
using other components in that package doesn't seem like pollution to me. If the class responsible
for emitting records counts the number of records it emits using the class responsible for
counting, that seems like a victory for reuse and coherence. Is there a particular reason
the two should be kept separate? Indiscriminate coupling of unrelated components is to be
avoided, certainly, but this strikes me as a plain win.

(2) Just to make sure I understand: the proposal is to add a new interface, IFileDiskOperationsMonitor,
which Task would implement. The Task would pass a reference to itself (or an inner class would
pass a reference to its containing instance) to the IFile.{Reader,Writer} instance, which
would hold that reference until it closes, when it would pass its internal count to the IFileDiskOperationsMonitor
instance, which would update the counter. Is that correct?
* Why the indirection? Why create a new type for monitoring disk operations passing through
a particular, intermediate format, a format already limited to a package that already contains
a type that implements a superset of the new type's functionality?
* Task should not gain a new interface each time we want to track a new quantity or type of
quantity. Further, the IFile format is not part of the Task type. Metrics from the IFile format
certainly are not.
* If the issue is performance, there's no reason why the counter can't be updated in the same
way.
* Passing a named counter in some contexts is far more readable than passing *Task.this in
some contexts, but not in others.
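
To make the two wirings in (2) concrete, here is a minimal, self-contained sketch. It is not the patch under review and does not use the real mapred classes: Counter, MonitoredWriter, CountingWriter and Task are hypothetical stand-ins, and the method name recordsWritten on IFileDiskOperationsMonitor is an assumption, since the proposal's method signatures are not quoted here.

```java
public class CounterWiringSketch {
    /** Hypothetical stand-in for a named counter; not the real Counters class. */
    public static final class Counter {
        private long value;
        public void increment(long n) { value += n; }
        public long get() { return value; }
    }

    /** The proposed monitor interface; the method name is assumed for illustration. */
    public interface IFileDiskOperationsMonitor {
        void recordsWritten(long n);
    }

    /** Proposed wiring: the writer holds a monitor and reports its count on close. */
    public static final class MonitoredWriter implements AutoCloseable {
        private final IFileDiskOperationsMonitor monitor;
        private long written;
        public MonitoredWriter(IFileDiskOperationsMonitor monitor) { this.monitor = monitor; }
        public void append(String key, String value) { written++; }
        @Override public void close() { monitor.recordsWritten(written); }
    }

    /** Alternative wiring: the writer is handed the named counter directly. */
    public static final class CountingWriter implements AutoCloseable {
        private final Counter counter;
        private long written;
        public CountingWriter(Counter counter) { this.counter = counter; }
        public void append(String key, String value) { written++; }
        @Override public void close() { counter.increment(written); }
    }

    /** Hypothetical task that must implement the monitor only to forward to its counter. */
    public static final class Task implements IFileDiskOperationsMonitor {
        public final Counter spilledRecords = new Counter();
        @Override public void recordsWritten(long n) { spilledRecords.increment(n); }
    }

    /** Writes n records through the monitor-based writer and returns the counter value. */
    public static long viaMonitor(int n) {
        Task task = new Task();
        try (MonitoredWriter w = new MonitoredWriter(task)) {
            for (int i = 0; i < n; i++) w.append("k" + i, "v" + i);
        }
        return task.spilledRecords.get();
    }

    /** Writes n records through the counter-based writer and returns the counter value. */
    public static long viaCounter(int n) {
        Counter spilledRecords = new Counter();
        try (CountingWriter w = new CountingWriter(spilledRecords)) {
            for (int i = 0; i < n; i++) w.append("k" + i, "v" + i);
        }
        return spilledRecords.get();
    }

    public static void main(String[] args) {
        System.out.println(viaMonitor(3)); // 3
        System.out.println(viaCounter(3)); // 3
    }
}
```

Both paths update the counter once, at close, so the sketch shows no difference in how often the counter is touched. The only difference is the extra type and the forwarding method on Task.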

bq. This seems fairly generic and in the future could be used by any other potential user
of Ifile/Merger classes (outside MapReduce)
And yet it is emphatically less generic than the Counters, which already provide an interface
to consumers outside the mapred package.

> Add counters to show number of key/values that have been sorted and merged in the maps and reduces
> --------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-2774
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2774
>             Project: Hadoop Core
>          Issue Type: Bug
>            Reporter: Owen O'Malley
>            Assignee: Ravi Gummadi
>             Fix For: 0.20.0
>
>         Attachments: HADOOP-2774.patch, HADOOP-2774.patch
>
>
> For each *pass* of the sort and merge, I would like a count of the number of records.
> So for example, if the map output 100 records and they were sorted once, the counter would
> be 100. If it spilled twice and was merged together, it would be 200. Clearly in a multi-level
> merge, it may not be a multiple of the number of map output records. This would let the users
> easily see if they have values like io.sort.mb or io.sort.factor set too low.
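
The arithmetic in the issue description can be sketched as follows. This is a simplified model, not Hadoop's actual Merger: it assumes every spill is sorted once and that each merge pass combines the smallest segments, up to a hypothetical factor parameter standing in for io.sort.factor. The class and method names are made up for illustration.

```java
import java.util.PriorityQueue;

public class SortMergePassCount {
    /**
     * Returns the number of records handled across all sort and merge passes.
     * spillSizes holds the record count of each sorted spill; factor is the
     * maximum number of segments merged in one pass (must be at least 2).
     */
    public static long recordsProcessed(long[] spillSizes, int factor) {
        if (factor < 2) throw new IllegalArgumentException("factor must be >= 2");
        long total = 0;
        PriorityQueue<Long> segments = new PriorityQueue<>();
        for (long size : spillSizes) {
            total += size;          // each record is sorted once within its spill
            segments.add(size);
        }
        while (segments.size() > 1) {
            long merged = 0;
            for (int i = 0; i < factor && !segments.isEmpty(); i++) {
                merged += segments.poll();
            }
            total += merged;        // every record in this merge pass is counted again
            segments.add(merged);
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(recordsProcessed(new long[] {100}, 10));        // 100
        System.out.println(recordsProcessed(new long[] {50, 50}, 10));     // 200
        System.out.println(recordsProcessed(new long[] {40, 30, 30}, 2));  // 260
    }
}
```

The last call models a multi-level merge: with a factor of 2, the two 30-record spills are merged first (60 records), then merged with the 40-record spill (100 records), giving 260, which is not a multiple of the 100 map output records.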

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

