mahout-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Hudson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAHOUT-1417) Random decision forest implementation fails in Hadoop 2
Date Sun, 16 Feb 2014 06:27:19 GMT

    [ https://issues.apache.org/jira/browse/MAHOUT-1417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13902646#comment-13902646
] 

Hudson commented on MAHOUT-1417:
--------------------------------

FAILURE: Integrated in Mahout-Quality #2477 (See [https://builds.apache.org/job/Mahout-Quality/2477/])
MAHOUT-1417: Random decision forest implementation fails in Hadoop 2 (smarthi: rev 1568726)
* /mahout/trunk/CHANGELOG
* /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/df/mapreduce/partial/PartialBuilder.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/df/mapreduce/partial/Step1Mapper.java


> Random decision forest implementation fails in Hadoop 2
> -------------------------------------------------------
>
>                 Key: MAHOUT-1417
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1417
>             Project: Mahout
>          Issue Type: Bug
>          Components: Classification
>    Affects Versions: 0.7, 0.8, 0.9
>         Environment: CDH 4.5.0.1 + Mahout 0.7+patches
>            Reporter: Sean Owen
>              Labels: classifier, random-decision-forests, rdf
>             Fix For: 1.0
>
>         Attachments: MAHOUT-1417.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> We've observed two errors in the RDF implementation, one of which stops it from working
on Hadoop 2 (at least I think it is Hadoop 2 only), and one of which just makes the workload
quite imbalanced.
> A key piece of logic in PartialBuilder.java queries mapred.map.tasks to know the total
number of mappers. However this has never been guaranteed to be set to the number of mappers;
it is how a caller sets a default number of mappers, which may be overridden by Hadoop, and
which defaults to 1. 
> I suspect that this may have actually been set, in some or all cases, to the number of
mappers in Hadoop 1, but I am not sure. Certainly, sometimes it will happen to be set to a
value that equals the number of mappers used.
> But when it doesn't it causes the distribution of trees to mappers to be quite wrong.
For example, with 20 trees and 8 mappers in one example, I find that mapred.map.tasks=1. Logging
messages indicate that mapper 0 handles all trees (0-19), mapper 1 handles non-existent 20-39,
etc.
> The result is that most mappers do nothing and one does everything. This results in empty
part-m-xxxxx files. And, that in turn fails the job. (This part I also suspect is new, or
situation-specific, behavior in Hadoop 2. In any event, this code should never have idle mappers
and fixing that avoids whatever is going on there.)
> There's a second less serious issue in how trees are assigned to mappers. When the number
of trees is not a multiple of the number of mappers, the remainer is assigned entirely to
mapper 0. So with 20 trees and 8 mappers, all mappers build 2 trees, but mapper 0 builds 6.
This is unnecessarily imbalanced.
> Patch coming once I can verify the fix, but current proposal is to:
> - Compute the number of maps ahead of time using TextInputFormat and set mapred.map.tasks
> - Fix the method that computes trees per mapper to spread as evenly as possible (i.e.
all mappers build either N or N+1 trees)



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

Mime
View raw message