Mailing-List: contact mapreduce-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: mapreduce-issues@hadoop.apache.org
Date: Sat, 2 May 2015 23:33:06 +0000 (UTC)
From: "Hadoop QA (JIRA)" <jira@apache.org>
To: mapreduce-issues@hadoop.apache.org
Message-ID: <JIRA.12729456.1406208642000.62768.1430609586191@Atlassian.JIRA>
In-Reply-To: <JIRA.12729456.1406208642000@Atlassian.JIRA>
References: <JIRA.12729456.1406208642000@Atlassian.JIRA>
 <JIRA.12729456.1406208642941@arcas>
Subject: [jira] [Commented] (MAPREDUCE-6003) Resource Estimator suggests
 huge map output in some cases
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/MAPREDUCE-6003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14525549#comment-14525549 ] 

Hadoop QA commented on MAPREDUCE-6003:
--------------------------------------

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | patch |   0m  0s | The patch command could not apply the patch during dryrun. |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12657803/MAPREDUCE-6003-branch-1.2.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | branch-1 / 5f5138e |
| Console output | https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5608/console |


This message was automatically generated.

> Resource Estimator suggests huge map output in some cases
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-6003
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6003
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: jobtracker
>    Affects Versions: 1.2.1
>            Reporter: Chengbing Liu
>            Assignee: Chengbing Liu
>         Attachments: MAPREDUCE-6003-branch-1.2.patch
>
>
> In some cases, ResourceEstimator can return a map output estimate that is far too large. This happens when the input size is not calculated correctly.
> A typical case is joining two Hive tables, one in HDFS and the other in HBase. The maps that process the HBase table finish first and report an input length of 0 because of its TableInputFormat. For a map that processes the HDFS table, the estimated output size is then very large because of the wrong input size, so the map task cannot be assigned.
> There are two possible solutions to this problem:
> (1) Make input size correct for each case, e.g. HBase, etc.
> (2) Use another algorithm to estimate the map output, or at least make it closer to reality.
> I prefer the second approach, since the first would require every possible input type to be handled, which is not easy for some inputs such as URIs.
> In my opinion, we could make a second estimation that is independent of the input size:
> estimationB = (completedMapOutputSize / completedMaps) * totalMaps * 10
> Here, multiplying by 10 makes the estimation more conservative, so that a task is less likely to be assigned somewhere that is not big enough.
> The existing estimation goes like this:
> estimationA = (inputSize * completedMapOutputSize * 2.0) / completedMapInputSize
> My suggestion is to take the minimum of the two estimations:
> estimation = min(estimationA, estimationB)
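
The two formulas quoted above can be sketched in Java roughly as follows. This is only an illustration of the arithmetic in the description, not the code in the attached patch or in Hadoop's ResourceEstimator; the class name, method names, and the numbers in main are hypothetical, and double arithmetic is assumed so that a zero input size yields infinity instead of an exception.

```java
// Hypothetical sketch of the proposal in MAPREDUCE-6003; not the actual patch.
final class MapOutputEstimateSketch {

    // estimationA: scale the completed output by the ratio of total input
    // to completed input, doubled, as in the existing formula quoted above.
    static double estimationA(double inputSize,
                              double completedMapOutputSize,
                              double completedMapInputSize) {
        return (inputSize * completedMapOutputSize * 2.0) / completedMapInputSize;
    }

    // estimationB: average output per completed map, times the total number
    // of maps, times 10 as the conservative factor. Independent of input size.
    static double estimationB(double completedMapOutputSize,
                              int completedMaps,
                              int totalMaps) {
        return (completedMapOutputSize / completedMaps) * totalMaps * 10;
    }

    // Proposed estimate: the smaller of the two.
    static double estimation(double inputSize,
                             double completedMapOutputSize,
                             double completedMapInputSize,
                             int completedMaps,
                             int totalMaps) {
        double a = estimationA(inputSize, completedMapOutputSize, completedMapInputSize);
        double b = estimationB(completedMapOutputSize, completedMaps, totalMaps);
        return Math.min(a, b);
    }

    public static void main(String[] args) {
        // Made-up numbers: 10 of 1000 maps are done, they wrote 1 GB in total,
        // but reported almost no input (the HBase case in the description).
        double inputSize = 1e12;
        double completedMapOutputSize = 1e9;
        double completedMapInputSize = 10;
        int completedMaps = 10;
        int totalMaps = 1000;

        System.out.println("estimationA = "
            + estimationA(inputSize, completedMapOutputSize, completedMapInputSize));
        System.out.println("estimationB = "
            + estimationB(completedMapOutputSize, completedMaps, totalMaps));
        System.out.println("estimation  = "
            + estimation(inputSize, completedMapOutputSize, completedMapInputSize,
                         completedMaps, totalMaps));
    }
}
```

With these made-up numbers, estimationA is many orders of magnitude larger than estimationB, so the minimum falls back to estimationB. When the completed input size is reported correctly, estimationA is usually the smaller value and the result is unchanged from the existing behavior.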


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)