hadoop-mapreduce-issues mailing list archives

From "Ruyue Ma (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-1667) resource estimation still works badly in some cases
Date Fri, 02 Apr 2010 09:24:27 GMT

https://issues.apache.org/jira/browse/MAPREDUCE-1667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852765#action_12852765

Ruyue Ma commented on MAPREDUCE-1667:

The following mail is from the hadoop mailing list. It shows the same bug.

HBase mapreduce and ResourceEstimator

Dmitry Chechik
Wed, 24 Mar 2010 13:27:08 -0700

Hi all,

We have an issue that occasionally crops up in the following scenario:
1. We have a fairly small HBase table. (say 400M)
2. We have a larger set of input from HDFS (say 1G bytes)

We run a mapreduce that joins this input (i.e., some of the mappers read
from HDFS, and some read from HBase).
The mappers that read from HBase all have TableSplits, which return 0 for
getLength(). The HDFS mappers have a non-zero getLength(), which is roughly
the file size of the HDFS input.

Because of this, the total input size of the job is roughly the size of the
HDFS input. Since the HBase table is small, the HBase mappers often finish
first. In that case, ResourceEstimator thinks that completedMapsInputSize is
near 0 (actually, it's equal to the number of HBase mapper tasks, which is
on the order of tens for us), but completedMapsOutputSize is fairly
large (since it's the actual bytes output by the HBase mappers). So the
ResourceEstimator thinks that the estimated total map output size is

inputSize * completedMapsOutputSize * 2 / completedMapsInputSize
= (1G) * (400M * 2) / (32)

So the total estimated map output size is very large and the HDFS tasks wind
up pending, because we don't have enough resources.
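The arithmetic described above can be reproduced with a small sketch (the method and variable names here are illustrative, not Hadoop's actual fields; the assumption that each completed zero-length split contributes 1 to completedMapsInputSize is inferred from the "equal to the number of HBase mapper tasks" observation in the email):

```java
// Illustrative sketch of the over-estimation described in the email.
// With 32 completed HBase maps whose splits report length 0 (contributing
// ~1 byte each), a 400M map output, and 1G of total input, the estimate
// explodes to 25 PiB.
public class MapOutputEstimate {

    // inputSize * completedMapsOutputSize * 2 / completedMapsInputSize
    static long estimate(long totalInputSize,
                         long completedMapsOutputSize,
                         long completedMapsInputSize) {
        return (long) (((double) totalInputSize * completedMapsOutputSize * 2.0)
                       / completedMapsInputSize);
    }

    public static void main(String[] args) {
        long hdfsInput = 1L << 30;        // ~1G of HDFS input
        long hbaseMapOutput = 400L << 20; // ~400M emitted by the HBase maps
        long completedInput = 32L;        // 32 zero-length splits

        long est = estimate(hdfsInput, hbaseMapOutput, completedInput);
        System.out.println(est); // 28147497671065600 bytes, i.e. 25 PiB
    }
}
```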

Has anyone else run into this? One solution would be if TableSplit would
return a reasonable estimate of the size of each split, instead of 0, but it
looks like this isn't possible right now.
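A sketch of what that fix could look like, purely hypothetical: `RegionSizeTableSplit` and `estimatedRegionBytes` are invented names, not part of the HBase API the email discusses (which, as noted, offered no such estimate at the time):

```java
// Hypothetical sketch only: a split that reports an estimated region size
// instead of the constant 0, so the ResourceEstimator sees a realistic
// input size for HBase maps. Names are illustrative, not real HBase API.
public class RegionSizeTableSplit {
    private final long estimatedRegionBytes;

    public RegionSizeTableSplit(long estimatedRegionBytes) {
        this.estimatedRegionBytes = estimatedRegionBytes;
    }

    // Mirrors the InputSplit.getLength() contract: a best-effort
    // size of the split's data in bytes.
    public long getLength() {
        return estimatedRegionBytes; // rather than returning 0
    }
}
```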

- Dmitry

> resource estimation still works badly in some cases
> ---------------------------------------------------
>                 Key: MAPREDUCE-1667
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1667
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: jobtracker
>            Reporter: Ruyue Ma
>            Assignee: Ruyue Ma
> A premise of our current implementation of ResourceEstimator is that MapInputSize
> and MapOutputSize are correlated.
> In many use cases, this premise does not hold.
> e.g. 
> 1. The map input is a list of file names.
> 2. Each mapper downloads its file, processes it, then writes the result as map intermediate output.
> If one file name is very long, the estimated output size may be very large.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
