hadoop-mapreduce-dev mailing list archives

From "Ramkumar Vadali (JIRA)" <j...@apache.org>
Subject [jira] Created: (MAPREDUCE-1838) DistRaid map tasks have large variance in running times
Date Wed, 02 Jun 2010 21:37:38 GMT
DistRaid map tasks have large variance in running times
-------------------------------------------------------

                 Key: MAPREDUCE-1838
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1838
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
          Components: contrib/raid
    Affects Versions: 0.20.1
            Reporter: Ramkumar Vadali
            Priority: Minor


HDFS RAID uses map-reduce jobs to generate parity files for a set of source files. Each map
task gets a subset of files to operate on. The current code assigns files by walking through
the list of files given in the constructor of DistRaid.

The problem is that the list of files given to the constructor is in (pretty much) the order
of the directory listing. When a large number of files is added, neighboring files in that
order tend to have the same size. Thus one map task can end up with large files whereas
another can end up with small files, increasing the variance in run times.

We could do smarter assignment by using the file sizes.
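
One possible shape for such an assignment, as a rough sketch only: sort the files by size,
largest first, and always hand the next file to the map task with the fewest bytes so far
(the greedy "longest processing time first" heuristic). This is not the existing DistRaid
code. The class SizeAwareAssignment, its Bucket type and the assign method are hypothetical
names made up for illustration, and the sketch assumes file sizes are already known as a
plain long[] and that bytes are a reasonable proxy for map run time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical helper for illustration; not part of the actual DistRaid code.
final class SizeAwareAssignment {

    /** One map task's share of the work: indexes of its files and their total size. */
    static final class Bucket {
        final List<Integer> fileIndexes = new ArrayList<>();
        long totalBytes = 0;
    }

    /**
     * Greedy assignment: visit files from largest to smallest and give each
     * one to the bucket that currently holds the fewest bytes.
     */
    static List<Bucket> assign(long[] fileSizes, int numTasks) {
        if (numTasks <= 0) {
            throw new IllegalArgumentException("numTasks must be positive");
        }

        // Sort file indexes by descending size, leaving the input array untouched.
        Integer[] order = new Integer[fileSizes.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (a, b) -> Long.compare(fileSizes[b], fileSizes[a]));

        // Min-heap keyed on bytes assigned so far; the head is the lightest bucket.
        List<Bucket> buckets = new ArrayList<>();
        PriorityQueue<Bucket> lightestFirst =
            new PriorityQueue<>(Comparator.comparingLong((Bucket b) -> b.totalBytes));
        for (int t = 0; t < numTasks; t++) {
            Bucket b = new Bucket();
            buckets.add(b);
            lightestFirst.add(b);
        }

        for (int idx : order) {
            // Remove the bucket before changing its key, then re-insert it,
            // so the heap ordering stays valid.
            Bucket b = lightestFirst.poll();
            b.fileIndexes.add(idx);
            b.totalBytes += fileSizes[idx];
            lightestFirst.add(b);
        }
        return buckets;
    }
}
```

For example, with sizes {100, 100, 100, 100, 1, 1, 1, 1} and two tasks, walking the list in
order and splitting it in half gives one task 400 bytes and the other 4, while the sketch
above gives each task 202. The heuristic does not guarantee an optimal split, but it is cheap
and keeps the per-task totals close.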

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

