hadoop-common-dev mailing list archives

From "Johan Oskarson (JIRA)" <j...@apache.org>
Subject [jira] Created: (HADOOP-1054) Add more than one input file per map?
Date Thu, 01 Mar 2007 18:22:50 GMT
Add more than one input file per map?
-------------------------------------

                 Key: HADOOP-1054
                 URL: https://issues.apache.org/jira/browse/HADOOP-1054
             Project: Hadoop
          Issue Type: Improvement
          Components: mapred
    Affects Versions: 0.11.2
            Reporter: Johan Oskarson
            Priority: Trivial


I've got a problem with MapReduce overhead when it comes to small input files.

Roughly 100 MB comes into the DFS every few hours, and data related to that batch might then
be added on for another few weeks.
The problem is that this data arrives in files of roughly 4-5 kilobytes each, so for every
reasonably big file we might have 4-5 small ones.

As far as I understand it, each small file gets assigned a map task of its own. This causes
performance issues, since the per-task overhead is large relative to the amount of data in
such small files.

Would it be possible to have Hadoop assign multiple files to a single map task, up to a
configurable limit?
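
To make the idea concrete, below is a minimal sketch of the grouping step: pack input files
into per-map groups until a byte budget is reached, so small files share a task while big
files still get their own. All names here (SmallFileGrouper, InputFile, maxBytesPerMap) are
hypothetical and the 64 MB budget is just an example figure; this is not an existing Hadoop
API, only an illustration of the requested behaviour.

import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch of the requested behaviour: instead of one map task
 * per input file, pack files into groups until a configurable byte budget
 * per map task is reached. Not an existing Hadoop API.
 */
public class SmallFileGrouper {

    /** A single input file: path plus its length in bytes. */
    public static class InputFile {
        final String path;
        final long length;
        InputFile(String path, long length) {
            this.path = path;
            this.length = length;
        }
    }

    /**
     * Pack files into groups, one group per map task. A group is closed once
     * adding the next file would exceed maxBytesPerMap, except when the group
     * is still empty, so an oversized file simply gets a task of its own.
     */
    public static List<List<InputFile>> group(List<InputFile> files, long maxBytesPerMap) {
        List<List<InputFile>> groups = new ArrayList<List<InputFile>>();
        List<InputFile> current = new ArrayList<InputFile>();
        long currentBytes = 0;
        for (InputFile f : files) {
            if (!current.isEmpty() && currentBytes + f.length > maxBytesPerMap) {
                groups.add(current);
                current = new ArrayList<InputFile>();
                currentBytes = 0;
            }
            current.add(f);
            currentBytes += f.length;
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Illustrative mix: one reasonably big file plus five ~4 KB files.
        List<InputFile> files = new ArrayList<InputFile>();
        files.add(new InputFile("/data/batch-01/big.log", 100L * 1024 * 1024));
        for (int i = 0; i < 5; i++) {
            files.add(new InputFile("/data/batch-01/small-" + i + ".log", 4 * 1024));
        }
        // With a 64 MB budget per map, the big file gets its own task and the
        // five small files share a single task instead of five separate ones.
        List<List<InputFile>> groups = group(files, 64L * 1024 * 1024);
        System.out.println("map tasks needed: " + groups.size());  // prints 2
    }
}

A real implementation would presumably also want a file-count limit and would need to stay
aware of DFS block locations when forming groups, but the packing logic itself could stay
this simple.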

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

