hadoop-common-dev mailing list archives

From "Enis Soztutar (JIRA)" <j...@apache.org>
Subject [jira] Resolved: (HADOOP-1054) Add more than one input file per map?
Date Wed, 03 Oct 2007 13:29:53 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-1054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar resolved HADOOP-1054.

    Resolution: Duplicate

HADOOP-1515 does exactly the same. 

> Add more than one input file per map?
> -------------------------------------
>                 Key: HADOOP-1054
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1054
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.11.2
>            Reporter: Johan Oskarsson
>            Priority: Trivial
> I've got a problem with MapReduce overhead when it comes to small input files.
> Roughly 100 MB comes into the DFS every few hours. Afterwards, data related to that batch might be added for another few weeks.
> The problem is that this data is roughly 4-5 KB per file, so for every reasonably big file we might have 4-5 small ones.
> As far as I understand it, each small file will get assigned a task of its own. This causes performance issues, since the per-task overhead for such small files is pretty big.
> Would it be possible to have Hadoop assign multiple files to a map task up until a configurable
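
To illustrate the request quoted above, here is a minimal sketch of packing many small files into groups so that one map task could handle a whole group rather than a single file. It is not Hadoop code and is not taken from HADOOP-1515; the function name `group_files`, the `max_bytes` limit, and the sample file names are all hypothetical and chosen only for this illustration.

```python
from typing import Dict, List


def group_files(sizes: Dict[str, int], max_bytes: int) -> List[List[str]]:
    """Greedily pack files into groups whose total size stays within max_bytes.

    A file that is larger than max_bytes on its own still gets a group
    (containing only that file), so no input is dropped.
    """
    groups: List[List[str]] = []
    current: List[str] = []
    current_bytes = 0
    # Sort by name so the grouping is deterministic.
    for name in sorted(sizes):
        size = sizes[name]
        # Close the current group if adding this file would exceed the limit.
        if current and current_bytes + size > max_bytes:
            groups.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        groups.append(current)
    return groups


if __name__ == "__main__":
    # Hypothetical batch: one large file plus five 4 KB follow-up files.
    batch = {"part-0": 100 * 1024 * 1024}
    batch.update({f"small-{i}": 4 * 1024 for i in range(5)})
    for group in group_files(batch, max_bytes=16 * 1024):
        print(group)
```

With a 16 KB limit, the five small files end up in two groups instead of five, which is the kind of reduction in per-task overhead the issue asks for. A real implementation would also have to consider data locality, which this sketch ignores.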

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
