hadoop-common-dev mailing list archives

From "Doug Cutting (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-1054) Add more then one input file per map?
Date Thu, 01 Mar 2007 18:36:50 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-1054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12477046 ]

Doug Cutting commented on HADOOP-1054:
--------------------------------------

This is something you can do with a custom InputFormat, with an InputSplit implementation
where each split wraps multiple FileSplit instances.  Note that user-defined split classes
are broken in the 0.11.x releases (HADOOP-933). In trunk (0.12.x), however, splits are now
generated in the client and written to a file that's read by map tasks.  The workaround for
0.11.x releases is to put the split implementation in the lib directory on your JobTracker
and TaskTracker nodes, rather than in the job's jar file.  If this solution is acceptable
then I'll resolve this issue.  Alternatively, someone could try to provide a generic
InputFormat implementation that does this as a resolution.
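For illustration, the grouping behavior requested here — packing many small input files into
one split until a configurable size limit is reached — can be sketched as a greedy pass over
the file sizes. This is plain Java with no Hadoop classes, just the core packing logic; the
names SplitGrouper, groupFiles, and maxSplitBytes are made up for this sketch, not part of
any Hadoop API:

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of greedy grouping: pack file indices into "splits" whose
 * total byte size stays at or under a configurable cap.
 */
public class SplitGrouper {
    /**
     * Returns lists of file indices. A new group is started whenever adding
     * the next file would push the current group past maxSplitBytes.
     * A single file larger than the cap still gets a group of its own.
     */
    public static List<List<Integer>> groupFiles(long[] sizes, long maxSplitBytes) {
        List<List<Integer>> groups = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        long currentBytes = 0;
        for (int i = 0; i < sizes.length; i++) {
            // Flush the current group if this file would overflow it.
            if (!current.isEmpty() && currentBytes + sizes[i] > maxSplitBytes) {
                groups.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(i);
            currentBytes += sizes[i];
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }
}
```

In a real InputFormat each resulting group would become one InputSplit wrapping several
FileSplit instances, so one map task processes several small files instead of one task per file.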

> Add more then one input file per map?
> -------------------------------------
>
>                 Key: HADOOP-1054
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1054
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.11.2
>            Reporter: Johan Oskarson
>            Priority: Trivial
>
> I've got a problem with mapreduce overhead when it comes to small input files.
> Roughly 100 mb comes into the dfs every few hours. Then afterwards data related to that batch might be added on for another few weeks.
> The problem is that this data is roughly 4-5 kbytes per file. So for every reasonably big file we might have 4-5 small ones.
> As far as I understand it each small file will get assigned a task of its own. This causes performance issues since the overhead of such small files is pretty big.
> Would it be possible to have hadoop assign multiple files to a map task up until a configurable limit?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

