hadoop-common-dev mailing list archives

From Arkady Borkovsky <ark...@yahoo-inc.com>
Subject Re: [jira] Created: (HADOOP-1054) Add more then one input file per map?
Date Thu, 01 Mar 2007 18:46:47 GMT
The issue described here can probably be solved by specifying the 
appropriate number of map tasks and supplying custom input splits.
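
To make that concrete, the core of the custom-split approach is just 
bin-packing the input files into groups up to a byte budget; each group 
then becomes one split whose record reader walks its files in turn.  A 
rough sketch of the grouping step (SplitPacker is a made-up name for 
illustration, and the InputFormat plumbing around it depends on the 
mapred API version):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Illustrative helper, not part of Hadoop: pack input files into
    // groups of at most maxBytes each, so that many small files share
    // one map task instead of getting one task apiece.
    public class SplitPacker {
      public static List pack(FileSystem fs, Path[] files, long maxBytes)
          throws IOException {
        List groups = new ArrayList();    // each element: a List of Path
        List current = new ArrayList();
        long currentBytes = 0;
        for (int i = 0; i < files.length; i++) {
          long len = fs.getLength(files[i]);
          // Close the current group once adding this file would
          // overflow the byte budget.
          if (!current.isEmpty() && currentBytes + len > maxBytes) {
            groups.add(current);
            current = new ArrayList();
            currentBytes = 0;
          }
          current.add(files[i]);
          currentBytes += len;
        }
        if (!current.isEmpty()) {
          groups.add(current);
        }
        return groups;
      }
    }

Each group would then be wrapped in a custom InputSplit whose record 
reader opens the files one after another.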

However, I'd suggest implementing a tool that supports the following 
operation on DFS files:
Concatenate several DFS files into a single one.
An option would specify whether it is done
-- destructively (the blocks of the files do not change, but are just 
re-linked into a single file), or 
-- non-destructively (the data is copied into a new file, possibly with 
a different block size).  
Applied to a single file, this operation can be used to change the 
block size.
Applied to a whole directory, it can turn the output of a map-reduce 
job into a single file without running another job.
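
The destructive flavor would need support inside the namenode, but the 
non-destructive flavor is plain client-side code.  A minimal sketch, 
assuming the current FileSystem API (listPaths and the create overload 
that takes a block size); the buffer size, replication factor, and 
block size here are just for illustration:

    import java.io.InputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch of the proposed non-destructive concat: copy every file
    // under srcDir into one new DFS file written with the requested
    // block size.
    public class DfsConcat {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path srcDir = new Path(args[0]);
        Path dst = new Path(args[1]);
        long blockSize = 128L * 1024 * 1024;  // block size for the new file

        // 64K copy buffer, replication 3 chosen only as examples.
        FSDataOutputStream out =
            fs.create(dst, true, 64 * 1024, (short) 3, blockSize);
        byte[] buf = new byte[64 * 1024];
        Path[] parts = fs.listPaths(srcDir);
        for (int i = 0; i < parts.length; i++) {
          InputStream in = fs.open(parts[i]);
          int n;
          while ((n = in.read(buf)) > 0) {
            out.write(buf, 0, n);
          }
          in.close();
        }
        out.close();
      }
    }

Applied to a job output directory, this produces the merged result 
directly on DFS.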

The latter is quite a common operation.  I usually do a DFS -getmerge 
followed by a DFS -put.   Quite ugly.
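
For what it's worth, if FileUtil.copyMerge is available in the version 
at hand, the same merge can be done in one pass, streaming straight 
from DFS to DFS without the local round trip (the paths below are made 
up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    public class MergeJobOutput {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Merge all files under the source dir (example path) into one
        // DFS file; 'false' keeps the originals, null adds no separator.
        FileUtil.copyMerge(fs, new Path("/user/me/job-output"),
                           fs, new Path("/user/me/merged"),
                           false, conf, null);
      }
    }

But this still copies all the bytes; only the re-linking variant above 
would avoid that.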

On Mar 1, 2007, at 10:22 AM, Johan Oskarson (JIRA) wrote:

> Add more then one input file per map?
> -------------------------------------
>
>                  Key: HADOOP-1054
>                  URL: https://issues.apache.org/jira/browse/HADOOP-1054
>              Project: Hadoop
>           Issue Type: Improvement
>           Components: mapred
>     Affects Versions: 0.11.2
>             Reporter: Johan Oskarson
>             Priority: Trivial
>
>
> I've got a problem with mapreduce overhead when it comes to small 
> input files.
>
> Roughly 100 MB comes into the DFS every few hours. Afterwards, data 
> related to that batch might be added for another few weeks.
> The problem is that this data is roughly 4-5 kbytes per file, so for 
> every reasonably big file we might have 4-5 small ones.
>
> As far as I understand it, each small file will get assigned a task of 
> its own. This causes performance issues, since the per-task overhead 
> for such small files is pretty big.
>
> Would it be possible to have Hadoop assign multiple files to a map 
> task, up to a configurable limit?
>
> -- 
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>

