crunch-user mailing list archives

From: Josh Wills <jwi...@cloudera.com>
Subject: Re: Small files produced by a map-only job
Date: Thu, 06 Jun 2013 12:04:56 GMT

Hey Chao,

It had dropped off my radar, but I'm happy to throw together a patch to do
it this AM.

J



On Thu, Jun 6, 2013 at 4:06 AM, Chao Shi <stepinto@live.com> wrote:

> Hey guys,
>
> I'm writing MR jobs using Crunch. Crunch optimizes some very simple
> pipelines, e.g. sample or grep, into map-only jobs.
>
> Because the MR framework splits the input data by HDFS block, the map phase
> produces plenty of small files on HDFS, which is unpleasant and makes
> downstream data processing inefficient. When I write raw MR, I typically
> append an identity reducer, which simply merges the outputs from the map tasks.
>
> I think CRUNCH-162 <https://issues.apache.org/jira/browse/CRUNCH-162> is
> related to this. Is there anyone still working on it?
>
> Thanks,
> Chao
>
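
For a rough sense of scale, the small-files effect described in the quoted message can be sketched as plain arithmetic. This is only an illustration and does not use any Hadoop or Crunch API: the class and method names are hypothetical, the input size, block size, and selectivity are made-up values, and it assumes one map task per HDFS block and one output file per task.

```java
// Back-of-the-envelope estimate only; every number here is an illustrative assumption.
final class SmallFileEstimate {

    // Assumes one map task per input split, with split size equal to the HDFS block size.
    static long mapTasks(long inputBytes, long blockBytes) {
        return (inputBytes + blockBytes - 1) / blockBytes; // ceiling division
    }

    // A map-only job typically writes one output file per map task.
    static long mapOnlyOutputFiles(long inputBytes, long blockBytes) {
        return mapTasks(inputBytes, blockBytes);
    }

    // With a reduce phase, each reducer writes one output file.
    static long outputFilesWithReducers(int numReducers) {
        return numReducers;
    }

    // Average output file size, rounded down to whole bytes.
    static long averageFileBytes(long outputBytes, long files) {
        return outputBytes / files;
    }

    public static void main(String[] args) {
        long mib = 1024L * 1024L;
        long input = 10L * 1024L * mib; // assumed 10 GiB of input
        long block = 128L * mib;        // assumed 128 MiB block size
        long output = input / 100;      // assumed grep-like filter keeping about 1%

        long mapOnly = mapOnlyOutputFiles(input, block);  // 80 files
        long withReducer = outputFilesWithReducers(1);    // 1 file

        System.out.println("map-only files: " + mapOnly
                + ", avg bytes each: " + averageFileBytes(output, mapOnly));
        System.out.println("identity-reducer files: " + withReducer
                + ", avg bytes each: " + averageFileBytes(output, withReducer));
    }
}
```

Under these assumptions the same filtered data ends up in 80 files of a little over 1 MiB each without a reduce phase, versus a single file of roughly 100 MiB with one identity reducer, at the cost of an extra shuffle.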



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
