crunch-user mailing list archives

From Josh Wills
Subject Re: Small files produced by a map-only job
Date Thu, 06 Jun 2013 12:04:56 GMT
Hey Chao,

It had dropped off my radar, but I'm happy to throw together a patch to do
it this AM.


On Thu, Jun 6, 2013 at 4:06 AM, Chao Shi wrote:

> Hey guys,
> I'm writing MR jobs using Crunch. Crunch optimizes some very simple
> pipelines, e.g. sample or grep, into map-only jobs.
> Because the MR framework splits the input data by HDFS block, the map phase
> produces plenty of small files on HDFS, which is unpleasant and makes
> downstream data processing inefficient. When I write raw MR, I typically
> append an identity reducer to the job, which simply merges the map outputs.
> I think CRUNCH-162 is related to this. Is anyone still working on it?
> Thanks,
> Chao
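
For readers of the archive, here is a rough, framework-free sketch of the effect Chao describes. It is not Crunch or Hadoop code: the function names (`map_only_job`, `with_identity_reduce`), the CRC32 partitioner, and the toy input are all assumptions made for illustration. It only models why a map-only job leaves one output file per input split, while adding a pass-through (identity) reduce step caps the file count at the number of reducers.

```python
import zlib


def map_only_job(splits, map_fn):
    """Model a map-only job: each input split yields its own output 'file'."""
    return [[out for rec in split for out in map_fn(rec)] for split in splits]


def with_identity_reduce(splits, map_fn, num_reducers):
    """Model map + identity reduce: records are partitioned across a fixed
    number of reducers, and each reducer writes its records unchanged."""
    partitions = [[] for _ in range(num_reducers)]
    for split in splits:
        for rec in split:
            for out in map_fn(rec):
                # Deterministic stand-in for a hash partitioner (assumption).
                idx = zlib.crc32(out.encode("utf-8")) % num_reducers
                partitions[idx].append(out)
    return partitions


# Toy input: 8 splits of 4 lines each; every third line contains "ERROR".
splits = [
    [f"line {i} {'ERROR' if i % 3 == 0 else 'ok'}" for i in range(s * 4, s * 4 + 4)]
    for s in range(8)
]


def grep(line):
    """A grep-style map function: keep only lines containing 'ERROR'."""
    return [line] if "ERROR" in line else []


small_files = map_only_job(splits, grep)
merged_files = with_identity_reduce(splits, grep, num_reducers=2)

print("map-only output files:", len(small_files))
print("with identity reduce: ", len(merged_files))
```

Both variants keep exactly the same records. The only difference is how many output files they are spread over, and the cost of the second variant is an extra shuffle.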

Director of Data Science
Cloudera
Twitter: @josh_wills
