crunch-user mailing list archives

From Chao Shi
Subject Small files produced by a map-only job
Date Thu, 06 Jun 2013 11:06:24 GMT
Hey guys,

I'm writing MR jobs using Crunch. Crunch optimizes some very simple
pipelines, e.g. sample or grep, into map-only jobs.

Because the MR framework splits the input data by HDFS block and each map
task writes its own output file, a map-only job produces plenty of small
files on HDFS, which is unpleasant and makes downstream data processing
inefficient. When I write raw MR, I typically follow the map with an
identity reducer, which simply merges the outputs from the map tasks.
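
To put rough numbers on this, here is a back-of-the-envelope sketch in plain
Java (no Hadoop or Crunch calls). The 10 GB input, 128 MB block size, and 1%
sample rate are made-up figures for illustration, and it assumes one input
split, and therefore one map output file, per HDFS block.

```java
public class SmallFilesEstimate {

    // A map-only job writes one output file per map task. Assuming one
    // input split per HDFS block, that is ceil(inputBytes / blockBytes).
    static long mapOnlyOutputFiles(long inputBytes, long blockBytes) {
        return (inputBytes + blockBytes - 1) / blockBytes; // ceiling division
    }

    // Average size of each output file when the output is spread evenly.
    static long averageFileBytes(long outputBytes, long fileCount) {
        return outputBytes / fileCount;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long inputBytes = 10L * 1024L * mb; // hypothetical 10 GB input
        long blockBytes = 128L * mb;        // hypothetical HDFS block size
        long outputBytes = inputBytes / 100; // hypothetical 1% sample

        // Map-only: one small file per map task.
        long mapOnlyFiles = mapOnlyOutputFiles(inputBytes, blockBytes);
        System.out.println("map-only files: " + mapOnlyFiles
                + ", avg bytes each: " + averageFileBytes(outputBytes, mapOnlyFiles));

        // With an identity reducer, the file count equals the number of
        // reduce tasks instead (a single reducer is assumed here).
        long reducers = 1;
        System.out.println("with identity reducer files: " + reducers
                + ", avg bytes each: " + averageFileBytes(outputBytes, reducers));
    }
}
```

With these assumed figures the map-only job leaves 80 files of a bit over
1 MB each, while a single identity reducer leaves one file of roughly 100 MB,
at the cost of an extra shuffle.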

I think CRUNCH-162 is related to this. Is anyone still working on it?

