From: Chao Shi
Date: Thu, 6 Jun 2013 19:06:24 +0800
Subject: Small files produced by a map-only job
To: crunch-user@apache.org

Hey guys,

I'm writing MR jobs using Crunch. Crunch optimizes some very simple pipelines, e.g. sample or grep, into map-only jobs.

Because the MR framework splits the input data by HDFS block, the map phase produces plenty of small files on HDFS, which is unpleasant and makes subsequent data processing inefficient. When I write raw MR, I typically append an identity reducer, which simply merges the outputs from the map phase.

I think CRUNCH-162 is related to this. Is anyone still working on it?

Thanks,
Chao
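
To make the file-count issue described above concrete, here is a minimal sketch in plain Java. It is a simulation only and does not use the Hadoop or Crunch APIs; the class name `SmallFilesSketch`, the methods `mapOnly` and `identityReduce`, and the sample records are all hypothetical. It assumes one output file per map task and one per reduce task, and routes records with the same hash-modulo rule as Hadoop's default hash partitioner.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SmallFilesSketch {

    // Map-only job (e.g. grep): one map task per input split, and each task
    // writes its own output file, so output files == input splits.
    static List<List<String>> mapOnly(List<List<String>> splits, String needle) {
        List<List<String>> files = new ArrayList<>();
        for (List<String> split : splits) {
            List<String> out = new ArrayList<>();
            for (String record : split) {
                if (record.contains(needle)) {
                    out.add(record);
                }
            }
            files.add(out); // possibly tiny or even empty
        }
        return files;
    }

    // Identity reduce: records pass through unchanged but are routed to one
    // of numReducers partitions, so output files == numReducers.
    static List<List<String>> identityReduce(List<List<String>> mapOutputs, int numReducers) {
        List<List<String>> files = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            files.add(new ArrayList<>());
        }
        for (List<String> mapOutput : mapOutputs) {
            for (String record : mapOutput) {
                // Same rule as hash partitioning: non-negative hash modulo reducer count.
                int partition = (record.hashCode() & Integer.MAX_VALUE) % numReducers;
                files.get(partition).add(record);
            }
        }
        return files;
    }

    public static void main(String[] args) {
        // Four hypothetical input splits standing in for four HDFS blocks.
        List<List<String>> splits = Arrays.asList(
            Arrays.asList("error a", "ok b"),
            Arrays.asList("ok c"),
            Arrays.asList("error d", "error e"),
            Arrays.asList("ok f", "error g"));

        List<List<String>> mapFiles = mapOnly(splits, "error");
        List<List<String>> merged = identityReduce(mapFiles, 1);

        System.out.println("map-only output files: " + mapFiles.size());
        System.out.println("files after identity reduce: " + merged.size());
    }
}
```

The trade-off in a real job is that the extra reduce phase adds a shuffle of the whole map output, in exchange for a number of output files that is set by the reducer count rather than by the number of input blocks.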