From: Josh Wills
Date: Thu, 6 Jun 2013 05:04:56 -0700
Subject: Re: Small files produced by a map-only job
To: user@crunch.apache.org

Hey Chao,

It had dropped off my radar, but I'm happy to throw together a patch to do it this AM.

J

On Thu, Jun 6, 2013 at 4:06 AM, Chao Shi wrote:

> Hey guys,
>
> I'm writing MR jobs using Crunch. Crunch optimizes some very simple
> pipelines into map-only jobs, e.g. sample or grep.
>
> As the MR framework splits the input data by HDFS block, the map phase will
> produce plenty of small files on HDFS, which is unpleasant and makes the
> subsequent data processing inefficient. When I write raw MR, I typically
> append an identity reducer to the job, which simply merges the outputs from
> the map phase.
>
> I think CRUNCH-162 is related to this. Is anyone still working on it?
>
> Thanks,
> Chao

--
Director of Data Science
Cloudera
Twitter: @josh_wills