From: "Josh Wills (JIRA)"
To: crunch-dev@incubator.apache.org
Reply-To: dev@crunch.apache.org
Date: Thu, 21 Mar 2013 05:35:19 +0000 (UTC)
Subject: [jira] [Commented] (CRUNCH-165) Pipelines should automatically use CombineFileInputFormat where input consists of many small files

    [ https://issues.apache.org/jira/browse/CRUNCH-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13608654#comment-13608654 ]

Josh Wills commented on CRUNCH-165:
-----------------------------------

Joe-- thanks for this. Is the right approach here to implement CombineFileInputFormats for text, sequence files, and Avro, and to use them when the input warrants it?
I don't see a way to make this totally generalizable w/o requiring people to rewrite their input formats if they want to use them with Crunch, which seems like a bad idea.

> Pipelines should automatically use CombineFileInputFormat where input consists of many small files
> --------------------------------------------------------------------------------------------------
>
>                 Key: CRUNCH-165
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-165
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.4.0
>            Reporter: Dave Beech
>            Assignee: Josh Wills
>         Attachments: CRUNCH-165.patch
>
>
> Hive had a feature introduced in HIVE-74 whereby CombineFileInputFormat would be used if the input data consisted of many small files, making the resulting mapreduce jobs more efficient by giving individual mappers more data to process. This would be a nice feature for Crunch to have, too.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
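
To make "use them in the case that the input warrants it" concrete, here is a minimal, library-free sketch of the two decisions involved: whether to combine at all, and how small files might be grouped so each mapper gets more data. Everything in it is an assumption for illustration: the class name SmallFileSplitPlanner, the "average file smaller than half a block" threshold, and the greedy in-order packing are hypothetical and are not taken from the CRUNCH-165 patch, from HIVE-74, or from Hadoop. Hadoop's real CombineFileInputFormat also takes node and rack locality into account, which this sketch ignores.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of Crunch, Hive, or Hadoop. */
final class SmallFileSplitPlanner {

    private SmallFileSplitPlanner() {}

    /**
     * Assumed heuristic: combining is worthwhile when there is more than one
     * file and the average file size is below half a block.
     */
    static boolean shouldCombine(List<Long> fileSizes, long blockSize) {
        if (fileSizes.size() < 2) {
            return false;
        }
        long total = 0;
        for (long size : fileSizes) {
            total += size;
        }
        return total / fileSizes.size() < blockSize / 2;
    }

    /**
     * Greedily groups files, in input order, into splits of at most
     * maxSplitSize bytes. A file larger than maxSplitSize ends up in a
     * split of its own.
     */
    static List<List<Long>> packSplits(List<Long> fileSizes, long maxSplitSize) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            // Close the current split if this file would push it over the cap.
            if (!current.isEmpty() && currentBytes + size > maxSplitSize) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }
}
```

With five 40-byte files and a 100-byte cap, packSplits returns three groups, so the job would launch three mappers rather than five. The part this sketch leaves open is the one raised in the comment above: reading a combined split still needs a per-format record reader, which is why separate variants for text, sequence files, and Avro come up.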