From: "Doug Cutting (JIRA)"
To: hadoop-dev@lucene.apache.org
Reply-To: hadoop-dev@lucene.apache.org
Date: Fri, 31 Aug 2007 12:03:18 -0700 (PDT)
Message-ID: <20606859.1188586998565.JavaMail.jira@brutus>
Subject: [jira] Created: (HADOOP-1824) want InputFormat for zip files

want InputFormat for zip files
------------------------------

                 Key: HADOOP-1824
                 URL: https://issues.apache.org/jira/browse/HADOOP-1824
             Project: Hadoop
          Issue Type: New Feature
          Components: mapred
            Reporter: Doug Cutting

HDFS is inefficient with large numbers of small files.
Thus one might pack many small files into large, compressed archives. But for efficient map-reduce operation, it is desirable to be able to split inputs into smaller chunks, with one or more of the original small files per split. The zip format, unlike tar, permits enumeration of the files in an archive without scanning the entire archive. Thus a zip InputFormat could efficiently divide large archives into splits that each contain one or more archived files.
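
To illustrate the idea, here is a minimal sketch of how splits might be planned from a zip archive's entry listing. It is not Hadoop code and not a proposed InputFormat implementation: it uses Python's standard zipfile module as a stand-in, and the names plan_splits, read_split, and target_bytes are hypothetical, chosen only for this example. The grouping policy (fill each split up to a compressed-size budget) is likewise an assumption; the issue does not specify one.

```python
import zipfile
from typing import Iterator, List, Tuple


def plan_splits(archive, target_bytes: int) -> List[List[str]]:
    """Group archive members into splits of at most target_bytes
    compressed bytes each (a single larger member gets its own split)."""
    splits: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    with zipfile.ZipFile(archive) as zf:
        # infolist() is built from the archive's central directory,
        # so no member data is read or decompressed here.
        for info in zf.infolist():
            if info.is_dir():
                continue
            # Start a new split when adding this member would exceed the budget.
            if current and current_size + info.compress_size > target_bytes:
                splits.append(current)
                current, current_size = [], 0
            current.append(info.filename)
            current_size += info.compress_size
    if current:
        splits.append(current)
    return splits


def read_split(archive, names: List[str]) -> Iterator[Tuple[str, bytes]]:
    """Yield (member name, contents) for the members of one split,
    as a record reader for that split might."""
    with zipfile.ZipFile(archive) as zf:
        for name in names:
            yield name, zf.read(name)
```

Each list returned by plan_splits could be handed to a separate map task, which would then call something like read_split to open only the members assigned to it. A tar archive has no such listing, so the same planning step would require scanning the whole file.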