From: Harsh J
Date: Mon, 23 Apr 2012 16:38:59 +0530
Subject: Re: Reading data output by MapFileOutputFormat
To: common-user@hadoop.apache.org, safdar.kureishy@gmail.com

Ali,

MapFiles are explained at
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/MapFile.html
Please give it a read; it should answer half your questions.

In short, a MapFile is two files: a raw SequenceFile that holds the data, and an index file built on top of it.

The reason MR does not provide a MapFileInputFormat is that you don't need the index file in MR jobs (input-driven jobs do no lookups). Hence SequenceFileInputFormat suffices to read the data: it ignores the index file and reads only the sequence file that carries the data.

If you wish to use MapFile's index for lookups etc., use the MapFile.Reader class directly in your implementation.

On Mon, Apr 23, 2012 at 4:23 PM, Ali Safdar Kureishy wrote:
> Hi,
>
> If I use a *MapFileOutputFormat* to output some data, I see that each
> reducer's output is a folder ("part-00000", for example), and inside that
> folder are two files: "data" and "index".
>
> However, there is no corresponding MapFileInputFormat, to read back this
> folder ("part-00000"). Instead, *SequenceFileInputFormat* seems to read the
> data. So, I have some questions:
> - does SequenceFileInputFormat actually read *all* the data that was output
> by MapFileOutputFormat? Or is some relationship data between the data and
> index files lost in this process that would have been better handled by
> another InputFormat class? In other words, is SequenceFileInputFormat the
> right InputFormat to read data written by MapFileOutputFormat?
> - how is it that SequenceFileInputFormat works to read outputs from
> *both* MapFileOutputFormat and SequenceFileOutputFormat? That would
> imply that MapFileOutputFormat and SequenceFileOutputFormat output the
> same data, OR that SequenceFileInputFormat internally handles both
> differently. What is the reality?
>
> Thanks,
> Safdar

--
Harsh J
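
To make the data/index split described in the reply concrete, here is a toy in-memory model in plain Java. It is only a sketch and not Hadoop's implementation: the class MapFileSketch, its methods, and the index interval of 2 are all hypothetical names and values chosen for illustration. It shows why a sequential reader can ignore the index entirely, while a keyed lookup uses the index to jump close to the record before scanning.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Toy model of the MapFile idea: sorted records ("data") plus a sparse
 * index over them ("index"). Hypothetical sketch, NOT Hadoop's code.
 */
final class MapFileSketch {
    // Assumed value for illustration: one index entry per 2 records.
    static final int INDEX_INTERVAL = 2;

    // Plays the role of the "data" file: (key, value) records in key order.
    private final List<String[]> data = new ArrayList<>();
    // Plays the role of the "index" file: every Nth key and its position.
    private final List<String> indexKeys = new ArrayList<>();
    private final List<Integer> indexPositions = new ArrayList<>();

    /** Appends a record; keys must arrive in sorted order. */
    void append(String key, String value) {
        if (!data.isEmpty() && data.get(data.size() - 1)[0].compareTo(key) > 0) {
            throw new IllegalArgumentException("key out of order: " + key);
        }
        if (data.size() % INDEX_INTERVAL == 0) {
            indexKeys.add(key);
            indexPositions.add(data.size());
        }
        data.add(new String[] {key, value});
    }

    /** Sequential read: touches only the data, never the index. */
    List<String> scanValues() {
        List<String> values = new ArrayList<>();
        for (String[] record : data) {
            values.add(record[1]);
        }
        return values;
    }

    /** Keyed lookup: binary-search the index, then scan forward in the data. */
    String get(String key) {
        int lo = 0, hi = indexKeys.size() - 1, start = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (indexKeys.get(mid).compareTo(key) <= 0) {
                start = indexPositions.get(mid); // last indexed key <= target so far
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        if (start < 0) {
            return null; // target sorts before the first record
        }
        for (int i = start; i < data.size(); i++) {
            int cmp = data.get(i)[0].compareTo(key);
            if (cmp == 0) return data.get(i)[1];
            if (cmp > 0) break; // passed the slot where the key would be
        }
        return null;
    }

    int indexSize() {
        return indexKeys.size();
    }

    public static void main(String[] args) {
        MapFileSketch file = new MapFileSketch();
        file.append("apple", "1");
        file.append("banana", "2");
        file.append("cherry", "3");
        System.out.println("scan:   " + file.scanValues());
        System.out.println("lookup: " + file.get("banana"));
    }
}
```

In this model, scanValues() corresponds to what an input-driven MR job does through SequenceFileInputFormat, and get() corresponds to the kind of lookup you would do through MapFile.Reader in your own code.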