From: Mahadev Konar
Date: Fri, 26 Jun 2009 11:01:24 -0700
Subject: Re: Doing MapReduce over Har files
Reply-To: core-user@hadoop.apache.org

Hi Roshan and Julian,

The har file system can be used as an input filesystem. You can just provide
the input to map reduce as har:///something/some.har, where some.har is your
har archive. This way map reduce will use the har filesystem as its input.
The only limitation is that a map cannot run across logical files in the har.

You can specify whatever input format these files had before you included
them in the har archive. The point is that har:/// can be used as an input
filesystem for map reduce, which gives map reduce a view of the logical files
inside the har.

Hope this helps.
mahadev

On 6/26/09 2:37 AM, "jchernandez" wrote:

> I also need help with this. I need to know how to handle a HAR file when it
> is the input to a MapReduce task. How do we read the HAR file so we can work
> on the individual logical files? I suppose we need to create our own
> InputFormat and RecordReader files, but I'm not sure how to proceed.
>
> Julian
>
> Roshan James-3 wrote:
>>
>> When I run map reduce task over a har file as the input, I see that the
>> input splits refer to 64mb byte boundaries inside the part file.
>>
>> My mappers only know how to process the contents of each logical file
>> inside the har file. Is there some way by which I can take the offset range
>> specified by the input split and determine which logical files lie in that
>> offset range? (How else would one do map reduce over a har file?)
>>
>> Roshan
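
As a rough illustration of the har:/// form described in the reply, here is a minimal sketch that builds such an input URI with only the Java standard library. The class HarInputPaths and its method harUri are hypothetical names, not part of Hadoop, and the sketch assumes the archive sits on the default underlying file system and that the paths contain only URI-safe characters.

```java
import java.net.URI;

/** Hypothetical helper (not part of Hadoop) that builds har:/// input URIs. */
final class HarInputPaths {
    private HarInputPaths() {}

    /**
     * Builds a har:/// URI for a whole archive, or for one logical file in it.
     * The empty authority (the "///") is assumed to mean "use the default
     * underlying file system". Pass null as fileInArchive for the whole archive.
     */
    static URI harUri(String archivePath, String fileInArchive) {
        if (!archivePath.startsWith("/") || !archivePath.endsWith(".har")) {
            throw new IllegalArgumentException(
                "expected an absolute path ending in .har: " + archivePath);
        }
        String path = (fileInArchive == null || fileInArchive.isEmpty())
            ? archivePath
            : archivePath + "/" + fileInArchive;
        // Assumption: path has no characters that need URI escaping.
        return URI.create("har://" + path);
    }

    public static void main(String[] args) {
        // The archive path is the placeholder used in the reply.
        System.out.println(harUri("/something/some.har", null));
    }
}
```

In a job driver, the resulting string would be given as the job's input path in place of a plain HDFS path, for example through FileInputFormat.setInputPaths in org.apache.hadoop.mapred, with the input format left as whatever the files used before they were archived. That wiring is not shown here because it needs Hadoop on the classpath.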