hadoop-common-user mailing list archives

From Mahadev Konar <maha...@yahoo-inc.com>
Subject Re: Doing MapReduce over Har files
Date Fri, 26 Jun 2009 18:01:24 GMT
Hi Roshan and Julian,
  The har file system can be used as an input filesystem. Just give map
reduce an input path of the form har:///something/some.har, where some.har
is your har archive. Map reduce will then read its input through the har
filesystem. The only limitation is that a single map cannot run across
logical files in the har.
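
To illustrate that limitation, here is a minimal sketch of per-file
splitting. It is a simplified model and not Hadoop's actual split code: the
class HarSplitSketch, the method splitsFor, the file names, and the sizes
are all made up for the example. It only shows the idea that splits are cut
inside each logical file, so no split ever spans two logical files.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of per-file splitting. Not Hadoop code.
public class HarSplitSketch {

    /** One split: a byte range inside exactly one logical file. */
    static final class Split {
        final String file;
        final long start;
        final long length;

        Split(String file, long start, long length) {
            this.file = file;
            this.start = start;
            this.length = length;
        }

        @Override
        public String toString() {
            return file + " [" + start + ", " + (start + length) + ")";
        }
    }

    /**
     * Cuts every logical file into ranges of at most splitSize bytes.
     * Offsets restart at 0 for each file, so a split never crosses a
     * file boundary.
     */
    static List<Split> splitsFor(Map<String, Long> fileLengths, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<Split> splits = new ArrayList<>();
        for (Map.Entry<String, Long> entry : fileLengths.entrySet()) {
            long remaining = entry.getValue();
            long offset = 0;
            while (remaining > 0) {
                long length = Math.min(splitSize, remaining);
                splits.add(new Split(entry.getKey(), offset, length));
                offset += length;
                remaining -= length;
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        // Assumed logical files inside a har, with lengths in bytes.
        Map<String, Long> files = new LinkedHashMap<>();
        files.put("logs/a.txt", 150L);
        files.put("logs/b.txt", 40L);

        // A tiny split size keeps the example readable.
        for (Split split : splitsFor(files, 64L)) {
            System.out.println(split);
        }
    }
}
```

In this model a small logical file still gets its own split, and therefore
its own map, even when it is far smaller than the split size.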

You can specify whatever input format these files had before you added
them to the har archive. The point is that har:/// can be used as an input
filesystem for map reduce, which gives map reduce a view of the logical
files inside the har.

Hope this helps.

On 6/26/09 2:37 AM, "jchernandez" <jchernandez@agnitio.es> wrote:

> I also need help with this. I need to know how to handle a HAR file when it
> is the input to a MapReduce task. How do we read the HAR file so we can work
> on the individual logical files? I suppose we need to create our own
> InputFormat and RecordReader files, but I'm not sure how to proceed.
> Julian 
> Roshan James-3 wrote:
>> When I run a map reduce task over a har file as the input, I see that
>> the input splits refer to 64 MB boundaries inside the part file.
>> My mappers only know how to process the contents of each logical file
>> inside the har file. Is there some way by which I can take the offset
>> range specified by the input split and determine which logical files lie
>> in that offset range? (How else would one do map reduce over a har file?)
>> Roshan
