hadoop-common-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: re-reading
Date Wed, 08 Jun 2011 16:48:14 GMT
Mark,

Treat the InputSplit as a metadata holder: use it only to obtain the
path, offset and length of the data to read. The RecordReader returned
by your InputFormat would ideally wrap two RecordReaders, each
instantiated from that same InputSplit metadata. Java passes object
references, so handing one split to several readers does not duplicate
it in memory. The InputSplit object serves no purpose beyond that, and
there should be no need to clone/copy it -- just extract the
information you require from the FileSplit.
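
As a rough illustration of opening two readers from one split's
metadata, here is a minimal, Hadoop-free sketch in plain Java.
SplitInfo and RangeLineReader are hypothetical stand-ins for FileSplit
and a RecordReader (they are not Hadoop classes). The sketch assumes
ASCII text and a split that starts on a line boundary, which a real
line reader would have to handle itself.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class TwoPassSplitDemo {

    /** Hypothetical stand-in for the metadata a FileSplit carries. */
    static final class SplitInfo {
        final Path path;
        final long start;
        final long length;

        SplitInfo(Path path, long start, long length) {
            this.path = path;
            this.start = start;
            this.length = length;
        }
    }

    /** Hypothetical stand-in for a RecordReader: owns its own file handle and position. */
    static final class RangeLineReader implements AutoCloseable {
        private final RandomAccessFile file;
        private final long end;

        RangeLineReader(SplitInfo split) throws IOException {
            this.file = new RandomAccessFile(split.path.toFile(), "r");
            this.file.seek(split.start);          // assumes start is a line boundary
            this.end = split.start + split.length;
        }

        /** Returns the next line, or null once the split's byte range is exhausted. */
        String next() throws IOException {
            if (file.getFilePointer() >= end) {
                return null;
            }
            return file.readLine();               // byte-to-char mapping; fine for ASCII
        }

        @Override
        public void close() throws IOException {
            file.close();
        }
    }

    /** Reads the same split twice, using one independent reader per pass. */
    static List<List<String>> readTwice(SplitInfo split) {
        List<List<String>> passes = new ArrayList<>();
        try (RangeLineReader first = new RangeLineReader(split);
             RangeLineReader second = new RangeLineReader(split)) {
            passes.add(drain(first));   // first read
            passes.add(drain(second));  // second read starts from the split's beginning
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return passes;
    }

    private static List<String> drain(RangeLineReader reader) throws IOException {
        List<String> records = new ArrayList<>();
        for (String line = reader.next(); line != null; line = reader.next()) {
            records.add(line);
        }
        return records;
    }

    /** Demo helper: writes ASCII content to a temporary file. */
    static Path writeTemp(String content) {
        try {
            Path p = Files.createTempFile("split-demo", ".txt");
            Files.write(p, content.getBytes(StandardCharsets.US_ASCII));
            p.toFile().deleteOnExit();
            return p;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path p = writeTemp("a\nb\nc\nd\n");
        // Bytes 2..5 of the file hold the lines "b" and "c".
        System.out.println(readTwice(new SplitInfo(p, 2, 4)));  // prints [[b, c], [b, c]]
    }
}
```

In an actual job the same shape would presumably live inside the
RecordReader your InputFormat returns, with the two inner readers built
from the FileSplit's getPath(), getStart() and getLength().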

On Wed, Jun 8, 2011 at 10:08 PM, Mark question <markq2011@gmail.com> wrote:
> I have a question though for Harsh case... I wrote my custom inputFormat
> which will create an array of recordReaders and give them to the MapRunner.
>
> Will that mean multiple copies of the inputSplit are all in memory? or will
> there be one copy pointed by all of them .. as if they were pointers ?
>
> Thanks,
> Mark
>
> On Wed, Jun 8, 2011 at 9:13 AM, Mark question <markq2011@gmail.com> wrote:
>
>> Thanks for the replies, but input doesn't have 'clone' I don't know why ...
>> so I'll have to write my custom inputFormat ... I was hoping for an easier
>> way though.
>>
>> Thank you,
>> Mark
>>
>>
>> On Wed, Jun 8, 2011 at 1:58 AM, Harsh J <harsh@cloudera.com> wrote:
>>
>>> Or if that does not work for any reason (haven't tried it really), try
>>> writing your own InputFormat wrapper where in you can have direct
>>> access to the InputSplit object to do what you want to (open two
>>> record readers, and manage them separately).
>>>
>>> On Wed, Jun 8, 2011 at 1:48 PM, Stefan Wienert <stefan@wienert.cc> wrote:
>>> > Try input.clone()...
>>> >
>>> > 2011/6/8 Mark question <markq2011@gmail.com>:
>>> >> Hi,
>>> >>
>>> >>   I'm trying to read the inputSplit over and over using following
>>> >> function in MapperRunner:
>>> >>
>>> >> @Override
>>> >>    public void run(RecordReader input, OutputCollector output, Reporter
>>> >> reporter) throws IOException {
>>> >>
>>> >>   RecordReader copyInput = input;
>>> >>
>>> >>  //First read
>>> >>   while(input.next(key,value));
>>> >>
>>> >>  //Second read
>>> >>  while(copyInput.next(key,value));
>>> >>   }
>>> >>
>>> >> It can clearly be seen that this won't work because both RecordReaders
>>> >> are actually the same. I'm trying to find a way for the second reader to
>>> >> start reading the split again from beginning ... How can I do that?
>>> >>
>>> >> Thanks,
>>> >> Mark
>>> >>
>>> >
>>> >
>>> >
>>> > --
>>> > Stefan Wienert
>>> >
>>> > http://www.wienert.cc
>>> > stefan@wienert.cc
>>> >
>>> > Telefon: +495251-2026838
>>> > Mobil: +49176-40170270
>>> >
>>>
>>>
>>>
>>> --
>>> Harsh J
>>>
>>
>>
>



-- 
Harsh J
