hadoop-mapreduce-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: How does mapper process partial records?
Date Fri, 25 Jan 2013 08:50:14 GMT
I don't quite get what you mean; we don't have such a flaw. The task for
the first split makes sure to read one extra record past its end, even
if the split's last byte is a newline. The subsequent splits (that is,
those with a non-zero start offset) always ignore their first record,
even if it is complete within their given range, since the preceding
split's task has already read it.

You may read the implementation by following the sources I've linked
here: http://search-hadoop.com/m/veN7E1gWbij/linereader&subj=Re+DFS+and+the+RecordReader
from similar questions asked in the past.
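
The rule described above can be simulated in a few lines. This is only a
minimal sketch in Python, not Hadoop's actual reader code: the names
read_split and _next_line_start are hypothetical, records are assumed to
be newline-terminated, and the whole file is held in memory as bytes.
Every split skips its first line unless it starts at offset 0, and keeps
reading as long as a line starts at or before the split's end offset.

```python
def _next_line_start(data, pos):
    """Return the offset just past the next newline at or after pos."""
    idx = data.find(b"\n", pos)
    return len(data) if idx == -1 else idx + 1


def read_split(data, start, length):
    """Return the records one split's reader would emit (illustrative only)."""
    end = start + length
    pos = start
    if start != 0:
        # Not the first split: drop everything up to and including the
        # first newline, even if that line is complete within this split.
        pos = _next_line_start(data, pos)
    records = []
    # "<=" means a line that starts exactly at the split's end offset is
    # still read here; the next split will skip it as its first line.
    while pos <= end and pos < len(data):
        nxt = _next_line_start(data, pos)
        records.append(data[pos:nxt].rstrip(b"\n"))
        pos = nxt
    return records


if __name__ == "__main__":
    data = b"aa\nbb\ncc\ndd\n"
    # Boundary at offset 6 falls exactly on the start of "cc".
    print(read_split(data, 0, 6))
    print(read_split(data, 6, 6))
```

With the boundary at offset 6, the record "cc" is complete inside the
second split, yet it is the first split that emits it and the second
split that skips it, so no record is lost or read twice.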

On Fri, Jan 25, 2013 at 6:07 AM, Praveen Sripati
<praveensripati@gmail.com> wrote:
> Harsh,
>
> Thanks for the response.
>
> From http://wiki.apache.org/hadoop/HadoopMapReduce
>
>> For example TextInputFormat will read the last line of the FileSplit past
>> the split boundary and when reading other than the first FileSplit,
>> TextInputFormat ignores the content up to the first newline.
>
> When the first record in the splits other than the first split is complete
> and not spanning boundaries, then based on the above logic this particular
> record is not processed by the mapper.
>
>
> Thanks,
> Praveen
>
> Cloudera Certified Developer for Apache Hadoop CDH4 (95%)
> http://www.thecloudavenue.com/
> http://stackoverflow.com/users/614157/praveen-sripati
>
> If you aren’t taking advantage of big data, then you don’t have big data,
> you have just a pile of data.
>
>
> On Fri, Jan 25, 2013 at 12:52 AM, Harsh J <harsh@cloudera.com> wrote:
>>
>> Hi Praveen,
>>
>> This is explained at http://wiki.apache.org/hadoop/HadoopMapReduce
>> [Map section].
>>
>> On Thu, Jan 24, 2013 at 10:20 PM, Praveen Sripati
>> <praveensripati@gmail.com> wrote:
>> > Hi,
>> >
>> > HDFS splits a file into blocks without regard to record boundaries. So,
>> > how does the mapper processing the second block (b2) determine that the
>> > first record is incomplete and that it should start processing from the
>> > second record in the block (b2)?
>> >
>> > Thanks,
>> > Praveen
>>
>>
>>
>> --
>> Harsh J
>
>



-- 
Harsh J
