hadoop-mapreduce-issues mailing list archives

From "Gelesh (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-4519) In TextInputFormat, while specifying textinputformat.record.delimiter the character/character sequences in data file similar to starting character/starting character sequence in delimiter were found missing in certain cases in the Map Output
Date Mon, 06 Aug 2012 11:15:02 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13429082#comment-13429082 ]

Gelesh commented on MAPREDUCE-4519:
-----------------------------------

I have found a similar bug, along with a fix, in MAPREDUCE-4512. Please refer to the patch
there and kindly incorporate it.
While fixing it I also encountered this scenario. I think it occurs at the end of the buffer,
which holds 4096 characters.
My understanding is that when a delimiter spans the end of one buffer and the beginning of
the next, the delimiter indexes are not handled properly.
This results in one bug or another.

I tried solving it, but the fix introduced some new bugs. Once all the scenarios are
identified, we can work out a reliable fix.
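
To make the buffer-boundary case concrete, here is a minimal, self-contained sketch. It is
not Hadoop's LineReader and not the MAPREDUCE-4512 patch; the class name DelimiterSplitSketch
and the method readRecords are hypothetical names used only for illustration. It assumes the
intended behaviour is that the bytes of the current record are carried over each buffer
refill, so a delimiter (or a look-alike prefix such as "<") that straddles two reads loses
no data. A small buffer size stands in for the 4096-character buffer so that the boundary is
hit often.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: not Hadoop's LineReader and not the MAPREDUCE-4512 patch. */
final class DelimiterSplitSketch {

    /** Splits a stream into records on a multi-byte delimiter, reading bufferSize bytes at a time. */
    static List<String> readRecords(InputStream in, String delimiter, int bufferSize)
            throws IOException {
        if (delimiter.isEmpty() || bufferSize < 1) {
            throw new IllegalArgumentException("need a non-empty delimiter and bufferSize >= 1");
        }
        byte[] delim = delimiter.getBytes(StandardCharsets.UTF_8);
        byte[] buffer = new byte[bufferSize];
        byte[] record = new byte[64]; // bytes of the record in progress; survives buffer refills
        int recordLen = 0;
        List<String> records = new ArrayList<>();

        int read;
        while ((read = in.read(buffer)) != -1) {
            for (int i = 0; i < read; i++) {
                if (recordLen == record.length) {
                    record = Arrays.copyOf(record, record.length * 2);
                }
                record[recordLen++] = buffer[i];
                // The check looks at the accumulated record, not at the current buffer,
                // so a delimiter that started in the previous buffer is still matched.
                if (endsWith(record, recordLen, delim)) {
                    int valueLen = recordLen - delim.length;
                    records.add(new String(record, 0, valueLen, StandardCharsets.UTF_8));
                    recordLen = 0;
                }
            }
        }
        if (recordLen > 0) { // trailing data with no closing delimiter
            records.add(new String(record, 0, recordLen, StandardCharsets.UTF_8));
        }
        return records;
    }

    /** Returns true if the first len bytes of data end with suffix. */
    private static boolean endsWith(byte[] data, int len, byte[] suffix) {
        if (len < suffix.length) {
            return false;
        }
        int start = len - suffix.length;
        for (int j = 0; j < suffix.length; j++) {
            if (data[start + j] != suffix[j]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        String input = "<entity><id>1</id><name>User1</name></entity>"
                + "<entity><id>2</id><name>User2</name></entity>";
        byte[] bytes = input.getBytes(StandardCharsets.UTF_8);
        // A 4-byte buffer forces delimiters and "<" prefixes to straddle refills.
        for (String value : readRecords(new ByteArrayInputStream(bytes), "</entity>", 4)) {
            System.out.println(value);
        }
    }
}
```

Checking the tail of the accumulated record after every byte is slower than a state-machine
match, but it keeps the sketch short and makes it easy to see that characters resembling the
start of the delimiter stay in the record unless the full delimiter follows.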
                
> In TextInputFormat, while specifying textinputformat.record.delimiter the character/character
> sequences in data file similar to starting character/starting character sequence in delimiter
> were found missing in certain cases in the Map Output
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4519
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4519
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 0.20.2
>         Environment: Linux- Ubuntu 10.04
>            Reporter: Arun A K
>              Labels: hadoop, mapreduce, textinputformat, textinputformat.record.delimiter
>             Fix For: 0.20.2
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Set textinputformat.record.delimiter to "</entity>".
> Suppose the input is a text file with the following content:
> <entity><id>1</id><name>User1</name></entity><entity><id>2</id><name>User2</name></entity><entity><id>3</id><name>User3</name></entity><entity><id>4</id><name>User4</name></entity><entity><id>5</id><name>User5</name></entity>
> The Mapper was expected to get the following values:
> Value 1 - <entity><id>1</id><name>User1</name>
> Value 2 - <entity><id>2</id><name>User2</name>
> Value 3 - <entity><id>3</id><name>User3</name>
> Value 4 - <entity><id>4</id><name>User4</name>
> Value 5 - <entity><id>5</id><name>User5</name>
> Because of this bug, the Mapper instead gets values such as:
> Value 1 - entity><id>1</id><name>User1</name>
> Value 2 - <entity>id>2</id><name>User2</name>
> Value 3 - <entity><id>3id><name>User3</name>
> Value 4 - <entity><id>4</id><name>User4name>
> Value 5 - <entity><id>5</id><name>User5</name>
> The pattern shown above need not occur for values 1, 2 and 3 specifically. The bug occurs
> at random positions in the map input.
>  


        
