hadoop-mapreduce-issues mailing list archives

From "Ahmed Radwan (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-2254) Allow setting of end-of-record delimiter for TextInputFormat
Date Thu, 13 Jan 2011 19:51:49 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981447#action_12981447 ]

Ahmed Radwan commented on MAPREDUCE-2254:
-----------------------------------------

Hi Todd. I agree that the changes can go directly into the LineReader. My motive was to keep
the LineReader mostly unchanged, in case it is used in other contexts. The LineReader breaks
the input stream on newlines, which is fine and does exactly what its name
suggests. This is why I thought of encapsulating the changes within the RecordReader (where,
conceptually, these changes are required). However, I see your point that this looks a little
odd. I can move the changes to the LineReader, but then its name will no longer convey its functionality,
and renaming it could cause other problems. What do you think?


> Allow setting of end-of-record delimiter for TextInputFormat
> ------------------------------------------------------------
>
>                 Key: MAPREDUCE-2254
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2254
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>            Reporter: Ahmed Radwan
>         Attachments: MAPREDUCE-2245.patch
>
>
> It would be useful to allow setting the end-of-record delimiter for TextInputFormat. The
> current implementation hardcodes '\n', '\r' or '\r\n' as the only possible record delimiters.
> This is a problem if users have embedded newlines in their data fields (which is fairly common).
> It is also a problem for other tools that use TextInputFormat (see, for example, https://issues.apache.org/jira/browse/PIG-836
> and https://issues.cloudera.org/browse/SQOOP-136).
> I have written a patch to address this issue. The patch allows users to specify any custom
> end-of-record delimiter using a newly added configuration property. For backward compatibility,
> if the new configuration property is absent, exactly the same delimiters as before are
> used (i.e., '\n', '\r' or '\r\n').
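
The behaviour described above could be sketched roughly as follows. This is not the code from the attached patch and it does not use any Hadoop classes. The class name RecordSplitter and the method splitRecords are hypothetical, chosen only for illustration. A null (or empty) delimiter stands in for "the configuration property is absent", in which case the sketch falls back to '\n', '\r' or '\r\n'.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Hadoop or of the patch.
final class RecordSplitter {

    /**
     * Splits the input into records.
     * If delimiter is null or empty (standing in for an unset property),
     * records end at "\n", "\r" or "\r\n". Otherwise records end only at
     * the custom delimiter, so embedded newlines stay inside a record.
     */
    static List<String> splitRecords(String input, String delimiter) {
        List<String> records = new ArrayList<>();
        int start = 0; // index where the current record begins
        int i = 0;     // scan position
        while (i < input.length()) {
            int delimLen = delimiterLengthAt(input, i, delimiter);
            if (delimLen > 0) {
                records.add(input.substring(start, i));
                i += delimLen;
                start = i;
            } else {
                i++;
            }
        }
        if (start < input.length()) {
            // Trailing record that is not terminated by a delimiter.
            records.add(input.substring(start));
        }
        return records;
    }

    /** Returns the length of the delimiter starting at pos, or 0 if none starts there. */
    private static int delimiterLengthAt(String input, int pos, String delimiter) {
        if (delimiter != null && !delimiter.isEmpty()) {
            return input.startsWith(delimiter, pos) ? delimiter.length() : 0;
        }
        // Default behaviour: '\n', '\r' or '\r\n'.
        char c = input.charAt(pos);
        if (c == '\n') {
            return 1;
        }
        if (c == '\r') {
            boolean crlf = pos + 1 < input.length() && input.charAt(pos + 1) == '\n';
            return crlf ? 2 : 1;
        }
        return 0;
    }
}
```

With a custom delimiter of "|", the input "a\nb|c\nd|" yields two records that each keep their embedded newline. With a null delimiter, "x\r\ny\rz\nw" yields the four records x, y, z and w, matching the existing behaviour.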

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

