hadoop-common-issues mailing list archives

From "Dieter Plaetinck (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-7760) BytesWritable / SequenceFile yields dummy linefeed at end as soon as content has one or more linefeeds.
Date Fri, 21 Oct 2011 09:32:32 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-7760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13132547#comment-13132547 ]

Dieter Plaetinck commented on HADOOP-7760:

Whoops.  You're right. My mistake.  Not a bug.  This can be closed.
That said, I find it counterintuitive that ByteArrayInputStream needs to be told explicitly
to stop at the last element in the array, and by default goes one element too far.  Any thoughts
on that?
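
On the question above, a minimal sketch of what is probably going on. As far as I know, ByteArrayInputStream(byte[]) reads the entire array it is handed, and BytesWritable.getBytes() returns the backing buffer, of which only the first getLength() bytes are valid. The class name, the buffer contents, and the amount of spare capacity below are made up for illustration, and plain arrays stand in for BytesWritable so the example runs without Hadoop.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;

public class BackingBufferDemo {
    // Reads every remaining byte from the stream into a new array.
    static byte[] drain(ByteArrayInputStream in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) {
            out.write(b);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Hypothetical backing buffer: 4 valid bytes, then 2 bytes of unused capacity.
        byte[] backing = {'a', '\n', 'b', '\n', 0, 0};
        int validLength = 4; // stands in for BytesWritable.getLength()

        // Without bounds the stream covers the whole array, spare capacity included.
        byte[] all = drain(new ByteArrayInputStream(backing));

        // With (offset, length) the stream stops after the valid bytes.
        byte[] valid = drain(new ByteArrayInputStream(backing, 0, validLength));

        System.out.println("unbounded read: " + all.length + " bytes");
        System.out.println("bounded read:   " + valid.length + " bytes");
    }
}
```

If this matches the test program, the stream does not overrun the array. The array is simply longer than the value, so passing 0 and getLength() as the offset and length should be enough.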

> BytesWritable / SequenceFile yields dummy linefeed at end as soon as content has one or more linefeeds.
> -------------------------------------------------------------------------------------------------------
>                 Key: HADOOP-7760
>                 URL: https://issues.apache.org/jira/browse/HADOOP-7760
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: record
>    Affects Versions: 0.20.2
>         Environment: Easily reproducible on a Debian Linux cluster but also on my Arch Linux desktop.
> I am aware there are some newer releases in the 0.20 series, but all changelog and release note links for those at http://hadoop.apache.org/common/releases.html are broken, so I can't check if this has been fixed and/or whether it's safe to upgrade.
>            Reporter: Dieter Plaetinck
>            Priority: Minor
>   Original Estimate: 2h
>  Remaining Estimate: 2h
> I create SequenceFiles which have BytesWritable as values.
> I notice that if I store content that contains no linefeeds ("\n") or a single linefeed in the value, the value can be read back out of the SequenceFile properly.
> However, as soon as I store input that contains two or more linefeeds (which is pretty much always the case), one *extra* linefeed appears at the end of the value after writing it to the SequenceFile and reading it back, a linefeed which did not exist in the input.
> So this effectively corrupts my data, although I could write a hacky workaround for it.
> I have written a program that demonstrates the behavior by writing 2 SequenceFiles:
> one that has a record whose value contains one linefeed.
> another that has a record whose value contains two linefeeds.
> Upon reading, the latter value will contain 3 linefeeds.
> Test file is : http://pastie.org/2728797

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

