Mailing-List: contact common-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: common-issues@hadoop.apache.org
Date: Thu, 20 Oct 2011 09:44:10 +0000 (UTC)
From: "Dieter Plaetinck (Commented) (JIRA)" <jira@apache.org>
To: common-issues@hadoop.apache.org
Message-ID: 
 <250201519.14791.1319103850713.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: 
 <1209004013.14750.1319103250988.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (HADOOP-7760) BytesWritable / SequenceFile
 yields dummy linefeed at end as soon as content has one or more linefeeds.
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/HADOOP-7760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13131489#comment-13131489 ] 

Dieter Plaetinck commented on HADOOP-7760:
------------------------------------------

Almost forgot, here is the output of a run of the test program:

$ java SequenceFileTest
== testing entry with one newline char ==
-> writing sequencefile with 1 record, which is a value with 1 newlines
11/10/20 11:13:07 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
11/10/20 11:13:07 INFO compress.CodecPool: Got brand-new compressor
-> reading all sequencefile entries..
11/10/20 11:13:07 INFO compress.CodecPool: Got brand-new decompressor
--> reading a record
--> key: 1
--> value read line: 
== testing entry with two newline chars ==
-> writing sequencefile with 1 record, which is a value with 2 newlines
-> reading all sequencefile entries..
--> reading a record
--> key: 1
--> value read line: 
--> value read line: 
--> value read line: 
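
Since the output above prints nothing visible after "value read line:", it cannot distinguish an empty line from a line made of unprintable bytes. One way to narrow this down is to count the linefeed bytes directly. The sketch below uses only the Java standard library; the class LinefeedCheck, its methods, and the zero-padded array are hypothetical stand-ins and are not part of the test program or of Hadoop. It assumes (this is not confirmed anywhere in this report) that the value is read through BytesWritable.getBytes(), whose result is documented as valid only between 0 and getLength() - 1 and can be longer than that.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class LinefeedCheck {
    /** Counts '\n' bytes in buf[0..validLength), ignoring anything after it. */
    static int countLinefeeds(byte[] buf, int validLength) {
        int n = 0;
        for (int i = 0; i < validLength; i++) {
            if (buf[i] == '\n') {
                n++;
            }
        }
        return n;
    }

    /** Reads lines from buf[0..validLength) only. */
    static List<String> readLines(byte[] buf, int validLength) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(buf, 0, validLength), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for a value buffer: 4 valid bytes plus 2 zero bytes of padding.
        byte[] payload = "a\nb\n".getBytes(StandardCharsets.UTF_8);
        byte[] backing = new byte[payload.length + 2];
        System.arraycopy(payload, 0, backing, 0, payload.length);

        System.out.println("linefeeds in valid range: " + countLinefeeds(backing, payload.length));
        System.out.println("lines from valid range:   " + readLines(backing, payload.length).size());
        System.out.println("lines from whole array:   " + readLines(backing, backing.length).size());
    }
}
```

With this stand-in data the valid range holds 2 linefeeds and 2 lines, while reading the whole array yields 3 lines, the third consisting only of NUL characters. If the real test shows the same pattern when the read is limited to getLength() bytes, the extra "line" would be padding rather than a linefeed added by SequenceFile.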

                
> BytesWritable / SequenceFile yields dummy linefeed at end as soon as content has one or more linefeeds.
> -------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-7760
>                 URL: https://issues.apache.org/jira/browse/HADOOP-7760
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: record
>    Affects Versions: 0.20.2
>         Environment: Easily reproducible on a Debian Linux cluster and also on my Arch Linux desktop.
> I am aware there are some newer releases in the 0.20 series, but all changelog and release-note links for those at http://hadoop.apache.org/common/releases.html are broken, so I can't check whether this has been fixed or whether it's safe to upgrade.
>            Reporter: Dieter Plaetinck
>            Priority: Minor
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> I create SequenceFiles which have BytesWritable as values.
> I notice that if I store content containing no linefeeds ("\n") or one linefeed in the value, the value is read back out of the SequenceFile correctly.
> However, as soon as I store input containing two or more linefeeds (which is almost always the case), writing the data to the SequenceFile and reading it back yields one *extra* linefeed at the end of the value, a linefeed that did not exist in the input.
> This effectively corrupts my data, although I could write a hacky workaround for it.
> I have written a program that demonstrates the behavior by writing two SequenceFiles:
> one with a record whose value contains one linefeed;
> another with a record whose value contains two linefeeds.
> Upon reading, the latter value contains three linefeeds.
> Test file: http://pastie.org/2728797
