Date: Thu, 20 Oct 2011 09:44:10 +0000 (UTC)
From: "Dieter Plaetinck (Commented) (JIRA)"
To: common-issues@hadoop.apache.org
Reply-To: common-issues@hadoop.apache.org
Message-ID: <250201519.14791.1319103850713.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <1209004013.14750.1319103250988.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (HADOOP-7760) BytesWritable / SequenceFile yields dummy linefeed at end as soon as content has one or more linefeeds.
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394

    [ https://issues.apache.org/jira/browse/HADOOP-7760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13131489#comment-13131489 ]

Dieter Plaetinck commented on HADOOP-7760:
------------------------------------------

Almost forgot, here is the output of a run of the test program:

$ java SequenceFileTest
== testing entry with one newline char ==
-> writing sequencefile with 1 record, which is a value with 1 newlines
11/10/20 11:13:07 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
11/10/20 11:13:07 INFO compress.CodecPool: Got brand-new compressor
-> reading all sequencefile entries..
11/10/20 11:13:07 INFO compress.CodecPool: Got brand-new decompressor
--> reading a record
--> key: 1
--> value read line:
== testing entry with two newline chars ==
-> writing sequencefile with 1 record, which is a value with 2 newlines
-> reading all sequencefile entries..
--> reading a record
--> key: 1
--> value read line:
--> value read line:
--> value read line:

> BytesWritable / SequenceFile yields dummy linefeed at end as soon as content has one or more linefeeds.
> -------------------------------------------------------------------------------------------------------
>
> Key: HADOOP-7760
> URL: https://issues.apache.org/jira/browse/HADOOP-7760
> Project: Hadoop Common
> Issue Type: Bug
> Components: record
> Affects Versions: 0.20.2
> Environment: Easily reproducible on Debian Linux cluster but also on my Arch Linux desktop.
> I am aware there are some newer releases in the 0.20 series, but all changelogs and release note links for those @ http://hadoop.apache.org/common/releases.html are broken, so I can't check if this has been fixed and/or whether it's safe to upgrade.
> Reporter: Dieter Plaetinck
> Priority: Minor
> Original Estimate: 2h
> Remaining Estimate: 2h
>
> I create SequenceFiles which have BytesWritable as values.
> I notice that if I store content which contains no linefeeds ("\n") or one linefeed in the value, the value can be read back out of the sequencefile properly.
> However, as soon as I store input which contains two or more linefeeds (which is actually pretty much always the case), during the process of writing to the sequencefile and reading my data back, one *extra* linefeed is yielded at the end of the value, a linefeed which did not exist in the input.
> So this effectively corrupts my data, although I could write a hacky workaround for it.
> I have written a program that demonstrates the behavior by showing what happens when writing 2 sequencefiles:
> one that has a record whose value contains one linefeed.
> another that has a record whose value contains two linefeeds.
> Upon reading, the latter value will contain 3 linefeeds.
> Test file is : http://pastie.org/2728797

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
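
One possible explanation for the output above, not confirmed anywhere in this message: BytesWritable.getBytes() returns the backing array, and only the first getLength() bytes of it are valid data. If the test program reads lines from the whole array, any zero padding after the data would show up as one extra "line". The sketch below does not use Hadoop at all. PaddedBufferDemo, padded() and countLines() are hypothetical names, and the length * 3 / 2 growth rule is an assumption made for illustration. It only shows that such padding would give 1 line for a one-linefeed value and 3 lines for a two-linefeed value, which matches the counts in the run above.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical stand-alone demo; it does not use any Hadoop classes.
final class PaddedBufferDemo {

    // Assumed growth rule: the backing array is sized to length * 3 / 2,
    // so bytes past the valid data are zero padding.
    static byte[] padded(byte[] data) {
        return Arrays.copyOf(data, Math.max(data.length, data.length * 3 / 2));
    }

    // Counts the lines a BufferedReader yields from the first `length` bytes of buf.
    static int countLines(byte[] buf, int length) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(buf, 0, length), StandardCharsets.US_ASCII))) {
            int lines = 0;
            while (reader.readLine() != null) {
                lines++;
            }
            return lines;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] one = "\n".getBytes(StandardCharsets.US_ASCII);
        byte[] two = "\n\n".getBytes(StandardCharsets.US_ASCII);

        // Reading the whole backing array versus only the valid range.
        System.out.println("one linefeed:  whole array=" + countLines(padded(one), padded(one).length)
                + ", valid range=" + countLines(padded(one), one.length));
        System.out.println("two linefeeds: whole array=" + countLines(padded(two), padded(two).length)
                + ", valid range=" + countLines(padded(two), two.length));
    }
}
```

If this is what happens in the real test program, bounding the read by getLength() (for example new ByteArrayInputStream(value.getBytes(), 0, value.getLength())) should make the extra line disappear. Whether the test file at the pastie link actually reads the full array is not something this message shows.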