Message-ID: <515898871.1233363659603.JavaMail.jira@brutus>
Date: Fri, 30 Jan 2009 17:00:59 -0800 (PST)
From: "Santhosh Srinivasan (JIRA)"
To: pig-dev@hadoop.apache.org
Reply-To: pig-dev@hadoop.apache.org
Subject: [jira] Updated: (PIG-560) UTFDataFormatException (encoded string too long) is thrown when storing strings > 65536 bytes (in UTF8 form) using BinStorage()
In-Reply-To: <384465971.1229023124452.JavaMail.jira@brutus>
Mailing-List: contact pig-dev-help@hadoop.apache.org; run by ezmlm

     [ https://issues.apache.org/jira/browse/PIG-560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Santhosh Srinivasan updated PIG-560:
------------------------------------

    Attachment: PIG-560_1.patch

Incorporated comments from Laukik. Submitting a new patch and a modified test case. Running the tests now.

> UTFDataFormatException (encoded string too long) is thrown when storing strings > 65536 bytes (in UTF8 form) using BinStorage()
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-560
>                 URL: https://issues.apache.org/jira/browse/PIG-560
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: types_branch
>            Reporter: Pradeep Kamath
>             Fix For: types_branch
>
>         Attachments: PIG-560.patch, PIG-560_1.patch, utf-limit-patch.diff
>
>
> BinStorage() uses the DataOutput.writeUTF() and DataInput.readUTF() Java APIs to write out Strings as UTF-8 bytes and to read them back. From the Javadoc: "First, the total number of bytes needed to represent all the characters of s is calculated. If this number is larger than 65535, then a UTFDataFormatException is thrown." (This is because the writeUTF() API uses 2 bytes to represent the number of bytes.) A way around this would be to avoid writeUTF()/readUTF() and instead convert the string by hand to the corresponding UTF-8 byte[] (using String.getBytes("UTF-8")) and then write the length of the byte array as an int. This would allow a size of up to 2^31 - 1 bytes (the largest value of a signed Java int).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
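
For reference, the workaround described in the issue (a 4-byte length prefix followed by the raw UTF-8 bytes, in place of writeUTF()/readUTF()) can be sketched roughly as below. This is only an illustration under assumptions, not the code in PIG-560_1.patch: the class name LongStringIO and the methods writeString()/readString() are hypothetical, and it uses StandardCharsets.UTF_8 (Java 7+) rather than getBytes("UTF-8").

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not taken from the Pig source or the patch.
final class LongStringIO {

    // Write the string as a 4-byte signed length followed by its UTF-8 bytes.
    // Unlike DataOutput.writeUTF(), this is not capped at 65535 encoded bytes.
    static void writeString(DataOutput out, String s) throws IOException {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(utf8.length);
        out.write(utf8);
    }

    // Read back a string written by writeString().
    static String readString(DataInput in) throws IOException {
        int length = in.readInt();
        if (length < 0) {
            throw new IOException("Negative string length: " + length);
        }
        byte[] utf8 = new byte[length];
        in.readFully(utf8); // blocks until exactly 'length' bytes are read
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Build a string whose UTF-8 form exceeds the 65535-byte writeUTF() limit.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 70000; i++) {
            sb.append('a');
        }
        String big = sb.toString();

        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        writeString(new DataOutputStream(buffer), big);

        DataInputStream in =
                new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()));
        System.out.println("round trip ok: " + big.equals(readString(in)));
    }
}
```

One thing to keep in mind with this approach: writeUTF() emits Java's "modified UTF-8", while getBytes() produces standard UTF-8, so data written by one scheme cannot be read back by the other.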