hadoop-pig-dev mailing list archives

From "Laukik Chitnis (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-560) UTFDataFormatException (encoded string too long) is thrown when storing strings > 65536 bytes (in UTF8 form) using BinStorage()
Date Fri, 30 Jan 2009 01:07:59 GMT

    [ https://issues.apache.org/jira/browse/PIG-560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668688#action_12668688 ]

Laukik Chitnis commented on PIG-560:
------------------------------------

The writeUTF() method was adding 2 bytes per string; we would actually be adding an int (32
bits) with this solution.

The new long string would then need to be a new DataType, right? To keep it transparent
to the user, this DataType can be used only internally. Also, to keep things efficient, maybe
we can write the string as this datatype only when we get the encoded-string-too-long
UTFDataFormatException.
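
A minimal sketch of that fallback, assuming a one-byte tag in front of each string tells the reader which encoding follows. The class name and the SHORT_STRING and LONG_STRING tag values are hypothetical and are not Pig's actual DataType constants. The string is first written with writeUTF() into a scratch buffer, so nothing reaches the real output if the exception is thrown.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class FallbackStringCodec {
    // Hypothetical tags, chosen only for this illustration.
    static final byte SHORT_STRING = 1;
    static final byte LONG_STRING = 2;

    static void writeString(DataOutput out, String s) throws IOException {
        ByteArrayOutputStream scratch = new ByteArrayOutputStream();
        try {
            // Common case: 2-byte length prefix written by writeUTF().
            new DataOutputStream(scratch).writeUTF(s);
        } catch (UTFDataFormatException tooLong) {
            // Fallback: 4-byte length prefix followed by plain UTF-8 bytes.
            byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
            out.writeByte(LONG_STRING);
            out.writeInt(bytes.length);
            out.write(bytes);
            return;
        }
        out.writeByte(SHORT_STRING);
        out.write(scratch.toByteArray());
    }

    static String readString(DataInput in) throws IOException {
        byte tag = in.readByte();
        if (tag == SHORT_STRING) {
            return in.readUTF();
        }
        if (tag == LONG_STRING) {
            byte[] bytes = new byte[in.readInt()];
            in.readFully(bytes);
            return new String(bytes, StandardCharsets.UTF_8);
        }
        throw new IOException("Unknown string tag: " + tag);
    }

    // Convenience wrappers for trying the codec on in-memory data.
    static byte[] encode(String s) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            writeString(new DataOutputStream(bytes), s);
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static String decode(byte[] data) {
        try {
            return readString(new DataInputStream(new ByteArrayInputStream(data)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The scratch buffer costs an extra copy per string, so this only shows the control flow. If the on-disk format already carries a type byte per field, the tag could presumably reuse it instead of adding a new byte.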

By the way, though it seems quite likely that the average length of a string would
be far less than 64k, do we have any statistics on the average length of (UTF-converted) CHARARRAYs?
This would also help us determine how big an overhead the additional 16 bits actually
is.
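
As a rough way to put a number on that, the sketch below computes the average UTF-8 length of a sample of strings and the relative growth caused by 2 extra prefix bytes, measured against the current record size (2-byte prefix plus payload). The class and method names are hypothetical, and the sample would have to come from real CHARARRAY data.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

final class PrefixOverhead {
    // Average number of bytes per string after standard UTF-8 encoding.
    static double averageUtf8Length(List<String> samples) {
        long total = 0;
        for (String s : samples) {
            total += s.getBytes(StandardCharsets.UTF_8).length;
        }
        return samples.isEmpty() ? 0.0 : (double) total / samples.size();
    }

    // Fraction by which a record grows when the length prefix goes
    // from 2 bytes to 4 bytes: 2 extra bytes over (2 + payload) bytes.
    static double relativeOverhead(double averageUtf8Bytes) {
        return 2.0 / (2.0 + averageUtf8Bytes);
    }
}
```

For an average payload of 18 bytes, for example, relativeOverhead(18) is 2 / 20, i.e. records grow by 10%. The longer the typical string, the smaller that fraction becomes.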

> UTFDataFormatException (encoded string too long) is thrown when storing strings > 65536 bytes (in UTF8 form) using BinStorage()
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-560
>                 URL: https://issues.apache.org/jira/browse/PIG-560
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: types_branch
>            Reporter: Pradeep Kamath
>             Fix For: types_branch
>
>         Attachments: utf-limit-patch.diff
>
>
> BinStorage() uses the Java DataOutput.writeUTF() and DataInput.readUTF() APIs to write out
Strings as UTF-8 bytes and to read them back. From the Javadoc: "First, the total number
of bytes needed to represent all the characters of s is calculated. If this number is larger
than 65535, then a UTFDataFormatException is thrown." (This is because writeUTF() uses 2
bytes to represent the number of bytes.) A way around this would be to avoid writeUTF()/readUTF()
and instead convert the string by hand to the corresponding UTF-8 byte[] (using String.getBytes("UTF-8"))
and then write the length of the byte array as an int. This allows a size of up to 2^31 - 1
bytes, since a Java int is signed.
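
The workaround described above could look roughly like the following. This is a sketch under assumptions, not the contents of utf-limit-patch.diff: the class and method names are hypothetical, and it uses StandardCharsets.UTF_8 in place of the "UTF-8" charset name.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class LengthPrefixedUtf8 {
    // Write a 4-byte length followed by the UTF-8 bytes of the string.
    static void writeString(DataOutput out, String s) throws IOException {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    // Read back a string written by writeString().
    static String readString(DataInput in) throws IOException {
        int length = in.readInt();
        if (length < 0) {
            throw new IOException("Negative string length: " + length);
        }
        byte[] bytes = new byte[length];
        in.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Round-trip a string through an in-memory buffer.
    static String roundTrip(String s) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            writeString(new DataOutputStream(buffer), s);
            byte[] data = buffer.toByteArray();
            return readString(new DataInputStream(new ByteArrayInputStream(data)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 70000; i++) {
            sb.append('a');
        }
        String big = sb.toString(); // 70000 bytes in UTF-8, too long for writeUTF()
        System.out.println(roundTrip(big).equals(big));
    }
}
```

Note that writeUTF() emits Java's modified UTF-8 rather than standard UTF-8, so data written this way cannot be read back with readUTF(), and existing files written with writeUTF() would need separate handling.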

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

