hadoop-common-dev mailing list archives

From David Bowen <dbo...@yahoo-inc.com>
Subject Re: [jira] Commented: (HADOOP-1162) Record IO: serializing a byte buffer to CSV fails if buffer contains bytes less than 16.
Date Wed, 28 Mar 2007 00:38:02 GMT

> Oh, i misunderstood your question. I am switching buffer serialization to just plain
> bytes except for 5 characters that are escaped (essentially similar to string serialization
> as if the string were iso-8859-1.)

I'm not sure I follow.  I think string serialization implies UTF-8
encoding?  That means bytes in the range 128-255 would take 2 bytes.  If
we assume that in a byte buffer, all byte values are equally probable,
then the average space for CSV serialization, per byte, would be 1.5
bytes, or 12 bits.  Right?  Actually a little more because you escape
those 5 characters too.
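
As a rough check of that estimate, here is a minimal Python sketch. It
treats each byte value as one ISO-8859-1 character and measures its UTF-8
length. The function name is hypothetical, and the escaping of the 5
special characters is ignored, so the result is a lower bound and not the
actual Record IO behaviour.

```python
def utf8_bytes_per_input_byte() -> float:
    """Average UTF-8 length when each byte value 0-255 is treated as one
    ISO-8859-1 character (hypothetical helper; escaping is ignored)."""
    total = sum(
        len(bytes([b]).decode("latin-1").encode("utf-8")) for b in range(256)
    )
    return total / 256


# Values 0-127 take 1 byte each, values 128-255 take 2 bytes each.
print(utf8_bytes_per_input_byte())  # 1.5
```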

So why not use base64 encoding?  The expansion factor would be less,
since it essentially uses 8 bits to represent 6.  Also, it omits control
characters which I think would be a problem with what you're suggesting
- we need the CSV files to be human readable, so I think you'd have to
escape them too.
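
To illustrate the base64 point, the sketch below encodes every byte value
once using Python's standard base64 module and reports the expansion
factor. It also checks that the output contains only printable ASCII,
with no control characters. The sample input is an assumption chosen for
illustration, not data from Record IO.

```python
import base64

data = bytes(range(256))  # every byte value exactly once
encoded = base64.b64encode(data)

# Every 3 input bytes become 4 output characters (padded at the end).
expansion = len(encoded) / len(data)

# True if every output character is printable, non-space ASCII.
printable = all(0x20 < c < 0x7F for c in encoded)

print(len(encoded), round(expansion, 3), printable)  # 344 1.344 True
```

The standard base64 alphabet also contains no comma, so the encoded
field would not need further escaping for that delimiter.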

Or else just leave the encoding as two hex digits per byte. 
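
For comparison, a minimal sketch of the two-hex-digits-per-byte option,
again in Python. The sample bytes are an assumption, picked to include
values below 16, which the subject line suggests are the delicate case.
Each byte has to be zero-padded to exactly two digits for the encoding
to round-trip.

```python
data = bytes([0x00, 0x0F, 0x10, 0xFF])  # includes values below 16

# bytes.hex() always emits two lowercase hex digits per byte.
encoded = data.hex()
print(encoded)  # 000f10ff

# The expansion factor is exactly 2, and decoding restores the input.
decoded = bytes.fromhex(encoded)
print(len(encoded) / len(data), decoded == data)  # 2.0 True
```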
