hadoop-general mailing list archives

From Scott Carey <sc...@richrelevance.com>
Subject Re: Hadoop and XML
Date Tue, 20 Jul 2010 18:29:08 GMT

On Jul 20, 2010, at 11:24 AM, Scott Carey wrote:

>> This sounds like a bug.
>> Let's say you create a Text object and drop in a String that sets the byte array
>> length to 200. Then drop in a second String that sets the byte array length to 500. Since
>> the new length is greater than the previous length, the byte array length is reset to the
>> longer length. Now, if you drop in a third String that would set the byte array length to
>> 350, the Text object does not replace the byte array with a new one of length 350; it reuses
>> the longer array of length 500 and sets an extra variable to track the "real" length.
>> So: Text.getBytes().length != Text.getLength()
>> This does 2 things:
>> 1. Passes around more data than what is needed
>> 2. Makes the Text object confusing to work with
>> Text.getBytes().length == Text.getLength() - should be the correct behavior.
> I don't think so. Passing around byte arrays larger than the valid data is common practice
> in Java for performance reasons; hence the common method signatures taking (byte[] bytes,
> int offset, int len) and similar. Creating a new byte array for each resize defeats the
> purpose of re-using the byte array and the Text object -- lower memory allocation and improved
> CPU cache locality. The byte array here is a buffer; it does not represent the entire string.

To be more specific here, shouldn't Text.toString() do the trick? If Text.toString() doesn't
work, or does something other than what you expect here, that should be documented and the
class should have another helper method that gets you a String from a Text. Calling getBytes()
and manually constructing a string means you need to know what those bytes represent -- a buffer
in which the bytes for the string run from index 0 up to (but not including) Text.getLength().
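
To make the buffer-versus-length point concrete, here is a minimal sketch. It does not use
Hadoop at all: ReusableUtf8Buffer is a hypothetical stand-in that only mimics the behaviour
described above (a byte array that grows but is never shrunk, plus a separate valid length),
and I am assuming UTF-8 as the encoding. It is not Text's actual implementation.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a Text-like object: a reusable byte buffer
// plus a separate count of how many bytes are currently valid.
final class ReusableUtf8Buffer {
    private byte[] bytes = new byte[0];
    private int length = 0;

    // Store the UTF-8 encoding of s, allocating a new array only when
    // the existing one is too small. A shorter value reuses the old array.
    void set(String s) {
        byte[] encoded = s.getBytes(StandardCharsets.UTF_8);
        if (encoded.length > bytes.length) {
            bytes = new byte[encoded.length];
        }
        System.arraycopy(encoded, 0, bytes, 0, encoded.length);
        length = encoded.length;
    }

    // Returns the backing buffer, which may be longer than the valid data.
    byte[] getBytes() {
        return bytes;
    }

    // Number of valid bytes, counted from index 0.
    int getLength() {
        return length;
    }

    // Decodes only the valid range [0, length).
    @Override
    public String toString() {
        return new String(bytes, 0, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ReusableUtf8Buffer buf = new ReusableUtf8Buffer();
        buf.set("a much longer first value");
        buf.set("short");

        // Wrong: decodes the whole buffer, including stale bytes from the first value.
        System.out.println(new String(buf.getBytes(), StandardCharsets.UTF_8));

        // Right: decode only the valid range, or simply call toString().
        System.out.println(new String(buf.getBytes(), 0, buf.getLength(), StandardCharsets.UTF_8));
        System.out.println(buf.toString());
    }
}
```

If you do go through getBytes() with a real Text, the same rule applies: always pair the
array with getLength() and pass an explicit offset and length to whatever consumes the bytes.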