hadoop-general mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Scott Carey <sc...@richrelevance.com>
Subject Re: Hadoop and XML
Date Tue, 20 Jul 2010 18:24:14 GMT

> This sounds like a bug.
> Let's say you create a Text object and drop in a String that sets the byte array length
to 200.  Then drop in a a second String that sets the byte array length to 500.  Since, the
new length is greater than the previous length; the byte array length is reset to the longer
length.  Now, if you drop in a third String that would set the byte array length to 350; the
Text object does not replace the byte array with a new length of 350; it utilizes the greater
length of 500 and sets an extra variable to track the "real" length.
> So: Text.getBytes().length != Text.getLength()
> This does 2 things:
> 1. Passes around more data than what is needed
> 2. Makes the Text object confusing to work with
> Text.getBytes().length == Text.getLength() - should be the correct behavior.

I don't think so.  Passing around byte arrays larger than the valid data is common practice
in Java for performance reasons.  Hence, the common method signature containing  (byte[] bytes,
int len, int offset) and similar.   Creating a new byte array for each resize defeats the
purpose of re-using the byte array and the Text object -- lower memory allocation and improved
CPU cache locality.  The byte array here is a buffer, it does not represent the entire string.

View raw message