hadoop-common-user mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: BytesWritable get() returns more bytes then what's stored
Date Thu, 09 Apr 2009 02:18:52 GMT
On Wed, Apr 8, 2009 at 7:14 PM, bzheng <bing.zheng@gmail.com> wrote:

>
> Thanks for the clarification.  Though I still find it strange that the
> get() method doesn't return only what's actually stored, regardless of
> buffer size.  Is there any reason why you'd want to use/examine what's in
> the buffer?
>

Because doing so requires an array copy. Avoiding needless copies of data is
important for Hadoop performance. Most APIs that take byte[] arrays have a
variant that also takes an offset and a length.
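A minimal sketch of the offset/length style described above. The names and
values here are illustrative stand-ins, not Hadoop code: `buffer` plays the
role of BytesWritable's grown internal array, and `length` the count that
getSize() would report.

```java
import java.io.ByteArrayOutputStream;

// Illustrative sketch: consume only the valid prefix of a reused,
// over-allocated buffer without copying it first.
public class NoCopyExample {
    public static void main(String[] args) {
        byte[] buffer = {1, 2, 3, 0, 0, 0}; // grown buffer; only 3 bytes valid
        int length = 3;                     // what getSize() would return

        // The offset/length variant writes just the valid region,
        // with no intermediate Arrays.copyOf.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(buffer, 0, length);
        System.out.println(out.size()); // prints 3
    }
}
```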

-Todd



>
>
> Todd Lipcon-4 wrote:
> >
> > Hi Bing,
> >
> > The issue here is that BytesWritable uses an internal buffer which is
> > grown
> > but not shrunk. The cause of this is that Writables in general are single
> > instances that are shared across multiple input records. If you look at
> > the
> > internals of the input reader, you'll see that a single BytesWritable is
> > instantiated, and then each time a record is read, it's read into that
> > same
> > instance. The purpose here is to avoid the allocation cost for each row.
> >
> > The end result is, as you've seen, that getBytes() returns an array which
> > may be larger than the actual amount of data. In fact, the extra bytes
> > (between .getSize() and .get().length) have undefined contents, not zero.
> >
> > Unfortunately, if the protobuffer API doesn't allow you to deserialize
> > out of a smaller portion of a byte array, you're out of luck and will
> > have to do the copy like you've mentioned. I imagine, though, that
> > there's some way around this in the protobuffer API - perhaps you can
> > use a ByteArrayInputStream here to your advantage.
> >
> > Hope that helps
> > -Todd
> >
> > On Wed, Apr 8, 2009 at 4:59 PM, bzheng <bing.zheng@gmail.com> wrote:
> >
> >>
> >> I tried to store a protocolbuffer as BytesWritable in a sequence file
> >> <Text, BytesWritable>.  It's stored using SequenceFile.Writer(new
> >> Text(key), new BytesWritable(protobuf.convertToBytes())).  When reading
> >> the values from key/value pairs using value.get(), it returns more than
> >> what's stored.  However, value.getSize() returns the correct number.
> >> This means in order to convert the byte[] to a protocol buffer again, I
> >> have to do Arrays.copyOf(value.get(), value.getSize()).  This happens on
> >> both versions 0.17.2 and 0.18.3.  Does anyone know why this happens?
> >> Sample sizes for a few entries in the sequence file are below.  The
> >> extra bytes in value.get() all have values of zero.
> >>
> >> value.getSize(): 7066   value.get().length: 10599
> >> value.getSize(): 36456  value.get().length: 54684
> >> value.getSize(): 32275  value.get().length: 54684
> >> value.getSize(): 40561  value.get().length: 54684
> >> value.getSize(): 16855  value.get().length: 54684
> >> value.getSize(): 66304  value.get().length: 99456
> >> value.getSize(): 26488  value.get().length: 99456
> >> value.getSize(): 59327  value.get().length: 99456
> >> value.getSize(): 36865  value.get().length: 99456
> >>
> >> --
> >> View this message in context:
> >>
> http://www.nabble.com/BytesWritable-get%28%29-returns-more-bytes-then-what%27s-stored-tp22962146p22962146.html
> >> Sent from the Hadoop core-user mailing list archive at Nabble.com.
> >>
> >>
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/BytesWritable-get%28%29-returns-more-bytes-then-what%27s-stored-tp22962146p22963309.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>
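A sketch of the two workarounds discussed in this thread: copying the valid
prefix with Arrays.copyOf, and streaming only the valid region with
ByteArrayInputStream so no copy is needed. `raw` and `size` simulate what
value.get() and value.getSize() return; `MyProto` is a hypothetical
protocol-buffer message type, so its parseFrom calls are left as comments.

```java
import java.io.ByteArrayInputStream;
import java.util.Arrays;

// Illustrative sketch only; raw/size stand in for BytesWritable's
// get()/getSize(), and MyProto is hypothetical.
public class TrimExample {
    public static void main(String[] args) {
        byte[] raw = {10, 20, 30, 0, 0}; // value.get(): grown buffer
        int size = 3;                    // value.getSize(): valid byte count

        // Workaround 1: copy the valid prefix, then parse from it.
        byte[] exact = Arrays.copyOf(raw, size);
        // MyProto msg = MyProto.parseFrom(exact);

        // Workaround 2: no copy; stream only the valid region.
        ByteArrayInputStream in = new ByteArrayInputStream(raw, 0, size);
        // MyProto msg = MyProto.parseFrom(in);

        System.out.println(exact.length + " " + in.available()); // prints 3 3
    }
}
```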
