hadoop-common-user mailing list archives

From Alex Loddengaard <a...@cloudera.com>
Subject Re: BytesWritable get() returns more bytes then what's stored
Date Thu, 09 Apr 2009 03:03:36 GMT
FYI: this (open) JIRA might be interesting to you:

<http://issues.apache.org/jira/browse/HADOOP-3788>

Alex

On Wed, Apr 8, 2009 at 7:18 PM, Todd Lipcon <todd@cloudera.com> wrote:

> On Wed, Apr 8, 2009 at 7:14 PM, bzheng <bing.zheng@gmail.com> wrote:
>
> >
> > Thanks for the clarification.  Though I still find it strange that the
> > get() method doesn't return just what's actually stored, regardless of
> > buffer size.  Is there any reason why you'd want to use/examine what's in
> > the buffer?
> >
>
> Because doing so requires an array copy. It's important for Hadoop
> performance to avoid needless copies of data. Most APIs that take byte[]
> arrays have a version that also takes an offset and length.
>
> -Todd
>
>
>
> >
> >
> > Todd Lipcon-4 wrote:
> > >
> > > Hi Bing,
> > >
> > > The issue here is that BytesWritable uses an internal buffer which is
> > > grown but not shrunk. The cause of this is that Writables in general are
> > > single instances that are shared across multiple input records. If you
> > > look at the internals of the input reader, you'll see that a single
> > > BytesWritable is instantiated, and then each time a record is read, it's
> > > read into that same instance. The purpose here is to avoid the allocation
> > > cost for each row.
> > >
> > > The end result is, as you've seen, that get() returns an array which may
> > > be larger than the actual amount of data. In fact, the extra bytes
> > > (between .getSize() and .get().length) have undefined contents, not
> > > necessarily zero.
> > >
> > > Unfortunately, if the Protocol Buffers API doesn't allow you to
> > > deserialize out of a smaller portion of a byte array, you're out of luck
> > > and will have to do the copy like you've mentioned. I imagine, though,
> > > that there's some way around this in the Protocol Buffers API - perhaps
> > > you can use a ByteArrayInputStream here to your advantage.
> > >
> > > Hope that helps
> > > -Todd
> > >
> > > On Wed, Apr 8, 2009 at 4:59 PM, bzheng <bing.zheng@gmail.com> wrote:
> > >
> > >>
> > >> I tried to store a protocol buffer as a BytesWritable in a sequence
> > >> file <Text, BytesWritable>.  It's stored using SequenceFile.Writer(new
> > >> Text(key), new BytesWritable(protobuf.convertToBytes())).  When reading
> > >> the values from key/value pairs using value.get(), it returns more than
> > >> what's stored.  However, value.getSize() returns the correct number.
> > >> This means that in order to convert the byte[] to a protocol buffer
> > >> again, I have to do Arrays.copyOf(value.get(), value.getSize()).  This
> > >> happens on both version 0.17.2 and 0.18.3.  Does anyone know why this
> > >> happens?  Sample sizes for a few entries in the sequence file are below.
> > >> The extra bytes in value.get() all have values of zero.
> > >>
> > >> value.getSize(): 7066   value.get().length: 10599
> > >> value.getSize(): 36456  value.get().length: 54684
> > >> value.getSize(): 32275  value.get().length: 54684
> > >> value.getSize(): 40561  value.get().length: 54684
> > >> value.getSize(): 16855  value.get().length: 54684
> > >> value.getSize(): 66304  value.get().length: 99456
> > >> value.getSize(): 26488  value.get().length: 99456
> > >> value.getSize(): 59327  value.get().length: 99456
> > >> value.getSize(): 36865  value.get().length: 99456
> > >>
> > >> --
> > >> View this message in context:
> > >> http://www.nabble.com/BytesWritable-get%28%29-returns-more-bytes-then-what%27s-stored-tp22962146p22962146.html
> > >> Sent from the Hadoop core-user mailing list archive at Nabble.com.
> > >>
> > >>
> > >
> > >
> >
> > --
> > View this message in context:
> > http://www.nabble.com/BytesWritable-get%28%29-returns-more-bytes-then-what%27s-stored-tp22962146p22963309.html
> > Sent from the Hadoop core-user mailing list archive at Nabble.com.
> >
> >
>
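
A minimal sketch of the behaviour discussed in the quoted thread, in case it
helps. It does not use Hadoop: ReusableBuffer is a hypothetical stand-in for
BytesWritable, and the 1.5x growth factor is an assumption chosen because it
matches the sizes reported above (7066 valid bytes in a 10599-byte array). It
shows a reused buffer ending up larger than its contents, then the two ways of
consuming it: an (array, offset, length) view with no copy, or an exact copy.

```java
import java.io.ByteArrayInputStream;
import java.util.Arrays;

class BufferReuseSketch {

    /** Hypothetical stand-in for BytesWritable: the array grows, never shrinks. */
    static class ReusableBuffer {
        private byte[] bytes = new byte[0];
        private int size = 0;

        /** Overwrites the contents, growing the backing array only when needed. */
        void set(byte[] data) {
            if (data.length > bytes.length) {
                // Assumed growth policy: 1.5x the requested size.
                bytes = new byte[data.length * 3 / 2];
            }
            System.arraycopy(data, 0, bytes, 0, data.length);
            size = data.length;
        }

        byte[] get() { return bytes; }  // whole backing array, may exceed getSize()
        int getSize() { return size; }  // number of valid bytes
    }

    public static void main(String[] args) {
        ReusableBuffer value = new ReusableBuffer();

        // First record: 8 bytes, so the backing array grows to 12.
        byte[] first = new byte[8];
        Arrays.fill(first, (byte) 7);
        value.set(first);

        // A shorter record reuses the same array; bytes 2..7 are now stale.
        value.set(new byte[] {1, 2});
        System.out.println("getSize() = " + value.getSize()
                + ", get().length = " + value.get().length);

        // Option 1: no copy, expose only the valid range to the consumer.
        ByteArrayInputStream in =
                new ByteArrayInputStream(value.get(), 0, value.getSize());
        System.out.println("stream sees " + in.available() + " bytes");

        // Option 2: copy, for APIs that insist on an exactly sized byte[].
        byte[] exact = Arrays.copyOf(value.get(), value.getSize());
        System.out.println("copy holds " + exact.length + " bytes");
    }
}
```

Any parser that accepts an InputStream, or a byte[] plus offset and length,
can take option 1 and skip the copy; option 2 is the Arrays.copyOf fallback
mentioned in the original question.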
