Mailing-List: contact core-user-help@hadoop.apache.org; run by ezmlm
Reply-To: core-user@hadoop.apache.org
MIME-Version: 1.0
In-Reply-To: <45f85f70904081918j557bba4br2811752f4b8fbee5@mail.gmail.com>
References: <22962146.post@talk.nabble.com>
 <45f85f70904081713w65e67d05l239cda3719b770b@mail.gmail.com>
 <22963309.post@talk.nabble.com>
 <45f85f70904081918j557bba4br2811752f4b8fbee5@mail.gmail.com>
Date: Wed, 8 Apr 2009 20:03:36 -0700
Message-ID: <623d9cf40904082003keee5eas460d80f10a00e4b@mail.gmail.com>
Subject: Re: BytesWritable get() returns more bytes then what's stored
From: Alex Loddengaard
To: core-user@hadoop.apache.org
Content-Type: multipart/alternative; boundary=0016368e1b1183ba8e0467167e4b

--0016368e1b1183ba8e0467167e4b
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit

FYI: this (open) JIRA might be interesting to you:

Alex

On Wed, Apr 8, 2009 at 7:18 PM, Todd Lipcon wrote:

> On Wed, Apr 8, 2009 at 7:14 PM, bzheng wrote:
>
> >
> > Thanks for the clarification. Though I still find it strange why not have
> > the get() method return what's actually stored regardless of buffer size.
> > Is there any reason why you'd want to use/examine what's in the buffer?
> >
>
> Because doing so requires an array copy. It's important for hadoop
> performance to avoid needless copies of data when they're unnecessary. Most
> APIs that take byte[] arrays have a version that includes an offset and
> length.
>
> -Todd
>
> >
> > Todd Lipcon-4 wrote:
> > >
> > > Hi Bing,
> > >
> > > The issue here is that BytesWritable uses an internal buffer which is
> > > grown but not shrunk. The cause of this is that Writables in general are
> > > single instances that are shared across multiple input records. If you
> > > look at the internals of the input reader, you'll see that a single
> > > BytesWritable is instantiated, and then each time a record is read, it's
> > > read into that same instance. The purpose here is to avoid the allocation
> > > cost for each row.
> > >
> > > The end result is, as you've seen, that getBytes() returns an array which
> > > may be larger than the actual amount of data. In fact, the extra bytes
> > > (between .getSize() and .get().length) have undefined contents, not zero.
> > >
> > > Unfortunately, if the protobuffer API doesn't allow you to deserialize
> > > out of a smaller portion of a byte array, you're out of luck and will
> > > have to do the copy like you've mentioned.
> > > I imagine, though, that there's some way around this in the
> > > protobuffer API - perhaps you can use a ByteArrayInputStream here to
> > > your advantage.
> > >
> > > Hope that helps
> > > -Todd
> > >
> > > On Wed, Apr 8, 2009 at 4:59 PM, bzheng wrote:
> > >
> > >>
> > >> I tried to store protocolbuffer as BytesWritable in a sequence file
> > >> <Text, BytesWritable>. It's stored using
> > >> SequenceFile.Writer(new Text(key), new
> > >> BytesWritable(protobuf.convertToBytes())). When reading the values from
> > >> key/value pairs using value.get(), it returns more than what's stored.
> > >> However, value.getSize() returns the correct number. This means in
> > >> order to convert the byte[] to protocol buffer again, I have to do
> > >> Arrays.copyOf(value.get(), value.getSize()). This happens on both
> > >> version 0.17.2 and 0.18.3. Does anyone know why this happens? Sample
> > >> sizes for a few entries in the sequence file below. The extra bytes in
> > >> value.get() all have values of zero.
> > >>
> > >> value.getSize(): 7066 value.get().length: 10599
> > >> value.getSize(): 36456 value.get().length: 54684
> > >> value.getSize(): 32275 value.get().length: 54684
> > >> value.getSize(): 40561 value.get().length: 54684
> > >> value.getSize(): 16855 value.get().length: 54684
> > >> value.getSize(): 66304 value.get().length: 99456
> > >> value.getSize(): 26488 value.get().length: 99456
> > >> value.getSize(): 59327 value.get().length: 99456
> > >> value.getSize(): 36865 value.get().length: 99456
> > >>
> > >> --
> > >> View this message in context:
> > >> http://www.nabble.com/BytesWritable-get%28%29-returns-more-bytes-then-what%27s-stored-tp22962146p22962146.html
> > >> Sent from the Hadoop core-user mailing list archive at Nabble.com.
--0016368e1b1183ba8e0467167e4b--
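For anyone reading this thread in the archive, here is a rough sketch of the behaviour discussed above. It does not use Hadoop at all: ReusableBytes is a hypothetical stand-in for BytesWritable, and its 1.5x growth factor is only an assumption that happens to match the sizes reported in the original message (7066 -> 10599, 36456 -> 54684, 66304 -> 99456). The helper names validText, boundedView and exactCopy are made up for illustration. The point is just the three ways of consuming only the first getSize() bytes of a reused buffer.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ReusedBufferSketch {

    /** Hypothetical stand-in for BytesWritable: one buffer, grown but never shrunk. */
    static final class ReusableBytes {
        private byte[] buffer = new byte[0];
        private int size = 0;

        /** Overwrites the logical contents; reallocates only when the data does not fit. */
        void set(byte[] data) {
            if (data.length > buffer.length) {
                // Assumed growth policy: 1.5x headroom, as the reported sizes suggest.
                buffer = new byte[data.length * 3 / 2];
            }
            System.arraycopy(data, 0, buffer, 0, data.length);
            size = data.length;
        }

        byte[] get() { return buffer; }  // whole backing array, stale tail included
        int getSize() { return size; }   // number of valid bytes
    }

    /** Offset/length overload: the consumer reads only the valid prefix. */
    static String validText(ReusableBytes value) {
        return new String(value.get(), 0, value.getSize(), StandardCharsets.US_ASCII);
    }

    /** Bounded stream view over the same array; no trimmed byte[] is allocated. */
    static InputStream boundedView(ReusableBytes value) {
        return new ByteArrayInputStream(value.get(), 0, value.getSize());
    }

    /** Fallback when the consumer insists on an exact-size array: pays for a copy. */
    static byte[] exactCopy(ReusableBytes value) {
        return Arrays.copyOf(value.get(), value.getSize());
    }

    public static void main(String[] args) throws Exception {
        ReusableBytes value = new ReusableBytes();
        // Two "records" read into the same instance, the second one shorter.
        value.set("a-long-first-record".getBytes(StandardCharsets.US_ASCII));
        value.set("short".getBytes(StandardCharsets.US_ASCII));

        System.out.println("valid bytes:       " + value.getSize());
        System.out.println("backing array:     " + value.get().length);
        // Reading the whole array picks up leftovers from the first record.
        String whole = new String(value.get(), StandardCharsets.US_ASCII);
        System.out.println("whole array text:  " + whole.trim());
        System.out.println("valid prefix text: " + validText(value));
        System.out.println("stream view bytes: " + boundedView(value).available());
        System.out.println("exact copy length: " + exactCopy(value).length);
    }
}
```

Note that the tail of the backing array here is not all zeros after the second record, which is the "undefined contents" case Todd describes. Whether a given deserializer offers an offset/length or stream entry point has to be checked against that library's own documentation.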