Return-Path: Delivered-To: apmail-hadoop-avro-dev-archive@minotaur.apache.org Received: (qmail 76390 invoked from network); 10 Feb 2010 20:13:41 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.3) by minotaur.apache.org with SMTP; 10 Feb 2010 20:13:41 -0000 Received: (qmail 36534 invoked by uid 500); 10 Feb 2010 20:13:40 -0000 Delivered-To: apmail-hadoop-avro-dev-archive@hadoop.apache.org Received: (qmail 36424 invoked by uid 500); 10 Feb 2010 20:13:40 -0000 Mailing-List: contact avro-dev-help@hadoop.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: avro-dev@hadoop.apache.org Delivered-To: mailing list avro-dev@hadoop.apache.org Received: (qmail 36403 invoked by uid 99); 10 Feb 2010 20:13:40 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 10 Feb 2010 20:13:40 +0000 X-ASF-Spam-Status: No, hits=-2000.0 required=10.0 tests=ALL_TRUSTED X-Spam-Check-By: apache.org Received: from [140.211.11.140] (HELO brutus.apache.org) (140.211.11.140) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 10 Feb 2010 20:13:38 +0000 Received: from brutus.apache.org (localhost [127.0.0.1]) by brutus.apache.org (Postfix) with ESMTP id 41747234C48D for ; Wed, 10 Feb 2010 12:13:18 -0800 (PST) Message-ID: <208692351.188981265832798267.JavaMail.jira@brutus.apache.org> Date: Wed, 10 Feb 2010 20:13:18 +0000 (UTC) From: "Scott Carey (JIRA)" To: avro-dev@hadoop.apache.org Subject: [jira] Commented: (AVRO-392) Binary Decoder Performance and flexibility overhaul In-Reply-To: <445676622.221265194168898.JavaMail.jira@brutus.apache.org> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/AVRO-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832166#action_12832166 ] Scott Carey commented on AVRO-392: ---------------------------------- bq. Do we distinguish between malformed data (a varint that hasn't finished) and a genuine EOF? I think we don't need to (that's the job of the level high up), but just want to make sure. We might want to introduce something like a 'BinaryFormatException' to deal with the "this can't be an encoded int" -like cases. but we can do that in another ticket. bq. Why is BufferAccessor static? It seems that it attaches itself only ever to a BinaryDecoder object. I don't know that it makes a difference; might be just a style preference. 2 things. * Style -- Since a non-static private class has a hidden private member 'this' that points to the parent object, it has more overhead. So by default I start with static and remove it if needed. * Flexibility -- I wasn't sure if I'd want to share a source or 'reattach' one. For example, two decoders could share one source. Once it was coded to work as static, I left it that way. With the current use cases, it doesn't matter much. {quote}I'm a bit concerned that we haven't tested all of the edges cases here. Introducing buffering into the Decoder has certainly increased the code complexity. We could run a code coverage tool to see if all the new code is touched by the existing TestBinaryDecoder code. I didn't see (though I only had a chance to do a quick pass) tests that detect EOFExceptions on the tricky ensureBounds() codepath. Likewise, I'd be more comfortable if the two variants of ByteSource had their own unit tests.{quote} I've used EclEmma in eclipse to generate code coverage reports. The main parts missing (counting coverage from TestBinaryDecoder and TestDataFile) are: * Invalid Int/Long encoding. * Most skipXX methods. * BufferAccessor when detached. * .inputStream() read and skip methods. I'll add some tests to cover these. Several of the above are covered in other tests (for example, the skipXXX methods). > Binary Decoder Performance and flexibility overhaul > --------------------------------------------------- > > Key: AVRO-392 > URL: https://issues.apache.org/jira/browse/AVRO-392 > Project: Avro > Issue Type: Improvement > Components: java > Reporter: Scott Carey > Assignee: Scott Carey > Priority: Critical > Fix For: 1.3.0 > > Attachments: AVRO-392-preview.patch, AVRO-392.patch, AVRO-392.patch, AVRO-392.patch > > > BinaryDecoder has room for significant performance improvement. [AVRO-327|https://issues.apache.org/jira/browse/AVRO-327] has some preliminary work here, but in order to satisfy some use cases there is much more work to do. > I am opening a new ticket because the scope of the changes needed to do this the right way are larger. > I have done a large bulk of a new implementation that abstracts a 'ByteSource' from the BinaryDecoder. Currently BinaryDecoder is tightly coupled to InputStream. The ByteSource can wrap an InputStream, FileChannel, or byte[] in this version, but could be extended to support other channel types, sockets, etc. This abstraction allows the BinaryDecoder to buffer data from various sources while supporting interleaved access to the underlying data and greater flexibility going forward. > The performance of this abstraction has been heavily tuned so that maximum performance can be achieved even for slower ByteSource implementations. > For readers that must interleave reads on a stream with the decoder, this includes a > {code} > public InputStream inputStream(); > {code} > method on the decoder that can serve interleaved reads. > Additionally it will be necessary to have a constructor on BinaryDecoder that allows two BinaryDecoders to share a stream (and buffer). > Performance results on this new version is better than previous prototypes: > *current trunk BinaryDecoder* > {noformat} > ReadInt: 983 ms, 30.497877855999185 million entries/sec > ReadLongSmall: 1058 ms, 28.336666040111496 million entries/sec > ReadLong: 1518 ms, 19.75179889508437 million entries/sec > ReadFloat: 657 ms, 45.61031157924184 million entries/sec > ReadDouble: 761 ms, 39.387756709704355 million entries/sec > ReadBoolean: 331 ms, 90.4268145647456 million entries/sec > RepeaterTest: 7718 ms, 3.886725782038378 million entries/sec > NestedRecordTest: 1884 ms, 15.91964611687992 million entries/sec > ResolverTest: 8296 ms, 3.616055866616717 million entries/sec > MigrationTest: 21216 ms, 1.4139999570144013 million entries/sec > {noformat} > *buffering BinaryDecoder* > {noformat} > ReadInt: 187 ms, 160.22131904871262 million entries/sec > ReadLongSmall: 372 ms, 80.4863521975457 million entries/sec > ReadLong: 613 ms, 48.882385721129246 million entries/sec > ReadFloat: 253 ms, 118.16606270679061 million entries/sec > ReadDouble: 275 ms, 108.94314257389068 million entries/sec > ReadBoolean: 222 ms, 134.85327963176064 million entries/sec > RepeaterTest: 3335 ms, 8.993007936329503 million entries/sec > NestedRecordTest: 1152 ms, 26.0256943004597 million entries/sec > ResolverTest: 4213 ms, 7.120659335077578 million entries/sec > MigrationTest: 15310 ms, 1.9594884898992941 million entries/sec > {noformat} > Performance is 2x to 5x the throughput of trunk on most tests. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.