avro-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Scott Carey (JIRA)" <j...@apache.org>
Subject [jira] Commented: (AVRO-392) Binary Decoder Performance and flexibility overhaul
Date Thu, 04 Feb 2010 22:15:27 GMT

    [ https://issues.apache.org/jira/browse/AVRO-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829818#action_12829818

Scott Carey commented on AVRO-392:

Thanks for the feedback Doug.

One at a time:

bq. please don't make whitespace-only changes, reorder imports or make other extraneous changes
that just make the patch harder to read.

I wish there was a way to back out of it, but I experimented with a few dozen variations on
almost every bit of BinaryDecoder, so its almost impossible to track down the white space
changes without a lot of trial and error.  I can fix the import changes.   DataFileReader
is also heavily modified.  The other classes I can clean up.

# the new JsonDecoder constuctors have bad javadoc.
# the BinaryDecoder javadoc has been lost

I'll clean up the javadoc.  Are there any places that could be better documented or explained?

bq. why is DataFileStream#vin now protected? can't we keep this package protected?

Not sure.  I'll change it back.

bq. does ByteSource need to be public? I'd prefer it were not.

Good catch.

# finally, you've added new Decoder constructors and init() methods for File and byte[], but,
as far as I can see, these are not yet used. can we separate this from the rest of this patch
to be considered in a separate issue? the different ByteSource implementations could be removed
for now too. if i'm not mistaken, you're getting the above speedups with InputStream-based
methods only. i'd prefer if this issue could focus on the minimal API changes needed to get
these performance improvements. if further performance improvements can be had with API changes,
let's do that separately.

The byte[] initializer and ByteSource is used for optimizations on file reading with NullCodec,
and for BinaryData$Decoders.set().  The File initializer and FileChannelByteSource is more
of a prototype.   I can move out for now since it is intended for a future optimization for
the DataFileReader, it might be best as a FileChannel anyway.    I did need to prototype the
other ByteSource variants to make sure the concept was flexible enough for future evolution.
 I'll move all these things to another patch for a different issue.

What about the isEnd() method on BinaryDecoder?  Should this be moved to Decoder?  Its public,
but it could be package protected if we move the classes in org.apache.avro.file into o.a.a.io.
 This would also allow ByteSource to be shared between file types, codecs, and decoders. 
I suspect there will be more things we want to share between classes in o.a.a.file and o.a.a.io
since these are so tightly related.

A heads-up:  the same approach will help BinaryEncoder, but we need a good set of tests first.
 That side of things should be easier because Encoder already has buffering semantics.

> Binary Decoder Performance and flexibility overhaul
> ---------------------------------------------------
>                 Key: AVRO-392
>                 URL: https://issues.apache.org/jira/browse/AVRO-392
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>            Reporter: Scott Carey
>            Assignee: Scott Carey
>            Priority: Critical
>             Fix For: 1.3.0
>         Attachments: AVRO-392-preview.patch, AVRO-392.patch
> BinaryDecoder has room for significant performance improvement.  [AVRO-327|https://issues.apache.org/jira/browse/AVRO-327]
has some preliminary work here, but in order to satisfy some use cases there is much more
work to do.
> I am opening a new ticket because the scope of the changes needed to do this the right
way are larger.
> I have done a large bulk of a new implementation that abstracts a 'ByteSource' from the
BinaryDecoder.  Currently BinaryDecoder is tightly coupled to InputStream.  The ByteSource
can wrap an InputStream, FileChannel, or byte[] in this version, but could be extended to
support other channel types, sockets, etc.  This abstraction allows the BinaryDecoder to buffer
data from various sources while supporting interleaved access to the underlying data and greater
flexibility going forward.
> The performance of this abstraction has been heavily tuned so that maximum performance
can be achieved even for slower ByteSource implementations.
> For readers that must interleave reads on a stream with the decoder, this includes a
> {code}
> public InputStream inputStream();
> {code}
> method on the decoder that can serve interleaved reads.  
> Additionally it will be necessary to have a constructor on BinaryDecoder that allows
two BinaryDecoders to share a stream (and buffer).
> Performance results on this new version is better than previous prototypes:
> *current trunk BinaryDecoder*
> {noformat}
> ReadInt: 983 ms, 30.497877855999185 million entries/sec
> ReadLongSmall: 1058 ms, 28.336666040111496 million entries/sec
> ReadLong: 1518 ms, 19.75179889508437 million entries/sec
> ReadFloat: 657 ms, 45.61031157924184 million entries/sec
> ReadDouble: 761 ms, 39.387756709704355 million entries/sec
> ReadBoolean: 331 ms, 90.4268145647456 million entries/sec
> RepeaterTest: 7718 ms, 3.886725782038378 million entries/sec
> NestedRecordTest: 1884 ms, 15.91964611687992 million entries/sec
> ResolverTest: 8296 ms, 3.616055866616717 million entries/sec
> MigrationTest: 21216 ms, 1.4139999570144013 million entries/sec
> {noformat}
> *buffering BinaryDecoder*
> {noformat}
> ReadInt: 187 ms, 160.22131904871262 million entries/sec
> ReadLongSmall: 372 ms, 80.4863521975457 million entries/sec
> ReadLong: 613 ms, 48.882385721129246 million entries/sec
> ReadFloat: 253 ms, 118.16606270679061 million entries/sec
> ReadDouble: 275 ms, 108.94314257389068 million entries/sec
> ReadBoolean: 222 ms, 134.85327963176064 million entries/sec
> RepeaterTest: 3335 ms, 8.993007936329503 million entries/sec
> NestedRecordTest: 1152 ms, 26.0256943004597 million entries/sec
> ResolverTest: 4213 ms, 7.120659335077578 million entries/sec
> MigrationTest: 15310 ms, 1.9594884898992941 million entries/sec
> {noformat}
> Performance is 2x to 5x the throughput of trunk on most tests.  

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message