avro-dev mailing list archives

From "Doug Cutting (JIRA)" <j...@apache.org>
Subject [jira] Commented: (AVRO-24) benchmark bulk data
Date Thu, 25 Jun 2009 20:22:07 GMT

    [ https://issues.apache.org/jira/browse/AVRO-24?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724252#action_12724252 ]

Doug Cutting commented on AVRO-24:

> Are bulk transfers already part of the spec?

The idea is that bulk transfers can be efficiently implemented by just using the 'bytes' type
in parameter, field and/or return values.  When a large value of type bytes is transmitted,
it generates a separate frame at the transport layer.  Clients can then read and write such
large values without copying.  On write, if one passes a large ByteBuffer as a parameter,
field or return value, a reference is passed down and the buffer is written directly to the
socket.  Similarly, on read, the ByteBuffer that's read from the socket is returned directly
to the client as the value of the field, parameter or return value.
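
To make the no-copy path concrete, here is a minimal sketch of framing one large value.  It
assumes a frame is a 4-byte big-endian length followed by the payload; the class and method
names (FrameSketch, writeFrame, readFrame) are hypothetical and are not Avro's transport code.
The point is only that the caller's ByteBuffer is handed to the channel as-is, and the buffer
filled from the channel is what the caller gets back.

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.WritableByteChannel;

/** Hypothetical helper, not part of Avro: frames one large bytes value. */
final class FrameSketch {
  private FrameSketch() {}

  /** Writes a 4-byte big-endian length, then the payload, without copying the payload. */
  static void writeFrame(WritableByteChannel out, ByteBuffer payload) throws IOException {
    ByteBuffer header = ByteBuffer.allocate(4);   // ByteBuffer is big-endian by default
    header.putInt(payload.remaining());
    header.flip();
    while (header.hasRemaining()) {
      out.write(header);
    }
    // duplicate() shares the caller's bytes but has its own position,
    // so the caller's buffer is left untouched.
    ByteBuffer view = payload.duplicate();
    while (view.hasRemaining()) {
      out.write(view);
    }
  }

  /** Reads one frame and returns the very buffer that was filled from the channel. */
  static ByteBuffer readFrame(ReadableByteChannel in) throws IOException {
    ByteBuffer header = ByteBuffer.allocate(4);
    readFully(in, header);
    header.flip();
    ByteBuffer payload = ByteBuffer.allocate(header.getInt());
    readFully(in, payload);
    payload.flip();
    return payload;
  }

  private static void readFully(ReadableByteChannel in, ByteBuffer dst) throws IOException {
    while (dst.hasRemaining()) {
      if (in.read(dst) < 0) {
        throw new EOFException("channel closed mid-frame");
      }
    }
  }
}
```

With a SocketChannel as the target, the payload goes from the caller's buffer to the socket
with no intermediate array in user code.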


This is not yet perfect.  First, while Avro permits object reuse, its RPC framework does not.
So, if an RPC method returns a ByteBuffer, a new ByteBuffer will be allocated per call.
However, we could easily add a pool here to address this.
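
Such a pool might look roughly like the following.  This is only a sketch: ByteBufferPool and
its methods are hypothetical names, it assumes all pooled buffers share one fixed capacity,
and it says nothing about where the RPC framework would call acquire and release.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ConcurrentLinkedQueue;

/** Hypothetical pool of fixed-capacity buffers, not part of Avro. */
final class ByteBufferPool {
  private final ConcurrentLinkedQueue<ByteBuffer> free = new ConcurrentLinkedQueue<>();
  private final int capacity;

  ByteBufferPool(int capacity) {
    this.capacity = capacity;
  }

  /** Returns a recycled buffer if one is available, otherwise allocates a new one. */
  ByteBuffer acquire() {
    ByteBuffer b = free.poll();
    if (b == null) {
      return ByteBuffer.allocate(capacity);
    }
    b.clear();  // reset position and limit; old contents are simply overwritten
    return b;
  }

  /** Hands a buffer back for reuse; buffers of a different capacity are dropped. */
  void release(ByteBuffer b) {
    if (b.capacity() == capacity) {
      free.offer(b);
    }
  }
}
```

The caller must not touch a buffer after releasing it, which is the main cost of this
approach compared with allocating per call.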

Second, sendfile is not yet supported.  This would require using an alternate representation
for values of type bytes.  One might define something like:

interface ByteChannelable {
  int write(WritableByteChannel c);
  int read(ReadableByteChannel c);
  byte[] bytes();
  void bytes(byte[] b);
  ByteBuffer buffer();
}

Then one could implement a version of this that contains a FileChannel and a start and end
position whose read and write methods would call transferFrom and transferTo.
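
A file-backed version along those lines could be sketched as follows.  The names ChannelBytes
and FileRegionBytes are hypothetical; the interface is trimmed to the two channel methods, and
they return long rather than int because transferTo and transferFrom report long counts.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.WritableByteChannel;

/** Trimmed, hypothetical variant of the interface sketched above. */
interface ChannelBytes {
  long write(WritableByteChannel c) throws IOException;
  long read(ReadableByteChannel c) throws IOException;
}

/** A bytes value backed by the region [start, end) of a file. */
final class FileRegionBytes implements ChannelBytes {
  private final FileChannel file;
  private final long start;
  private final long end;

  FileRegionBytes(FileChannel file, long start, long end) {
    if (start < 0 || end < start) {
      throw new IllegalArgumentException("bad region");
    }
    this.file = file;
    this.start = start;
    this.end = end;
  }

  /** Sends the region to c.  transferTo may move fewer bytes than asked, so loop. */
  @Override
  public long write(WritableByteChannel c) throws IOException {
    long pos = start;
    while (pos < end) {
      long n = file.transferTo(pos, end - pos, c);
      if (n <= 0) {
        break;  // file shorter than the region, or target cannot accept more right now
      }
      pos += n;
    }
    return pos - start;
  }

  /** Fills the region from c.  Stops early if c has no more bytes to give. */
  @Override
  public long read(ReadableByteChannel c) throws IOException {
    long pos = start;
    while (pos < end) {
      long n = file.transferFrom(c, pos, end - pos);
      if (n <= 0) {
        break;
      }
      pos += n;
    }
    return pos - start;
  }
}
```

Whether transferTo actually avoids copying through user space depends on the operating system
and on the target channel type, so any gain here would need to be measured.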

We could switch to such a representation by default, instead of using ByteBuffer (which unfortunately
cannot be extended).  Note that any Requestor and Responder can easily be extended to use
a different DatumReader, so we would not have to make this the default.

But first, I thought we'd benchmark things without these changes to get a baseline.

> benchmark bulk data
> -------------------
>                 Key: AVRO-24
>                 URL: https://issues.apache.org/jira/browse/AVRO-24
>             Project: Avro
>          Issue Type: Task
>          Components: java
>            Reporter: Doug Cutting
>            Assignee: Doug Cutting
>             Fix For: 1.0.0
> It would be good to validate that the RPC wire format is capable of transmitting bulk
> data efficiently.  In particular, to be used for HDFS file access, it must be able to, when
> including file data in an RPC response, or writing file data in an RPC request:
>  - saturate a disk's throughput or a network interface; and
>  - not consume much CPU.
> In other words, Avro's RPC should not be a bottleneck in the transfer of file data from
> a remote disk to an application or vice versa, and moreover it should leave the vast majority
> of the CPU for the application.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
