avro-user mailing list archives

From Christophe Taton <ta...@wibidata.com>
Subject Re: Scala API
Date Wed, 30 May 2012 23:26:40 GMT
Thanks a lot for your replies!

On Wed, May 30, 2012 at 2:52 PM, Scott Carey <scottcarey@apache.org> wrote:

> This would be fantastic.  I would be excited to see it.  It would be great
> to see a Scala language addition to the project if you wish to contribute.
> I believe there have been a few other Scala Avro attempts by others over
> time.   I recall one where all records were case classes (but this broke at
> 22 fields).
> Another thing to look at is:
> http://code.google.com/p/avro-scala-compiler-plugin/
> Perhaps we can get a few of the other people who have developed Scala Avro
> tools to review/comment or contribute as well?

That would be great!
I just filed https://issues.apache.org/jira/browse/AVRO-1105 to record
feedback there.
I will file more targeted issues and post an initial patch soon.

On 5/29/12 11:04 PM, "Christophe Taton" <taton@wibidata.com> wrote:
> Hi people,
> Is there interest in a custom Scala API for Avro records and protocols?
> I am currently working on a schema compiler for Scala, but before I go
> deeper, I would really like to have external feedback.
> I would especially like to hear from anyone who has opinions on how to map
> Avro types onto Scala types.
> Here are a few hints on what I've been trying so far:
>    - Records are compiled into two forms: mutable and immutable.
> Very nice.
>    - To avoid collisions with Java generated classes, scala classes are
>    generated in a .scala sub-package.
>    - Avro arrays are translated to Seq/List when immutable and
>    Buffer/ArrayBuffer when mutable.
>    - Avro maps are translated to immutable or mutable Map/HashMap.
>    - Bytes/Fixed are translated to Seq[Byte] when immutable and
>    Buffer[Byte] when mutable.
>    - Avro unions are currently translated into Any, but I plan to:
>    - translate union{null, X} into Scala Option[X]
>       - compile union {T1, T2, T3} into custom case classes to have
>       proper type checking and pattern matching.
> If you have a record R1, it compiles to a Scala class.  If you put it in a
> union of {T1, String}, what does the case class for the union look like?
>  Is it basically a wrapper like a specialized Either[T1, String] ?   Maybe
> Scala will get Union types later to push this into the compiler instead of
> object instances :)

I was thinking of using Either[X,Y], but this does not scale beyond two
branches.
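
To illustrate why Either gets awkward past two branches, here is a minimal
sketch of a three-way union forced into nested Eithers. Record1 is a
hypothetical stand-in for a generated record class, and Seq[Int] is assumed
as the mapping for array<int>:

```scala
// Hypothetical stand-in for a generated record class.
case class Record1(name: String)

object EitherUnionSketch {
  // union { int, array<int>, Record1 } expressed with nested Eithers.
  type Field1 = Either[Int, Either[Seq[Int], Record1]]

  // Every extra branch adds one more level of Left/Right nesting.
  def describe(field1: Field1): String = field1 match {
    case Left(i)           => s"int: $i"
    case Right(Left(xs))   => s"array of ${xs.size} ints"
    case Right(Right(rec)) => s"record: ${rec.name}"
  }
}
```

The branch names (Left, Right(Left), Right(Right)) say nothing about what
each branch holds, and the nesting grows with every type added to the union.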

Assuming this union schema:

record Rec {
  union { int, array<int>, Record1 } field1;
}

If unions are compiled to Any, Scala can match on simple types:

field1 match {
  case value: Int => ...
  case value: Array[Int] => ...
  case value: Record1 => ...
}

But this does not work in all cases because of type erasure. Maybe this
would work with Scala 2.10 and runtime type reification. In any case, Any
would not provide type safety...
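
A minimal sketch of the erasure problem, assuming array<int> is mapped to
Seq[Int] rather than to a JVM array (JVM arrays keep their element type at
runtime, so a match on Array[Int] itself is safe; generic collections are not).
The object and method names are made up for illustration:

```scala
object ErasureSketch {
  // The element type of a generic collection is erased at runtime, so the
  // Seq[Int] case matches any Seq and the compiler emits an "unchecked"
  // warning for it.
  def describe(value: Any): String = value match {
    case _: Int      => "int"
    case _: Seq[Int] => "array<int>" // also matches Seq[String]
    case _           => "other"
  }
}
```

With this definition, ErasureSketch.describe(Seq("a")) returns "array<int>"
even though the sequence holds strings, which is the kind of silent mismatch
a union typed as Any cannot rule out.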

For now, I am planning on generating the following:

abstract class Field1Union
case class Field1Int(data: Int) extends Field1Union
case class Field1ArrayInt(data: Array[Int]) extends Field1Union
case class Field1Record1(data: Record1) extends Field1Union

Each case class only has one constructor parameter, so this should not hit
the 22 constructor parameters limit of case classes.
Constructing a record would look like:

val rec = new Rec(field1=new Field1Int(1))
or val rec = new Rec(field1=new Field1ArrayInt(...))

And reading the union field would look like:

rec.field1 match {
  case Field1Int(intValue) => ...
  case Field1ArrayInt(array) => ...
  case Field1Record1(rec1) => ...
}
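
Put together, the generated union wrappers might look like the self-contained
sketch below. It is not the actual generator output: Record1 and Rec are
simplified stand-ins, Seq[Int] is assumed for array<int>, and the `sealed`
modifier is an addition that lets the compiler warn about non-exhaustive
matches.

```scala
// Hypothetical stand-in for a generated record class.
case class Record1(name: String)

// One wrapper case class per union branch; `sealed` is an assumption here.
sealed abstract class Field1Union
case class Field1Int(data: Int) extends Field1Union
case class Field1ArrayInt(data: Seq[Int]) extends Field1Union
case class Field1Record1(data: Record1) extends Field1Union

// Simplified record holding only the union field.
case class Rec(field1: Field1Union)

object UnionSketch {
  // Each branch is matched by its wrapper type, so erasure is not an issue.
  def describe(rec: Rec): String = rec.field1 match {
    case Field1Int(intValue)   => s"int: $intValue"
    case Field1ArrayInt(array) => s"array of ${array.size} ints"
    case Field1Record1(rec1)   => s"record: ${rec1.name}"
  }
}
```

For example, UnionSketch.describe(Rec(field1 = Field1Int(1))) returns
"int: 1". Dropping one of the three cases produces a compiler warning because
the hierarchy is sealed.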


>    - Scala records provide a method encode(encoder) to serialize as
>    binary into a byte stream (appears ~30% faster than SpecificDatumWriter).
>    - Scala mutable records provide a method decode(decoder) to
>    deserialize a byte stream (appears ~25% faster than SpecificDatumReader).
> I have some plans to improve {Generic,Specific}Datum{Reader,Writer}  in
> Java, I would be interested in seeing how the Scala one here works.  The
> Java one is slowed by traversing too many data structures that represent
> decisions that could be pre-computed rather than repeatedly parsed for each
> record.

The Scala reader/writer is very straightforward. It is a shortcut that most
likely does not work in all cases (especially when decoding from another
schema version).
If you want to have a look, I attached one schema I am using for testing
and the generated code.

>    - Scala records implement the SpecificRecord Java interface (with some
>    overhead), so one may still use the SpecificDatumReader/Writer when the
>    custom encoder/decoder methods cannot be used.
>    - Mutable records can be converted to immutable (ie. can act as
>    builders).
> Thanks,
