hadoop-pig-dev mailing list archives

From "Alan Gates (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-794) Use Avro serialization in Pig
Date Fri, 10 Jul 2009 16:19:14 GMT

    [ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12729700#action_12729700 ]

Alan Gates commented on PIG-794:

I agree with Doug's comments that it's better to use an API to build the schema, since that
will give us compile-time checking.  I think it will also (hopefully) make the schema easier
to figure out when reading the code, as it will avoid the need to read JSON directly.

I have a general question on the approach.  This is a direct port of Pig's BinStorage to use
Avro, including the writing of indicator bytes for types.  I do not have a deep knowledge
of Avro.  But I had assumed that since it was a de/serialization framework with types, part
of what it would provide was type recognition.  That is, can't this code rely on Avro to set
the type for it?  Do we need to be writing those indicator bytes ourselves?  Perhaps this
is the same comment that Doug is making about using GenericDatumReader and addField.
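
To make the question above concrete, here is a toy sketch of schema-driven encoding: once the
reader and writer share a schema, no per-value indicator byte is needed.  This is NOT Avro's
wire format or API, and SCHEMA, write_record and read_record are hypothetical names used only
for illustration.  It also only covers fields whose types are known up front.

```python
import struct

# Hypothetical fixed schema: (field name, type name) in write order.
SCHEMA = [("id", "long"), ("score", "double"), ("name", "string")]


def write_record(schema, record):
    """Encode a record using only the schema; no per-value type tag is written."""
    out = bytearray()
    for field, type_name in schema:
        value = record[field]
        if type_name == "long":
            out += struct.pack(">q", value)
        elif type_name == "double":
            out += struct.pack(">d", value)
        elif type_name == "string":
            data = value.encode("utf-8")
            out += struct.pack(">i", len(data)) + data
        else:
            raise ValueError("unsupported type: " + type_name)
    return bytes(out)


def read_record(schema, buf):
    """Decode a record; the schema alone tells the reader which type comes next."""
    pos = 0
    record = {}
    for field, type_name in schema:
        if type_name == "long":
            (record[field],) = struct.unpack_from(">q", buf, pos)
            pos += 8
        elif type_name == "double":
            (record[field],) = struct.unpack_from(">d", buf, pos)
            pos += 8
        elif type_name == "string":
            (length,) = struct.unpack_from(">i", buf, pos)
            pos += 4
            record[field] = buf[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise ValueError("unsupported type: " + type_name)
    return record
```

A field whose type is only known at run time would still need some kind of tag, so this
sketch does not settle whether every indicator byte in the patch can go away.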

In response to Hong's comment, the sync marks are vulnerable as you point out.  But the loader
needs some way to find a proper starting place when it's handed any block but the initial
block of a file.  I wonder if we could create a new sync type.  It would always consist of
a 100 byte marker (say the first 25 prime numbers, or the first 25 digits of pi or something).
 We could then write a tuple with that sync type every 1000 records in the data.  A loader
that doesn't start at position 0 could then seek to the first sync tuple it finds before it
begins reading.  All loaders would read past the end of their block until they see a sync tuple.
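
A rough sketch of that sync scheme, under stated assumptions: the marker is taken to be the
first 25 primes written as 4-byte big-endian ints (100 bytes), records are framed with a
4-byte length prefix, and write_records / read_split are hypothetical names.  It is not what
the patch does, and it is just as vulnerable to a payload that happens to contain the marker
bytes.

```python
import struct


def first_primes(n):
    """Return the first n primes by trial division."""
    primes = []
    candidate = 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


# Assumption: 25 primes * 4 bytes each = a 100 byte marker.
SYNC_MARKER = b"".join(struct.pack(">i", p) for p in first_primes(25))
SYNC_INTERVAL = 1000


def write_records(records, interval=SYNC_INTERVAL):
    """Write length-prefixed records, with a sync marker after every `interval` records."""
    out = bytearray()
    for i, rec in enumerate(records):
        if i > 0 and i % interval == 0:
            out += SYNC_MARKER
        out += struct.pack(">i", len(rec)) + rec
    return bytes(out)


def read_split(buf, start, end):
    """Read the records owned by the split [start, end).

    A split that does not start at 0 first seeks to the next sync marker.
    Every split keeps reading past `end` until it hits a marker that begins
    at or after `end`, so each record is read by exactly one split.
    """
    pos = 0 if start == 0 else buf.find(SYNC_MARKER, start)
    if pos < 0:
        return []  # no marker at or after start: an earlier split reads this data
    records = []
    while pos < len(buf):
        if buf.startswith(SYNC_MARKER, pos):
            if pos >= end:
                break  # this marker belongs to the next split
            pos += len(SYNC_MARKER)
            continue
        (length,) = struct.unpack_from(">i", buf, pos)
        pos += 4
        records.append(buf[pos:pos + length])
        pos += length
    return records
```

With this rule a split that contains no marker at all returns nothing, and its records are
picked up by the preceding split as it reads past its own end.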

As for this being compatible with non-pig apps, that isn't the purpose of this AvroStorage
function.  This is for pig to pass data between MR jobs for itself.  Having a tool independent
storage format is a bigger project, as it requires agreeing on things like sync marks, how
to represent different Avro objects, etc.

> Use Avro serialization in Pig
> -----------------------------
>                 Key: PIG-794
>                 URL: https://issues.apache.org/jira/browse/PIG-794
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.2.0
>            Reporter: Rakesh Setty
>             Fix For: 0.2.0
>         Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, jackson-asl-0.9.4.jar,
> We would like to use Avro serialization in Pig to pass data between MR jobs instead of
> the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly
> better compared to BinStorage on our benchmarks.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.