avro-user mailing list archives

From Public Network Services <publicnetworkservi...@gmail.com>
Subject Re: Generic Avro Classification and Deserialization
Date Sat, 19 Jan 2013 00:46:06 GMT
Thanks for the help!

I am trying to find sample Avro files, and it is proving surprisingly
difficult (at least with the Google searches I tried).

Would you know of any such files (preferably large-ish) that are publicly
available?

On Fri, Jan 18, 2013 at 6:53 AM, Terry Healy <thealy@bnl.gov> wrote:

> Check out avro-tools. With this you can dump the schema for a file,
> extract the metadata, or export it in several formats:
> ----------------
> Available tools:
>       compile  Generates Java code for the given schema.
>    fragtojson  Renders a binary-encoded Avro datum as JSON.
>      fromjson  Reads JSON records and writes an Avro data file.
>      fromtext  Imports a text file into an avro data file.
>       getmeta  Prints out the metadata of an Avro data file.
>     getschema  Prints out schema of an Avro data file.
>           idl  Generates a JSON schema from an Avro IDL file
>        induce  Induce schema/protocol from Java class/interface via reflection.
>    jsontofrag  Renders a JSON-encoded Avro datum as binary.
>       recodec  Alters the codec of a data file.
>    rpcreceive  Opens an RPC Server and listens for one message.
>       rpcsend  Sends a single RPC message.
>        tether  Run a tethered mapreduce job.
>        tojson  Dumps an Avro data file as JSON, one record per line.
>        totext  Converts an Avro data file to a text file.
>   trevni_meta  Dumps a Trevni file's metadata as JSON.
> trevni_random  Create a Trevni file filled with random instances of a schema.
> trevni_tojson  Dumps a Trevni file as JSON.
> -Terry
> On 01/17/2013 05:11 PM, Public Network Services wrote:
> > Folks,
> >
> > I am involved in a project to extract data from a large number of files
> > (to be provided at some point), in numerous formats, among which are some
> > Avro files (both binary and JSON-encoded), and thus I am looking for the
> > best way to tackle this.
> >
> > One of the things we would (ideally) like to do is auto-classify the
> > data generically, i.e. read a few lines or bytes off a file and be able
> > to tell what kind of format it is.
> >
> > This is fairly easy to do with, say, (non-Avro) JSON files, but I am not
> > sure how this would be done for Avro.
> >
> > For one thing, there is the necessity of a Schema, about which the
> > documentation says that
> >
> >   * "Avro data is always serialized with its schema. Files that store
> >     Avro data should always also include the schema for that data in the
> >     same file."
> >
> > However, the Java code examples posted on the project website imply that
> > the Schema is supplied as a separate file and I am not sure whether this
> > is only the case with RPC.
> >
> > Are there any code examples for detecting the encoding format
> > (binary/json) of the data file, assessing whether there is a schema
> > embedded in it and extracting that schema?
> >
> > Thanks!
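
On the question of detecting an embedded schema: as far as I understand the
Avro specification, an object container file starts with the four bytes
"Obj" followed by 0x01, then a metadata map whose "avro.schema" entry holds
the schema as JSON text. The following is only a minimal sketch of that
check, using the Python standard library rather than the Avro library. The
function names (read_long, read_container_header, embedded_schema) are my
own and purely illustrative, and it assumes a well-formed header.

```python
import io
import json

# First four bytes of an Avro object container file, per the Avro spec.
AVRO_MAGIC = b"Obj\x01"


def read_long(stream):
    """Decode one Avro long (zigzag-encoded, base-128 varint)."""
    shift = 0
    value = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("truncated varint")
        value |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:
            break
        shift += 7
    # Undo zigzag: even values are non-negative, odd values are negative.
    return (value >> 1) ^ -(value & 1)


def read_bytes(stream):
    """Read a length-prefixed byte string (Avro 'bytes' / 'string')."""
    length = read_long(stream)
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("truncated byte string")
    return data


def read_container_header(stream):
    """Return the file metadata as {str: bytes}, or None if the stream
    does not start with the Avro container magic."""
    if stream.read(4) != AVRO_MAGIC:
        return None
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:  # a zero count terminates the map
            break
        if count < 0:  # negative count: a byte size follows, skip it
            count = -count
            read_long(stream)
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    return meta


def embedded_schema(stream):
    """Return the embedded schema parsed from JSON, or None if absent."""
    meta = read_container_header(stream)
    if meta is None or "avro.schema" not in meta:
        return None
    return json.loads(meta["avro.schema"].decode("utf-8"))
```

To try it, open a file in binary mode and pass the handle, for example
embedded_schema(open(path, "rb")), where path is whatever file is being
classified. A None result only means the file is not a container file. A
bare binary or JSON-encoded datum carries no magic bytes and no schema, so
as far as I can tell it cannot be recognised as Avro this way and the
schema has to be supplied separately.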
