avro-user mailing list archives

From Douglas Creager <doug...@creagertino.net>
Subject Re: Getting started with Avro + Reading from an Avro formatted file
Date Tue, 24 Jan 2012 15:54:00 GMT
> I want to be able to read from an Avro-formatted log file (specifically the history log file created at the end of a Hadoop job) and create a comma-separated file of certain log entries. I need a CSV file because that is the format accepted by the post-processing software I am working with (e.g. Matlab).
> Initially I was using a Bash script to grep and awk from this file and create my CSV file, because I needed very few values from it and a quick script just worked. I didn't try to learn what format the log file was in and make use of that. (My bad!) Now that I need to scale up and want a reliable way to parse it, I would like to try and do it the right way.
> My question is this: for the above goal, could you please guide me with steps I can follow - such as reading material and libraries I could try to use? As I go through the Quick Start Guide and FAQ, I see that a lot of the information there is geared toward someone who wants to use the data serialization and RPC functionality provided by Avro. Given that I only want to be able to "read", where should I start?
> I can comfortably script with Bash and Perl. Given that I only see support for Java, Python and Ruby, I think I can take this as an opportunity to learn Python and get up to speed.

You could also take a look at the C bindings.  We've recently added a couple of command-line
tools for outputting the contents of an Avro file to stdout: avrocat and avropipe.  avrocat
outputs each record in an Avro file on a single line, using the JSON encoding defined by the
Avro spec [1].  avropipe produces a separate line for each “field” in each record; its
output is (roughly speaking) what you'd get from piping the JSON encoding of each record through
the jsonpipe [2] tool.  (Technically speaking, it's what you get from putting all of the records
into a JSON array, and sending that array through jsonpipe.)
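To make that flattening concrete, here's a rough sketch in plain Python (standard library only; this is an illustration of the jsonpipe-style idea, not the actual avropipe implementation):

```python
import json

def flatten(value, path=""):
    """Yield (path, leaf) pairs in the style of jsonpipe/avropipe.

    Containers produce a marker line ("{}" for objects, "[]" for arrays);
    scalars produce their JSON encoding at a /-separated path.
    """
    if isinstance(value, dict):
        yield (path or "/", "{}")
        for key, child in value.items():
            yield from flatten(child, path + "/" + key)
    elif isinstance(value, list):
        yield (path or "/", "[]")
        for i, child in enumerate(value):
            yield from flatten(child, path + "/" + str(i))
    else:
        yield (path or "/", json.dumps(value))
```

Applied to an array of records, it produces the same kind of path/value lines avropipe prints in the example further down.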

[1] http://avro.apache.org/docs/current/spec.html#json_encoding
[2] https://github.com/dvxhouse/jsonpipe

So, with the example quickstop.db file, avrocat gives you:

  $ avrocat examples/quickstop.db | head
  {"ID": 1, "First": "Dante", "Last": "Hicks", "Phone": "(0)", "Age": 32}
  {"ID": 2, "First": "Randal", "Last": "Graves", "Phone": "(555) 123-5678", "Age": 30}
  {"ID": 3, "First": "Veronica", "Last": "Loughran", "Phone": "(555) 123-0987", "Age": 28}
  {"ID": 4, "First": "Caitlin", "Last": "Bree", "Phone": "(555) 123-2323", "Age": 27}
  {"ID": 5, "First": "Bob", "Last": "Silent", "Phone": "(555) 123-6422", "Age": 29}
  {"ID": 6, "First": "Jay", "Last": "???", "Phone": "(0)", "Age": 26}
  {"ID": 7, "First": "Dante", "Last": "Hicks", "Phone": "(1)", "Age": 32}
  {"ID": 8, "First": "Randal", "Last": "Graves", "Phone": "(555) 123-5678", "Age": 30}
  {"ID": 9, "First": "Veronica", "Last": "Loughran", "Phone": "(555) 123-0987", "Age": 28}
  {"ID": 10, "First": "Caitlin", "Last": "Bree", "Phone": "(555) 123-2323", "Age": 27}

While avropipe gives you:

  $ avropipe examples/quickstop.db | head -n 25
  /	[]
  /0	{}
  /0/ID	1
  /0/First	"Dante\u0000"
  /0/Last	"Hicks\u0000"
  /0/Phone	"(0)\u0000"
  /0/Age	32
  /1	{}
  /1/ID	2
  /1/First	"Randal\u0000"
  /1/Last	"Graves\u0000"
  /1/Phone	"(555) 123-5678\u0000"
  /1/Age	30
  /2	{}
  /2/ID	3
  /2/First	"Veronica\u0000"
  /2/Last	"Loughran\u0000"
  /2/Phone	"(555) 123-0987\u0000"
  /2/Age	28
  /3	{}
  /3/ID	4
  /3/First	"Caitlin\u0000"
  /3/Last	"Bree\u0000"
  /3/Phone	"(555) 123-2323\u0000"
  /3/Age	27

I'm seeing a bug there, though: those NUL terminators shouldn't appear in the output.  I'll open a ticket for that and fix it shortly.  But these tools might be exactly what you need, especially since the C bindings don't have any library dependencies to install.
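And for the CSV side, if you do pick up Python, a short standard-library script is enough to turn avrocat's JSON-per-line output into CSV.  Just a sketch; the field names are assumed from the quickstop.db example above:

```python
import csv
import io
import json

def json_lines_to_csv(lines, fields):
    """Turn avrocat-style JSON-per-line records into a CSV string."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields)
    writer.writeheader()
    for line in lines:
        if line.strip():
            writer.writerow(json.loads(line))
    return out.getvalue()

if __name__ == "__main__":
    import sys
    # Usage:  avrocat examples/quickstop.db | python avro2csv.py > out.csv
    print(json_lines_to_csv(sys.stdin, ["ID", "First", "Last", "Phone", "Age"]),
          end="")
```

The field list could also be pulled from the file's schema instead of hard-coding it, but that's the basic shape of the conversion.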

