avro-user mailing list archives

From Russell Jurney <russell.jur...@gmail.com>
Subject AvroStorage/Avro Schema Question
Date Fri, 30 Mar 2012 01:05:19 GMT
Is it possible to name string elements in the schema of an array?
 Specifically, below I want to name the email addresses in the
from/to/cc/bcc/reply_to fields, so they don't get auto-named ARRAY_ELEM by
Pig's AvroStorage.  I know I can probably fix this in Java in the Pig
AvroStorage UDF, but I'm hoping I can also fix it more easily in the
schema.  Last time I read Avro's array docs in this context, my hit-points
dropped by a third, so pardon me if I've not rtfm this time :)

Complete description of what I'm doing follows:

Avro schema for my emails:

    {
        "namespace": "agile.data.avro",
        "name": "Email",
        "type": "record",
        "fields": [
            {"name":"message_id", "type": ["string", "null"]},
            {"name":"from","type": [{"type":"array", "items":"string"}, "null"]},
            {"name":"to","type": [{"type":"array", "items":"string"}, "null"]},
            {"name":"cc","type": [{"type":"array", "items":"string"}, "null"]},
            {"name":"bcc","type": [{"type":"array", "items":"string"}, "null"]},
            {"name":"reply_to", "type": [{"type":"array", "items":"string"}, "null"]},
            {"name":"in_reply_to", "type": [{"type":"array", "items":"string"}, "null"]},
            {"name":"subject", "type": ["string", "null"]},
            {"name":"body", "type": ["string", "null"]},
            {"name":"date", "type": ["string", "null"]}
        ]
    }
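
For reference, an Avro array only declares an "items" type and gives its elements no name, while record fields are always named. So one schema-side option would be to make each array hold one-field records instead of bare strings. The Python sketch below rewrites a schema that way. The helper name_array_items, the element field name "address", and the "_item" record-name suffix are all made up for illustration, and whether AvroStorage would then show that field name in place of ARRAY_ELEM is an untested assumption.

```python
import json


def name_array_items(schema, item_field="address"):
    """Return a copy of an Avro record schema in which every array of
    strings becomes an array of one-field records.

    Hypothetical helper: 'item_field' and the '_item' suffix are
    illustrative choices, not anything AvroStorage requires.
    """
    out = json.loads(json.dumps(schema))  # cheap deep copy of plain JSON data
    for field in out["fields"]:
        declared = field["type"]
        # A field type is either a single type or a union (a JSON list).
        members = declared if isinstance(declared, list) else [declared]
        for member in members:
            is_string_array = (
                isinstance(member, dict)
                and member.get("type") == "array"
                and member.get("items") == "string"
            )
            if is_string_array:
                member["items"] = {
                    "type": "record",
                    # Record names must be unique within a schema.
                    "name": field["name"] + "_item",
                    "fields": [{"name": item_field, "type": "string"}],
                }
    return out
```

Note that this changes the data as well as the schema: every address would have to be written as a small record, so existing Avro files would need to be regenerated.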

Pig to publish my Avros:

grunt> emails = load '/me/tmp/emails' using AvroStorage();
grunt> describe emails

emails: {message_id: chararray,from: {PIG_WRAPPER: (ARRAY_ELEM:
chararray)},to: {PIG_WRAPPER: (ARRAY_ELEM: chararray)},cc: {PIG_WRAPPER:
(ARRAY_ELEM: chararray)},bcc: {PIG_WRAPPER: (ARRAY_ELEM:
chararray)},reply_to: {PIG_WRAPPER: (ARRAY_ELEM: chararray)},in_reply_to:
{PIG_WRAPPER: (ARRAY_ELEM: chararray)},subject: chararray,body:
chararray,date: chararray}

grunt> store emails into 'mongodb://localhost/agile_data.emails' using

My emails in MongoDB:

> db.emails.findOne()
{
    "_id" : ObjectId("4f738a35414e113e75707b97"),
    "message_id" : "<4f71abddc19ec_145449e3898474d2@li169-134.mail>",
    "from" : [
        {
            "ARRAY_ELEM" : "daily@jobchangealerts.com"
        }
    ],
    "to" : [
        {
            "ARRAY_ELEM" : "Russell.jurney@gmail.com"
        }
    ],
    "cc" : null,
    "bcc" : null,
    "reply_to" : null,
    "in_reply_to" : null,
    "subject" : "Daily Job Change Alerts from SalesLoft",
    "body" : "Daily Job Change Alerts from SalesLoft",
    "date" : "2012-03-27T08:00:29"
}

My email on screen:

[image: Inline image 1]

My face when I see ARRAY_ELEM, because it means more complex presentation
code: *:(*
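
To make "more complex presentation code" concrete, here is a rough Python sketch of the unwrapping step the display layer would need if the names cannot be fixed upstream. It assumes the document arrives as a plain dict shaped like the MongoDB record above; the function name unwrap_array_elems is made up for illustration.

```python
def unwrap_array_elems(doc, key="ARRAY_ELEM"):
    """Return a copy of doc in which lists of {key: value} wrappers
    are flattened to lists of bare values.

    Hypothetical helper: other values, including nulls, pass through
    unchanged.
    """
    flat = {}
    for name, value in doc.items():
        if isinstance(value, list):
            flat[name] = [
                item[key] if isinstance(item, dict) and key in item else item
                for item in value
            ]
        else:
            flat[name] = value
    return flat
```

With this, a "from" value of [{"ARRAY_ELEM": "daily@jobchangealerts.com"}] becomes ["daily@jobchangealerts.com"], which is what the template would rather iterate over.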
Russell Jurney twitter.com/rjurney russell.jurney@gmail.com datasyndrome.com
