avro-user mailing list archives

From Jaikit Savla <jaikit.sa...@yahoo.com>
Subject AvroStorage taking long time to load and iterate over records
Date Thu, 26 Mar 2015 22:50:21 GMT
I am seeing odd behavior: loading and iterating over Avro records via Pig's AvroStorage takes
much longer than iterating over the same records in a MapReduce job. Are there any known
issues, or any clues as to why AvroStorage would take so long?
Example: the schema I am using:

{
  "type": "record",
  "name": "Timber",
  "namespace": "com.timber.avro",
  "fields": [
    {
      "name": "identifier",
      "type": "string",
      "doc": "Identifier. NonNull."
    },
    {
      "name": "reservation",
      "type": [
        "null",
        {
          "type": "array",
          "items": {
            "name": "Reservation",
            "type": "record",
            "fields": [
              {
                "name": "bookingDate",
                "type": "long",
                "doc": "Timestamp in UTC. NonNull"
              },
              {
                "name": "code",
                "type": ["null", "string"],
                "doc": "Code.",
                "default": null
              }
            ]
          }
        }
      ],
      "default": null,
      "doc": "array of segment id which this urn belongs."
    }
  ]
}

---> Pig
Using Pig's AvroStorage, it takes more than 30 minutes simply to iterate over the records.
I have been adding more optional fields (like code) to the Reservation record above. Does
that affect how I should use AvroStorage?

register /json-simple-1.1.jar
register /piggybank.jar

records = LOAD '/data/*/one.avro'
          USING org.apache.pig.piggybank.storage.avro.AvroStorage('no_schema_check');

reservation = FOREACH records {
    selectHotelAtt = FOREACH reservation GENERATE bookingDate;
    GENERATE FLATTEN(selectHotelAtt.bookingDate) as bookingDate;
};

DUMP reservation;
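
For comparison with the MapReduce version, the nested FOREACH plus FLATTEN above amounts to emitting one bookingDate per entry in each record's reservation array. Below is a minimal plain-Java sketch of that logic. It uses in-memory lists and hypothetical names (FlattenBookingDates, Reservation, flattenBookingDates) rather than any Avro or Pig classes, and it skips null reservation arrays as an assumption; Pig's own handling of null bags may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FlattenBookingDates {

    /** Hypothetical stand-in for the Reservation record; only bookingDate is modeled. */
    static final class Reservation {
        final long bookingDate;

        Reservation(long bookingDate) {
            this.bookingDate = bookingDate;
        }
    }

    /**
     * Emits one bookingDate per reservation entry across all records.
     * Each element of 'records' is one record's reservation array, which may be
     * null because the schema declares the field as a ["null", array] union.
     */
    static List<Long> flattenBookingDates(List<List<Reservation>> records) {
        List<Long> out = new ArrayList<>();
        for (List<Reservation> reservations : records) {
            if (reservations == null) {
                continue; // assumption: a null array contributes no rows
            }
            for (Reservation r : reservations) {
                out.add(r.bookingDate);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Reservation>> records = new ArrayList<>();
        records.add(Arrays.asList(new Reservation(100L), new Reservation(200L)));
        records.add(null);                 // record with no reservation array
        records.add(new ArrayList<>());    // record with an empty array
        System.out.println(flattenBookingDates(records));
    }
}
```

The work is a single pass over the data with no grouping or joins, so on its own this logic does not explain a runtime gap of the size described.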

--> MapReduce
When I use a MapReduce job to iterate through all the records, it completes in less than
2 minutes for about a million records.

Mapper interface:

        @Override
        public void map(final AvroKey<Timber> key, final NullWritable value, final Context context)
                throws IOException, InterruptedException
