crunch-user mailing list archives

From B Kersbergen
Subject read an org.apache.avro.mapreduce.KeyValuePair datafile
Date Fri, 04 Jul 2014 12:06:04 GMT

I’m stuck trying to read an Avro KeyValuePair data file with Crunch.

This is a header dump of my Avro data file:

key/value pair","fields":[{"name":"key","type":"string","doc":"The
etc etc

I’m only interested in the LiveTrackingLine object, but I probably need
to read the whole KeyValuePair object and extract the LiveTrackingLine
in the Crunch pipeline.

This is the code I have so far.

String inputPath = args[0];
String outputPath = args[1];
Pipeline pipeline = new MRPipeline(ExtractViewsJob.class,
ExtractViewsJob.class.getSimpleName(), getConf());
PCollection<AvroWrapper <Pair<String, LiveTrackingLine>>> lines = AvroFileSource<AvroWrapper<Pair<String,

I’m a bit lost on the last part, where I configure the 'pipeline'
object with the right Avro schema(s) and input directory.

Can someone help me with this? Because my schema is very complex, I
want to parse this as a ‘specific’ Avro representation rather than a
‘generic’ or ‘reflective’ one. This is also a learning experience for
me in using Avro with Crunch.

Kind regards,
