crunch-user mailing list archives

From B Kersbergen <kersberg...@gmail.com>
Subject read an org.apache.avro.mapreduce.KeyValuePair datafile
Date Fri, 04 Jul 2014 12:06:04 GMT
Hi,

I’m stuck trying to read an Avro KeyValuePair data file with Crunch.

This is a header dump of my Avro data file:

{"type":"record","name":"KeyValuePair","namespace":"org.apache.avro.mapreduce",
 "doc":"A key/value pair",
 "fields":[{"name":"key","type":"string","doc":"The key"},
           {"name":"value","type":{"type":"record","name":"LiveTrackingLine","namespace":"com.bol.hadoop.enrich.record",
etc etc

I’m only interested in the LiveTrackingLine object, but I probably need
to read the whole KeyValuePair object and extract the LiveTrackingLine
from it in the Crunch pipeline.

This is the code I have so far.

String inputPath = args[0];
String outputPath = args[1];
Pipeline pipeline = new MRPipeline(ExtractViewsJob.class,
    ExtractViewsJob.class.getSimpleName(), getConf());
PCollection<AvroWrapper<Pair<String, LiveTrackingLine>>> lines =
    pipeline.read(new AvroFileSource<AvroWrapper<Pair<String, LiveTrackingLine>>>(inputPath));

I’m a bit lost on the last part, where I configure the 'pipeline'
object with the right Avro schema(s) and input directory.

Can someone help me with this? Because my schema is very complex, I
want to parse it as a ‘specific’ Avro representation rather than a
‘generic’ or ‘reflective’ one. This is also a learning experience
in using Avro with Crunch.

Kind regards,
Barrie
