avro-user mailing list archives

From Eric Turcotte <lumin...@gmail.com>
Subject Re: Avro MapReduce (MR1): AvroMultipleOutputs using Avro 1.7.3 and CDH 4.5.0
Date Thu, 16 Jan 2014 20:00:14 GMT
I didn't have much luck with Avro 1.7.3.  A bugfix that went into 1.7.4
made the multiple outputs easier to use, and the methods changed quite a
bit between the two versions.  The example below worked for me; the rest
of this message assumes you are using 1.7.4.

As for outputting to different directories, I think this would be possible
if you subclass AvroKeyOutputFormat and override the protected method
getAvroFileOutputStream.  When the writer is initialized, this method
creates the output stream with the work path as the directory and
namedOutput-r-00000.avro as the filename.  You could probably write to an
entirely different path instead.  Since getAvroFileOutputStream takes
the TaskAttemptContext, the target path could be read from the job
configuration.
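
To make the naming concrete, here is a minimal sketch in plain Java of how such an override might build its target path.  It only does the string work and has no Hadoop dependency.  The class name NamedOutputPaths, the base directory, and the one-subdirectory-per-named-output layout are all assumptions made for illustration, not anything Avro provides.  In a real subclass the base directory would presumably come from context.getConfiguration(), and the stream would be created through the Hadoop FileSystem for that path.

```java
import java.util.Locale;

// Hypothetical helper, not part of Avro or Hadoop.
public class NamedOutputPaths {

    // Builds a name in the "<namedOutput>-r-<5-digit partition>.avro" form
    // described above, e.g. "Value-r-00000.avro" for reducer partition 0.
    public static String fileName(String namedOutput, int partition) {
        return String.format(Locale.ROOT, "%s-r-%05d.avro", namedOutput, partition);
    }

    // Assumed layout: one subdirectory per named output under a custom base
    // directory, instead of the task's work path.
    public static String customPath(String baseDir, String namedOutput, int partition) {
        return baseDir + "/" + namedOutput + "/" + fileName(namedOutput, partition);
    }

    public static void main(String[] args) {
        System.out.println(customPath("/data/out", "Value", 0));
    }
}
```

One thing to keep in mind with this approach: files written outside the work path are probably not moved or cleaned up by the output committer, so failed or retried task attempts may need handling of their own.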

Runner:
        job.setMapOutputKeyClass(NullWritable.class);
        AvroJob.setMapOutputValueSchema(job, ValidatedValue.SCHEMA$);
        job.setReducerClass(ValueReducer.class);

        // use custom directory subclass here
        AvroMultipleOutputs.addNamedOutput(job, "Value",
                AvroKeyOutputFormat.class, Value.SCHEMA$);
        AvroMultipleOutputs.addNamedOutput(job, "Invalid",
                AvroKeyOutputFormat.class, Schema.create(Schema.Type.STRING));

Reducer:

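import java.io.IOException;

import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapred.AvroValue;
import org.apache.avro.mapreduce.AvroMultipleOutputs;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Reducer;

// ValidatedValue and Value are application-specific classes (imports omitted).
public class ValueReducer
        extends Reducer<NullWritable, AvroValue<ValidatedValue>, WritableComparable, NullWritable> {
    private AvroMultipleOutputs multipleOutputs;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        multipleOutputs = new AvroMultipleOutputs(context);
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Closing flushes and closes the writers for all named outputs.
        multipleOutputs.close();
    }

    @Override
    protected void reduce(NullWritable key, Iterable<AvroValue<ValidatedValue>> validatedValues,
            Context context) throws IOException, InterruptedException {
        for (AvroValue<ValidatedValue> wrapped : validatedValues) {
            // AvroValue is only a wrapper; datum() returns the record itself.
            ValidatedValue validatedValue = wrapped.datum();
            multipleOutputs.write("Value", new AvroKey<Value>(validatedValue.getValue()));
            if (validatedValue.isInvalid()) {
                multipleOutputs.write("Invalid",
                        new AvroKey<String>(validatedValue.getValidationError()));
            }
        }
    }
}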


Hope that helps



On Thu, Jan 16, 2014 at 6:57 AM, ed <edorsey@gmail.com> wrote:

> Hello,
>
> Per the documentation on the Cloudera site here:
>
>
> http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-Installation-Guide/cdh4ig_topic_26_5.html
>
> I've been using Avro 1.7.3 with my version of CDH 4.5.0 and can't figure
> out how to completely customize the output name and path.
>
> In 1.7.5 it looks like you can specify a new basePath (and I'm assuming
> filename) in the reducer using the call
>
>     AvroMultipleOutputs.collect(namedOutput, reporter, schema, datum, baseOutputPath)
>
>
> Unfortunately 1.7.3 does not have this method.  Does anyone have an
> example of using AvroMultipleOutputs in Avro 1.7.3 to write to different
> files in different directories from the Reducer?
>
> The closest I've gotten so far is to have different files show up in the
> same directory, where the name is based on the namedOutput value.  For
> example if I make the call:
>
>     amos.getCollector("logdata1", reporter).collect(log); // amos is the AvroMultipleOutputs object
>
>
> I end up with a file in my job output directory called
> "logdata1-r-00000.avro"
>
> Is it possible to customize beyond this?  Specifically, I'd like to set
> the output file to be in a different directory using Avro 1.7.3.
>
> Thank you!
>
> Best Regards,
>
> Ed
