avro-user mailing list archives

From ed <edor...@gmail.com>
Subject Re: Avro MapReduce (MR1): AvroMultipleOutputs using Avro 1.7.3 and CDH 4.5.0
Date Fri, 17 Jan 2014 08:09:51 GMT
Hi Eric,

Thank you for the assistance.  Just to confirm: it looks like we have to use
the mapreduce API (rather than mapred) to get AvroMultipleOutputs working if we
want to customize the name and directory of the output files.  Is that correct?
With the mapred API, AvroMultipleOutputs.addNamedOutput takes AvroOutputFormat
as the class instead of AvroKeyOutputFormat, and I see no way to customize the
directory by extending AvroOutputFormat.
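
For anyone following along, here is a minimal sketch of the path choice that
an AvroKeyOutputFormat subclass overriding getAvroFileOutputStream would have
to make.  It is plain Java and only models the logic: it does not call Hadoop
or Avro, and the class name, method names, and directory names are all
hypothetical.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/**
 * Plain-Java model of the path decision discussed in this thread.
 * Nothing here is Hadoop or Avro API; every name is hypothetical.
 */
public class NamedOutputPathSketch {

    /** Builds a file name such as "logdata1-r-00000.avro" (the pattern seen in the job output). */
    public static String uniqueFile(String namedOutput, char taskType, int partition, String extension) {
        return String.format("%s-%c-%05d%s", namedOutput, taskType, partition, extension);
    }

    /**
     * Resolves the file under customDir when one is configured; otherwise
     * falls back to the task's work path, which is the default behaviour.
     */
    public static Path resolve(Path workPath, String customDir, String fileName) {
        Path base = (customDir == null || customDir.isEmpty()) ? workPath : Paths.get(customDir);
        return base.resolve(fileName);
    }

    public static void main(String[] args) {
        String name = uniqueFile("Value", 'r', 0, ".avro");
        Path work = Paths.get("job-output", "_temporary", "attempt_0"); // made-up work path
        System.out.println(resolve(work, null, name));            // default location
        System.out.println(resolve(work, "valid-records", name)); // redirected location
    }
}
```

In a real subclass the custom directory would presumably be read from the job
configuration via the TaskAttemptContext.  One caveat, as far as I understand
it: files written outside the work path are not managed by the output
committer, so failed or speculative task attempts may leave partial files
behind.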

Thanks!

Best Regards,

Ed


On Fri, Jan 17, 2014 at 5:00 AM, Eric Turcotte <luminary@gmail.com> wrote:

> I didn't have much luck using Avro 1.7.3.  There was a bugfix that went
> into 1.7.4 that made the multiple outputs easier to use.  From 1.7.3 to
> 1.7.4 the methods changed quite a bit.  The example below worked for me,
> and the rest of this info assumes you are using 1.7.4.
>
> As for outputting to different directories, I think this would be possible
> if you subclass AvroKeyOutputFormat and override the protected method
> getAvroFileOutputStream.  When the writer is initialized, this method
> creates the output stream with the work path as the directory and
> namedOutput-r-00000.avro as the filename.  You could probably write to an
> entirely different path.  Since getAvroFileOutputStream takes
> the TaskAttemptContext, the target path could easily be made configurable.
>
> Runner:
>         job.setMapOutputKeyClass(NullWritable.class);
>         AvroJob.setMapOutputValueSchema(job, ValidatedValue.SCHEMA$);
>         job.setReducerClass(ValueReducer.class);
>
>         // use custom directory subclass here
>         AvroMultipleOutputs.addNamedOutput(job, "Value",
>                 AvroKeyOutputFormat.class, Value.SCHEMA$);
>         AvroMultipleOutputs.addNamedOutput(job, "Invalid",
>                 AvroKeyOutputFormat.class, Schema.create(Schema.Type.STRING));
>
> Reducer:
>
> public class ValueReducer
>         extends Reducer<NullWritable, AvroValue<ValidatedValue>,
>                 WritableComparable, NullWritable> {
>     private AvroMultipleOutputs multipleOutputs;
>
>     @Override
>     protected void setup(Context context)
>             throws IOException, InterruptedException {
>         multipleOutputs = new AvroMultipleOutputs(context);
>     }
>
>     @Override
>     protected void cleanup(Context context)
>             throws IOException, InterruptedException {
>         // closing flushes and closes every named output
>         multipleOutputs.close();
>     }
>
>     @Override
>     protected void reduce(NullWritable key,
>             Iterable<AvroValue<ValidatedValue>> validatedValues,
>             Context context)
>             throws IOException, InterruptedException {
>         for (AvroValue<ValidatedValue> validatedValue : validatedValues) {
>             // AvroValue only exposes datum(); unwrap before calling record methods
>             ValidatedValue datum = validatedValue.datum();
>             multipleOutputs.write("Value",
>                     new AvroKey<Value>(datum.getValue()));
>             if (datum.isInvalid()) {
>                 multipleOutputs.write("Invalid",
>                         new AvroKey<String>(datum.getValidationError()));
>             }
>         }
>     }
> }
>
>
> Hope that helps
>
>
>
> On Thu, Jan 16, 2014 at 6:57 AM, ed <edorsey@gmail.com> wrote:
>
>> Hello,
>>
>> Per the documentation on the Cloudera site here:
>>
>>
>> http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-Installation-Guide/cdh4ig_topic_26_5.html
>>
>> I've been using Avro 1.7.3 with my version of CDH 4.5.0 and can't figure
>> out how to completely customize the output name and path.
>>
>> In 1.7.5 it looks like you can specify a new basePath (and I'm assuming
>> filename) in the reducer using the call
>>
>> AvroMultipleOutputs.collect(namedOutput, reporter, schema, datum,
>>> baseOutputPath)
>>
>>
>> Unfortunately 1.7.3 does not have this method.  Does anyone have an
>> example of using AvroMultipleOutputs in Avro 1.7.3 to write to different
>> files in different directories from the Reducer?
>>
>> The closest I've gotten so far is to have different files show up in
>> the same directory where the name is based on the namedOutput value.  For
>> example if I make the call:
>>
>> // amos is the AvroMultipleOutputs object
>> amos.getCollector("logdata1", reporter).collect(log);
>>
>>
>> I end up with a file in my job output directory called
>> "logdata1-r-00000.avro"
>>
>> Is it possible to customize beyond this?  Specifically, I'd like to set
>> the output file to be in a different directory using Avro 1.7.3.
>>
>> Thank you!
>>
>> Best Regards,
>>
>> Ed
>>
>>
>>
>>
>>
>
