avro-user mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: how to specify MultipleOutputs, MultipleInputs in using Avro mapred API
Date Wed, 18 Aug 2010 17:37:22 GMT
On 08/18/2010 10:18 AM, ey-chih chow wrote:
> Thanks. But by doing this way, what kind of advantage we can get from Avro?

The Avro MapReduce API is easiest to use when both inputs and outputs 
are Avro data.

If inputs are not Avro data, but you want to use the rest of the Avro MR 
API, then you'd need to write an InputFormat that produces an 
AvroWrapper<T> where T is a type that Avro can serialize.
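As a rough illustration of such an InputFormat, here is a minimal sketch that reads plain text lines and hands each one to the job as an AvroWrapper<Utf8>. The class names LineAsAvroInputFormat and LineAsAvroRecordReader are hypothetical, and delegating to Hadoop's LineRecordReader (old org.apache.hadoop.mapred API) is just one assumed way to do it, not something that ships with Avro.

```java
import java.io.IOException;

import org.apache.avro.mapred.AvroWrapper;
import org.apache.avro.util.Utf8;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

/** Hypothetical InputFormat: each text line becomes an AvroWrapper<Utf8> key. */
public class LineAsAvroInputFormat
    extends FileInputFormat<AvroWrapper<Utf8>, NullWritable> {

  @Override
  public RecordReader<AvroWrapper<Utf8>, NullWritable> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    reporter.setStatus(split.toString());
    return new LineAsAvroRecordReader(job, (FileSplit) split);
  }

  /** Hypothetical reader that wraps Hadoop's line reader. */
  static class LineAsAvroRecordReader
      implements RecordReader<AvroWrapper<Utf8>, NullWritable> {

    private final LineRecordReader lines;  // does the actual split-aware reading
    private final LongWritable offset;     // byte offset of the line, unused here
    private final Text line;               // reusable buffer for the line text

    LineAsAvroRecordReader(JobConf job, FileSplit split) throws IOException {
      this.lines = new LineRecordReader(job, split);
      this.offset = lines.createKey();
      this.line = lines.createValue();
    }

    @Override
    public boolean next(AvroWrapper<Utf8> key, NullWritable value)
        throws IOException {
      if (!lines.next(offset, line)) {
        return false;                      // end of this split
      }
      key.datum(new Utf8(line.toString())); // T = Utf8, an Avro string
      return true;
    }

    @Override
    public AvroWrapper<Utf8> createKey() {
      return new AvroWrapper<Utf8>(null);
    }

    @Override
    public NullWritable createValue() {
      return NullWritable.get();
    }

    @Override
    public long getPos() throws IOException {
      return lines.getPos();
    }

    @Override
    public float getProgress() throws IOException {
      return lines.getProgress();
    }

    @Override
    public void close() throws IOException {
      lines.close();
    }
  }
}
```

A job would presumably register this with JobConf.setInputFormat(LineAsAvroInputFormat.class) and declare a string input schema, so that the mapper sees the lines as Avro strings.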

Another alternative might be to first convert your inputs to Avro 
data files.  For example, one can use Avro's 'fromtext' tool to convert 
line-oriented files into equivalent compressed, splittable, Avro data 
files.  This could be done as log files are loaded into HDFS, since this 
tool accepts Hadoop paths as output.
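To give a feel for what such a conversion involves, here is a minimal sketch that writes each line of a local text file as one record in a deflate-compressed Avro data file. It is not the 'fromtext' tool's actual code: the class name LinesToAvro, the choice of a "bytes" schema, the compression level, and the use of local files rather than Hadoop paths are all assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumWriter;

/** Hypothetical converter: one text line in, one Avro "bytes" record out. */
public class LinesToAvro {

  /** Converts textIn to an Avro data file at avroOut; returns the line count. */
  public static long convert(File textIn, File avroOut) throws IOException {
    // Assumed schema: each line is stored as raw UTF-8 bytes.
    Schema schema = Schema.create(Schema.Type.BYTES);
    long count = 0;
    try (BufferedReader in =
             Files.newBufferedReader(textIn.toPath(), StandardCharsets.UTF_8);
         DataFileWriter<ByteBuffer> out = new DataFileWriter<ByteBuffer>(
             new GenericDatumWriter<ByteBuffer>(schema))) {
      // The codec must be chosen before the file is created.
      out.setCodec(CodecFactory.deflateCodec(6));
      out.create(schema, avroOut);
      String line;
      while ((line = in.readLine()) != null) {
        out.append(ByteBuffer.wrap(line.getBytes(StandardCharsets.UTF_8)));
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) throws IOException {
    long n = convert(new File(args[0]), new File(args[1]));
    System.out.println(n + " lines written");
  }
}
```

The data file format compresses block by block and places sync markers between blocks, which is what lets MapReduce split a compressed file across several map tasks.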

We hope to add more tools for this kind of conversion and ingest, e.g.:

https://issues.apache.org/jira/browse/AVRO-458

We also expect that systems like Flume will produce Avro data files.

Doug
