avro-user mailing list archives

From Martin Kleppmann <mar...@rapportive.com>
Subject Re: Mixed Avro/Hadoop Writable pipeline
Date Thu, 04 Jul 2013 14:19:40 GMT
Hadoop's writables use Java's java.io.Data{Input,Output}Stream by default
(see org.apache.hadoop.io.serializer.WritableSerialization). This uses a
fixed-length encoding: 4 bytes for an int, 8 bytes for a long.

Avro-encoded numbers are always variable-length (if you want fixed-length,
use a 'fixed' type in the schema).
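To make the size difference concrete, here is a minimal sketch in plain Java. The zig-zag varint below is hand-rolled for illustration only; in real code Avro's BinaryEncoder performs this encoding, and WritableSerialization uses DataOutputStream.writeInt for IntWritable.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class EncodingSizes {

    // Avro-style int encoding: zig-zag map, then base-128 varint.
    // Small magnitudes (positive or negative) encode in few bytes.
    static byte[] avroVarint(int n) {
        long z = ((long) n << 1) ^ (n >> 31); // zig-zag: -1 -> 1, 1 -> 2, ...
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80)); // low 7 bits + continuation bit
            z >>>= 7;
        }
        out.write((int) z);
        return out.toByteArray();
    }

    // What WritableSerialization does for IntWritable: always 4 bytes.
    static byte[] writableFixed(int n) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        new DataOutputStream(buf).writeInt(n);
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(writableFixed(7).length);      // 4
        System.out.println(avroVarint(7).length);         // 1
        System.out.println(avroVarint(1_000_000).length); // 3
    }
}
```

So for small ints the Avro encoding is usually shorter than IntWritable's fixed 4 bytes, at the cost of a variable length that depends on the value.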


On 4 July 2013 11:14, Dan Filimon <dangeorge.filimon@gmail.com> wrote:

> The documentation for IntWritable doesn't explicitly mention it being
> fixed-length or not [1]. But, given there's also a VIntWritable [2], I
> think IntWritable is always 4 bytes.
> [1]
> http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/IntWritable.html
> [2]
> http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/VIntWritable.html
> On Thu, Jul 4, 2013 at 1:02 PM, Pradeep Gollakota <pradeepg26@gmail.com> wrote:
>> Not sure whether AvroKey<Integer> is 4 bytes or not. But IntWritable is
>> variable length: if the number can be represented in fewer than 4 bytes, it
>> will be.
>> On Jul 4, 2013 2:22 AM, "Dan Filimon" <dangeorge.filimon@gmail.com>
>> wrote:
>>> Well, I got it working eventually. :)
>>> First of all, I'll mention that I'm using the new MapReduce API, so no
>>> AvroMapper/AvroReducer voodoo for me. I'm just using the AvroKey<> and
>>> AvroValue<> wrappers and once I set the right properties using AvroJob's
>>> static methods (AvroJob.setMapOutputValueSchema() for example) and set the
>>> input to be an AvroKeyInputFormat, everything worked out fine.
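For reference, the new-API setup Dan describes might look roughly like this. This is a configuration sketch only: the INT schemas and the job name are placeholder assumptions, not taken from the thread, and the mapper/reducer classes are omitted. AvroJob and AvroKeyInputFormat here are the ones from the org.apache.avro.mapreduce package.

```java
import org.apache.avro.Schema;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyInputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Sketch of a driver fragment; schemas/paths are placeholders.
Job job = Job.getInstance(new Configuration(), "avro-mixed-pipeline");

// Read Avro data files as AvroKey<...> input:
job.setInputFormatClass(AvroKeyInputFormat.class);

// The AvroJob static setters Dan mentions, e.g. for AvroKey<Integer> input
// and an int-valued map output:
AvroJob.setInputKeySchema(job, Schema.create(Schema.Type.INT));
AvroJob.setMapOutputValueSchema(job, Schema.create(Schema.Type.INT));
```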
>>> About the writables, I'm interested to know whether it'd be better to
>>> use Avro equivalent classes: AvroKey<Integer> or IntWritable. I assume
>>> the speed/size of these two should be the same, i.e. 4 bytes?
>>> On Thu, Jul 4, 2013 at 2:48 AM, Martin Kleppmann <martin@rapportive.com> wrote:
>>>> Hi Dan,
>>>> You're stepping off the documented path here, but I think that although
>>>> it might be a bit of work, it should be possible.
>>>> Things to watch out for: you might not be able to use
>>>> AvroMapper/AvroReducer so easily, and you may have to mess around with the
>>>> job conf a bit (Avro-configured jobs use their own shuffle config with
>>>> AvroKeyComparator, which may not be what you want if you're also trying to
>>>> use writables). I'd suggest simply reading the code in
>>>> org.apache.avro.mapred[uce] -- it's not too complicated.
>>>> Whether Avro files or writables (i.e. Hadoop sequence files) are better
>>>> for you depends mostly on which format you'd rather have your data in. If
>>>> you want to read the data files with something other than Hadoop, Avro is
>>>> definitely a good option. Also, Avro data files are self-describing (due
>>>> to their embedded schema), which makes them pleasant to use with tools like Pig
>>>> and Hive.
>>>> Martin
>>>> On 3 July 2013 10:12, Dan Filimon <dangeorge.filimon@gmail.com> wrote:
>>>>> Hi!
>>>>> I'm working on integrating Avro into our data processing pipeline.
>>>>>  We're using quite a few standard Hadoop and Mahout writables
>>>>> (IntWritable, VectorWritable).
>>>>> I'm first going to replace the custom Writables with Avro, but in
>>>>> terms of the other ones, how important would you say it is to use
>>>>> AvroKey<Integer> instead of IntWritable for example?
>>>>> The changes will happen gradually but are they even worth it?
>>>>> Thanks!
