hadoop-mapreduce-user mailing list archives

From Joey Echeverria <j...@cloudera.com>
Subject Re: How does Hadoop reuse the objects?
Date Thu, 04 Aug 2011 21:16:00 GMT
Wow, I didn't expect that. That's nastier than usual. I would think
that cloning by serializing/deserializing would be unnecessarily slow.
I would file a JIRA with Avro asking for a clone() or copy constructor
in generated code.
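
A minimal, JDK-only sketch of the copy-constructor idea follows. Everything in it is hypothetical: Foo and Bar are hand-written stand-ins for the generated classes, and ReusingIterator only imitates the reuse behaviour described in this thread (one FOO instance and its BAR elements refilled on every next() call). It is not Hadoop or Avro code. It shows why keeping the returned reference goes wrong and how a deep copy avoids that.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

class ReuseDemo {
    // Hypothetical stand-in for a generated BAR class.
    static class Bar {
        int value;
        Bar(int value) { this.value = value; }
        Bar(Bar other) { this.value = other.value; }  // copy constructor
    }

    // Hypothetical stand-in for a generated FOO class.
    static class Foo {
        int a;
        final List<Bar> barList = new ArrayList<>();
        Foo() {}
        Foo(Foo other) {                               // deep copy constructor
            this.a = other.a;
            for (Bar b : other.barList) {
                this.barList.add(new Bar(b));
            }
        }
    }

    // Imitates an iterator that refills one Foo, and its existing Bars, per call.
    static class ReusingIterator implements Iterator<Foo> {
        private final int[][] records;  // each row: {a, bar values...}
        private final Foo reused = new Foo();
        private int pos = 0;

        ReusingIterator(int[][] records) { this.records = records; }

        @Override public boolean hasNext() { return pos < records.length; }

        @Override public Foo next() {
            if (!hasNext()) throw new NoSuchElementException();
            int[] rec = records[pos++];
            reused.a = rec[0];
            int n = rec.length - 1;
            // Drop surplus Bars, overwrite the ones that exist, append the rest.
            while (reused.barList.size() > n) {
                reused.barList.remove(reused.barList.size() - 1);
            }
            for (int i = 0; i < n; i++) {
                if (i < reused.barList.size()) {
                    reused.barList.get(i).value = rec[i + 1];
                } else {
                    reused.barList.add(new Bar(rec[i + 1]));
                }
            }
            return reused;
        }
    }

    // Collects every element, either by reference or as a deep copy.
    static List<Foo> collect(Iterator<Foo> it, boolean copy) {
        List<Foo> kept = new ArrayList<>();
        while (it.hasNext()) {
            Foo f = it.next();
            kept.add(copy ? new Foo(f) : f);
        }
        return kept;
    }

    public static void main(String[] args) {
        int[][] records = {{1, 10, 11}, {2, 20, 21, 22}};
        List<Foo> naive = collect(new ReusingIterator(records), false);
        List<Foo> safe = collect(new ReusingIterator(records), true);
        System.out.println(naive.get(0) == naive.get(1));  // true: same instance
        System.out.println(naive.get(0).a);                // 2: first record was overwritten
        System.out.println(safe.get(0).a + " " + safe.get(0).barList.size());  // 1 2
    }
}
```

The copy has to be deep: copying only the Foo and sharing its barList would still alias the Bar elements, which is the case reported below.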

-Joey

On Thu, Aug 4, 2011 at 5:07 PM, Vyacheslav Zholudev
<vyacheslav.zholudev@gmail.com> wrote:
> Just sharing today's discovery:
> Hadoop also reuses objects in internal lists, in my example the BAR objects.
> That is, if the first FOO object has two BAR objects in its list, then the
> second FOO object will contain the same (equal by reference) first two BAR
> objects in its list. So in the case of Avro it would be good if the
> auto-generated code implemented a 'clone' method.
> By the way, is it a good idea to clone Avro-specific objects by
> serializing/deserializing with SpecificDatum{Writer|Reader}?
> Vyacheslav
>
> On 4 August 2011 21:35, <Milind.Bhandarkar@emc.com> wrote:
>>
>> HADOOP-2399 has caused a lot of problems for users so far, and the saga
>> still continues :-(
>>
>> I remember spending 18 straight hours in 2008 with a user debugging this
>> issue.
>>
>> - milind
>>
>> ---
>> Milind Bhandarkar
>> Greenplum Labs, EMC
>> (Disclaimer: Opinions expressed in this email are those of the author, and
>> do not necessarily represent the views of any organization, past or present,
>> the author might be affiliated with.)
>>
>>
>>
>>
>> On 8/3/11 4:19 AM, "Joey Echeverria" <joey@cloudera.com> wrote:
>>
>> >Hadoop reuses objects as an optimization. If you need to keep a copy
>> >in memory, you need to call clone yourself. I've never used Avro, but
>> >my guess is that the BARs are not reused, only the FOO.
>> >
>> >-Joey
>> >
>> >On Wed, Aug 3, 2011 at 3:18 AM, Vyacheslav Zholudev
>> ><vyacheslav.zholudev@gmail.com> wrote:
>> >> Hi all,
>> >>
>> >> I'm using Avro as a serialization format. Assume I have a generated
>> >>specific class FOO that I use as the Mapper output type:
>> >>
>> >> class FOO {
>> >>  int a;
>> >>  List<BAR> barList;
>> >> }
>> >>
>> >> where BAR is another generated specific Java class.
>> >>
>> >> When I iterate over "Iterable<FOO> values" in the Reducer it is clear
>> >>that the same object of class FOO is reused, i.e.
>> >> Iterator<FOO> it = values.iterator();
>> >> FOO foo1 = it.next();
>> >> FOO foo2 = it.next();
>> >> assertThat(foo1 == foo2, is(true));
>> >>
>> >> So I have the following questions:
>> >> 1) Is the list barList reused over the next() calls?
>> >> 2) If yes, can the objects that are in the barList be reused? For
>> >>example, if the first time next() is called, the list contains two BAR
>> >>objects, the next time next() is called the barList contains 3 objects
>> >>and 2 of them are equal by reference to the two from the list of the
>> >>first next() call. In other words, does Hadoop maintain some sort of
>> >>"object pool"?
>> >> 3) Why doesn't Avro Tools generate clone() methods, since that would be
>> >>quite straightforward and, more importantly, useful given that objects are
>> >>reused?
>> >>
>> >> Thanks a lot in advance!
>> >>
>> >> Vyacheslav
>> >>
>> >>
>> >>
>> >>
>> >
>> >
>> >
>> >--
>> >Joseph Echeverria
>> >Cloudera, Inc.
>> >443.305.9434
>> >
>>
>
>
>
> --
> Best,
> Vyacheslav Zholudev
>



-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434
