hadoop-user mailing list archives

From Alberto Cordioli <cordioli.albe...@gmail.com>
Subject Re: GroupingComparator
Date Tue, 16 Oct 2012 08:42:09 GMT
Thanks Dave.
You solved my problem. Just a quick question about your tip:
I suppose the value returned by iterator.next() is re-used as well.
So if I want to store some values from the Iterable in the reducer, I
should create a List and put cloned objects in it.
In this case there is no way to avoid the "new" operator, right?
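
The question above can be illustrated without a cluster. The sketch below
is plain Java, not Hadoop code: MutableInt and reusingIterator are
hypothetical stand-ins for a Writable value and for the reducer's value
iterator, assumed here to hand back the same instance on every next()
call. It shows why storing the returned reference fails and why a copy
per stored value works.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

class ReuseSketch {
    /** Hypothetical stand-in for a mutable Writable such as IntWritable. */
    static final class MutableInt {
        private int value;
        MutableInt(int value) { this.value = value; }
        int get() { return value; }
        void set(int value) { this.value = value; }
    }

    /** Returns the SAME instance from every next() call (assumed reuse behaviour). */
    static Iterator<MutableInt> reusingIterator(final int... data) {
        final MutableInt shared = new MutableInt(0);
        return new Iterator<MutableInt>() {
            private int i = 0;
            @Override public boolean hasNext() { return i < data.length; }
            @Override public MutableInt next() { shared.set(data[i++]); return shared; }
        };
    }

    /** Stores the references: every list entry is the one shared object. */
    static List<Integer> keepReferences(int... data) {
        List<MutableInt> kept = new ArrayList<>();
        Iterator<MutableInt> it = reusingIterator(data);
        while (it.hasNext()) {
            kept.add(it.next());
        }
        return unwrap(kept);
    }

    /** Stores a fresh copy per value, so later next() calls cannot overwrite it. */
    static List<Integer> keepCopies(int... data) {
        List<MutableInt> kept = new ArrayList<>();
        Iterator<MutableInt> it = reusingIterator(data);
        while (it.hasNext()) {
            kept.add(new MutableInt(it.next().get()));
        }
        return unwrap(kept);
    }

    private static List<Integer> unwrap(List<MutableInt> items) {
        List<Integer> out = new ArrayList<>();
        for (MutableInt m : items) {
            out.add(m.get());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(keepReferences(1, 2, 3)); // prints [3, 3, 3]
        System.out.println(keepCopies(1, 2, 3));     // prints [1, 2, 3]
    }
}
```

So for values that must outlive the loop, one allocation per stored value
seems hard to avoid. In real reducer code the copy would probably be made
with something like WritableUtils.clone(value, conf), or, if only a
primitive or a String is needed, by storing value.get() or
value.toString() instead of a whole Writable.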



On 15 October 2012 22:49, Dave Beech <dbeech@apache.org> wrote:
> Well, if all you need is the tag (the 1 or 2), why not just use a Text
> or IntWritable instance variable? You wouldn't need to clone the whole
> key.
>
> Then, instead of tag = key.getSecondField() you'd say
> tag.set(key.getSecondField().get());
> I don't know what type of object tag is (if it's Text you'll say
> toString() rather than get()), but you see what I mean.
>
> Also - just a tip - try to avoid creating new objects wherever
> possible. You'll get better performance if you create one Text object
> as an instance variable and re-use it by setting the value instead of
> calling new Text("") on every output.
>
> Thanks,
> Dave
>
> On 15 October 2012 21:39, Alberto Cordioli <cordioli.alberto@gmail.com> wrote:
>> Hi Dave,
>>
>> thanks for your reply. Now it's clearer; in fact the code that I
>> wrote is modelled on the old API, where the behavior is different.
>> So, how can I achieve the same behavior as the old API? I need the
>> second field of the first key object to stay the same across the
>> iterations, in order to compare it with other objects. Do I have to
>> clone the object?
>>
>>
>> Thanks.
>>
>> On 15 October 2012 21:27, Dave Beech <dbeech@apache.org> wrote:
>>> Hi Alberto
>>>
>>> The iterator you are looping over in your reduce method isn't a
>>> self-contained list of values. What's actually happening is that
>>> you're iterating through *part* of the sorted key/value set that was
>>> sent to that reduce node, and it is the grouping comparator that
>>> decides when to break that loop and call reduce again on the next key.
>>>
>>> Moreover, the "key" object is re-used. So, as you're iterating through
>>> the values, what's actually happening is this pointer to the
>>> associated key data moves with it - and you're seeing it change.
>>>
>>> This only happens in the new "mapreduce" API - in the older "mapred"
>>> API you get the first key, and it appears to stay the same during the
>>> loop.
>>>
>>> It's sometimes useful behaviour, but it's confusing how the two APIs
>>> don't act the same.
>>>
>>> Hope that helps,
>>> Dave
>>>
>>> On 15 October 2012 20:11, Alberto Cordioli <cordioli.alberto@gmail.com> wrote:
>>>> Hi all,
>>>>
>>>> a very strange thing is happening with my Hadoop program.
>>>> My map simply emits tuples with a custom object as key (which
>>>> implements WritableComparable).
>>>> The object is made of 2 fields, and I implemented my partitioner and
>>>> grouping comparator in such a way that only the first field is taken
>>>> into account.
>>>> The second field is just a tag and can be 1 or 2.
>>>>
>>>> This is the reducer's snippet:
>>>>
>>>> tag = key.getSecondField();
>>>> Iterator it1 = values.iterator();
>>>> while(it1.hasNext()){
>>>>         it1.next();
>>>>         collector.emit(new Text("dummy"), tag);
>>>> }
>>>>
>>>> I would expect in my output all the lines with:
>>>> dummy       1
>>>> ...
>>>> dummy       1
>>>>
>>>> but actually the value of tag changes over time and I get this kind of output:
>>>>
>>>> dummy    1
>>>> ...
>>>> dummy    1
>>>> dummy    2
>>>> ...
>>>> dummy    2
>>>>
>>>>
>>>> Could someone explain why, please?
>>>>
>>>>
>>>> Thanks.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Alberto Cordioli
>>
>>
>>
>> --
>> Alberto Cordioli



-- 
Alberto Cordioli
