hadoop-user mailing list archives

From Alberto Cordioli <cordioli.albe...@gmail.com>
Subject Re: GroupingComparator
Date Tue, 16 Oct 2012 09:45:52 GMT
Yes, I know that keeping an in-memory collection isn't a good idea.
The problem is that I need to perform a join, so there is no other
possibility! :(

Cheers,
Alberto

On 16 October 2012 11:08, Dave Beech <dbeech@apache.org> wrote:
> Great! Glad the problem is solved.
>
> You're right - the object returned by iterator.next() is re-used too.
> So yes, you would need to clone in this case and you'd have no choice
> but to create new objects.
>
> Please be sure though that you really do need to store values in a
> list to do what you're trying to do. Keeping an in-memory collection
> might not be very scalable. Obviously, if you've got loads of RAM or
> not a lot of data (or both), then that's fine! Just something else to
> think about...
>
> Cheers,
> Dave
>
> On 16 October 2012 09:42, Alberto Cordioli <cordioli.alberto@gmail.com> wrote:
>> Thanks Dave.
>> You solved my problem. Just a little question about your tip:
>> I suppose the value returned by iterator.next() is also re-used.
>> So if I want to store some values of the Iterable list in the reducer, I
>> should create a List and put cloned objects inside it.
>> In this case there is no possibility to avoid the "new" operator, right?
>>
>>
>>
>> On 15 October 2012 22:49, Dave Beech <dbeech@apache.org> wrote:
>>> Well, if all you need is the tag (the 1 or 2), why not just use a Text
>>> or IntWritable instance variable? You wouldn't need to clone the whole
>>> key.
>>>
>>> Then, instead of tag = key.getSecondField() you'd say
>>> tag.set(key.getSecondField().get());
>>> I don't know what type of object tag is (if it's Text you'll say
>>> toString() rather than get()), but you see what I mean.
>>>
>>> Also - just a tip - try to avoid creating new objects wherever
>>> possible. You'll get better performance if you create one Text object
>>> as an instance variable and re-use it by setting the value instead of
>>> calling new Text("") on every output.
>>>
>>> Thanks,
>>> Dave
>>>
>>> On 15 October 2012 21:39, Alberto Cordioli <cordioli.alberto@gmail.com> wrote:
>>>> Hi Dave,
>>>>
>>>> thanks for your reply. Now it's clearer; in fact the code that I
>>>> wrote was inspired by the old API, where the behavior is different.
>>>> So, how can I achieve the same behavior as the old API? I need the
>>>> second field of the first key object to stay the same among the
>>>> iterations, in order to compare it with other objects. Do I have to
>>>> clone the object?
>>>>
>>>>
>>>> Thanks.
>>>>
>>>> On 15 October 2012 21:27, Dave Beech <dbeech@apache.org> wrote:
>>>>> Hi Alberto
>>>>>
>>>>> The iterator you are looping over in your reduce method isn't a
>>>>> self-contained list of values. What's actually happening is that
>>>>> you're iterating through *part* of the sorted key/value set that was
>>>>> sent to that reduce node, and it is the grouping comparator that
>>>>> decides when to break that loop and call reduce again on the next key.
>>>>>
>>>>> Moreover, the "key" object is re-used. So, as you're iterating through
>>>>> the values, what's actually happening is this pointer to the
>>>>> associated key data moves with it - and you're seeing it change.
>>>>>
>>>>> This only happens in the new "mapreduce" API - in the older "mapred"
>>>>> API you get the first key, and it appears to stay the same during the
>>>>> loop.
>>>>>
>>>>> It's sometimes useful behaviour, but it's confusing how the two APIs
>>>>> don't act the same.
>>>>>
>>>>> Hope that helps,
>>>>> Dave
>>>>>
>>>>> On 15 October 2012 20:11, Alberto Cordioli <cordioli.alberto@gmail.com> wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> a very strange thing is happening with my Hadoop program.
>>>>>> My map simply emits tuples with a custom object as key (which
>>>>>> implements WritableComparable).
>>>>>> The object is made of 2 fields, and I implemented my partitioner and
>>>>>> grouping comparator in such a way that only the first field is taken
>>>>>> into account.
>>>>>> The second field is just a tag and can be 1 or 2.
>>>>>>
>>>>>> This is the reducer's snippet:
>>>>>>
>>>>>> tag = key.getSecondField();
>>>>>> Iterator it1 = values.iterator();
>>>>>> while(it1.hasNext()){
>>>>>>         it1.next();
>>>>>>         collector.emit(new Text("dummy"), tag);
>>>>>> }
>>>>>>
>>>>>> I would expect in my output all the lines with:
>>>>>> dummy       1
>>>>>> ...
>>>>>> dummy       1
>>>>>>
>>>>>> but actually the value of tag changes over time and I obtain this
>>>>>> type of output:
>>>>>>
>>>>>> dummy    1
>>>>>> ...
>>>>>> dummy    1
>>>>>> dummy    2
>>>>>> ...
>>>>>> dummy    2
>>>>>>
>>>>>>
>>>>>> Could someone explain why, please?
>>>>>>
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Alberto Cordioli
>>>>
>>>>
>>>>
>>>> --
>>>> Alberto Cordioli
>>
>>
>>
>> --
>> Alberto Cordioli



-- 
Alberto Cordioli
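
Note on the thread above: the object re-use it describes can be reproduced in plain Java, without Hadoop. The sketch below is only an illustration under stated assumptions: Cell and ReusingIterator are hypothetical stand-ins for a re-used Writable and for the reducer's values iterator, not Hadoop classes. It shows why a tag held by reference changes during the loop, why copying the primitive first keeps it stable, and why buffered values must be copied.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical stand-ins only: Cell and ReusingIterator are NOT Hadoop classes.
class ReuseDemo {

    /** Mutable holder, playing the role of a re-used Writable. */
    static final class Cell {
        int value;
        Cell(int value) { this.value = value; }
        Cell copy() { return new Cell(value); }
    }

    /** Refills one shared Cell on every next() and returns that same object. */
    static final class ReusingIterator implements Iterator<Cell> {
        private final int[] records;
        private final Cell shared;
        private int pos = 0;

        ReusingIterator(int[] records, Cell shared) {
            this.records = records;
            this.shared = shared;
        }

        @Override public boolean hasNext() { return pos < records.length; }

        @Override public Cell next() {
            if (!hasNext()) throw new NoSuchElementException();
            shared.value = records[pos++];
            return shared;
        }
    }

    /** Keeps only a reference to the key's tag, so the tag changes while iterating. */
    static List<Integer> aliasedTags(int[] tags) {
        Cell keyTag = new Cell(tags[0]);   // tag of the first key in the group
        Iterator<Cell> it = new ReusingIterator(tags, keyTag);
        Cell tag = keyTag;                 // reference, like tag = key.getSecondField()
        List<Integer> out = new ArrayList<>();
        while (it.hasNext()) { it.next(); out.add(tag.value); }
        return out;
    }

    /** Copies the primitive before the loop, so the first tag is preserved. */
    static List<Integer> copiedTags(int[] tags) {
        Cell keyTag = new Cell(tags[0]);
        Iterator<Cell> it = new ReusingIterator(tags, keyTag);
        int tag = keyTag.value;            // copy of the value, like tag.set(...get())
        List<Integer> out = new ArrayList<>();
        while (it.hasNext()) { it.next(); out.add(tag); }
        return out;
    }

    /** Buffers values; without a copy every list entry is the same shared object. */
    static List<Integer> buffered(int[] values, boolean copy) {
        Iterator<Cell> it = new ReusingIterator(values, new Cell(0));
        List<Cell> kept = new ArrayList<>();
        while (it.hasNext()) {
            Cell v = it.next();
            kept.add(copy ? v.copy() : v);
        }
        List<Integer> out = new ArrayList<>();
        for (Cell c : kept) out.add(c.value);
        return out;
    }

    public static void main(String[] args) {
        int[] tags = {1, 1, 2, 2};
        int[] values = {10, 20, 30};
        System.out.println(aliasedTags(tags));        // [1, 1, 2, 2]
        System.out.println(copiedTags(tags));         // [1, 1, 1, 1]
        System.out.println(buffered(values, false));  // [30, 30, 30]
        System.out.println(buffered(values, true));   // [10, 20, 30]
    }
}
```

In a real reducer the copy would be made on the Writable itself (for example by setting an instance variable, as suggested in the thread), and a buffered list of copies still has to fit in memory.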
