avro-user mailing list archives

From Vyacheslav Zholudev <vyacheslav.zholu...@gmail.com>
Subject Re: Map output records/reducer input records mismatch
Date Wed, 17 Aug 2011 22:59:36 GMT
There is a possible reason: it seems that there is an upper limit of 10,001
records per reduce input group (or is there a setting that controls this?).


If I output one million rows with the same key, I get:
Map output records: 1,000,000
Reduce input groups: 1
Reduce input records: 10,001

If I output one million rows with 20 different keys, I get:
Map output records: 1,000,000
Reduce input groups: 20
Reduce input records: 200,020

If I output one million rows with unique keys, I get:
Map output records: 1,000,000
Reduce input groups: 1,000,000
Reduce input records: 1,000,000


Btw., I am running on 5 nodes with a total map task capacity of 10 and a total
reduce task capacity of 10.
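
For reference, a minimal sketch of the kind of job behind the counts above
(assuming the org.apache.avro.mapred API from Avro 1.5.x; the class names
here are hypothetical):

    import java.io.IOException;
    import org.apache.avro.mapred.AvroCollector;
    import org.apache.avro.mapred.AvroMapper;
    import org.apache.avro.mapred.AvroReducer;
    import org.apache.avro.mapred.Pair;
    import org.apache.avro.util.Utf8;
    import org.apache.hadoop.mapred.Reporter;

    // Emits every input record under one constant key, so all map output
    // should land in a single reduce input group.
    public class SameKeyMapper extends AvroMapper<Utf8, Pair<Utf8, Long>> {
      @Override
      public void map(Utf8 datum, AvroCollector<Pair<Utf8, Long>> collector,
                      Reporter reporter) throws IOException {
        collector.collect(new Pair<Utf8, Long>(new Utf8("constant-key"), 1L));
      }
    }

    // Counts the values that actually reach the reducer for each key;
    // comparing the sum against the "Map output records" counter exposes
    // the apparent 10,001-records-per-group ceiling.
    class GroupSizeReducer extends AvroReducer<Utf8, Long, Pair<Utf8, Long>> {
      @Override
      public void reduce(Utf8 key, Iterable<Long> values,
                         AvroCollector<Pair<Utf8, Long>> collector,
                         Reporter reporter) throws IOException {
        long n = 0;
        for (Long ignored : values) n++;
        collector.collect(new Pair<Utf8, Long>(key, n));
      }
    }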

Thanks,
Vyacheslav

On Aug 17, 2011, at 7:18 PM, Scott Carey wrote:

> On 8/17/11 5:02 AM, "Vyacheslav Zholudev" <vyacheslav.zholudev@gmail.com> wrote:
> 
>> btw,
>> 
>> I was thinking of trying it with Utf8 objects instead of Strings, and I wanted to reuse
>> the same Utf8 object instead of creating a new one from a String upon each map() call.
>> Why doesn't the Utf8 class have a method for setting bytes via a String object?
> 
> 
> We could add that, but it won't help performance much in this case, since the performance
> improvement from reuse has more to do with the underlying byte[] than with the Utf8 object.
> The expensive part of String is the conversion from an underlying char[] to a byte[]
> (Utf8.getBytesFor()), so this would not help much.  It would probably be faster to use String
> directly rather than wrap it with Utf8 each time.
> 
> Rather than have a static method like the one below, I would propose an instance method
> that does the same thing, something like
> 
> public void setValue(String val) {
>    // gets bytes, replaces the private byte array, replaces the cached string;
>    // no System.arraycopy needed.
> }
> 
> which would be much more efficient.  
> 
> 
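A possible shape for that instance method, as a standalone sketch (the field
names bytes, length, and string are assumed for illustration, not taken from
Avro's actual source):

    public void setValue(String val) {
      this.bytes = Utf8.getBytesFor(val);  // one char[] -> byte[] conversion
      this.length = this.bytes.length;     // no System.arraycopy needed
      this.string = val;                   // keep the cached String in sync
    }
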
>> 
>> I created the following code snippet: 
>> 
>>     public static Utf8 reuseUtf8Object(Utf8 container, String strToReuse) {
>>         // Convert the String's char[] to UTF-8 bytes (the expensive step).
>>         byte[] strBytes = Utf8.getBytesFor(strToReuse);
>>         // setByteLength resizes the backing array if needed and invalidates
>>         // the cached string.
>>         container.setByteLength(strBytes.length);
>>         System.arraycopy(strBytes, 0, container.getBytes(), 0, strBytes.length);
>>         return container;
>>     }
>> 
>> Would it be useful to have this code encapsulated in the Utf8 class?
>> 
>> Best,
>> Vyacheslav
>> 
>> On Aug 17, 2011, at 3:56 AM, Scott Carey wrote:
>> 
>>> On 8/16/11 3:56 PM, "Vyacheslav Zholudev" <vyacheslav.zholudev@gmail.com>
>>> wrote:
>>> 
>>>> Hi, Scott,
>>>> 
>>>> thanks for your reply.
>>>> 
>>>>> What Avro version is this happening with? What JVM version?
>>>> 
>>>> We are using Avro 1.5.1 and Sun JDK 6, but the exact version I will have
>>>> to look up.
>>>> 
>>>>> 
>>>>> On a hunch, have you tried adding -XX:-UseLoopPredicate to the JVM args,
>>>>> if it is Sun and JRE 6u21 or later? (Some issues with loop predicates
>>>>> affect Java 6 too, just not as many as in the recent news about Java 7.)
>>>>> 
>>>>> Otherwise, it may well be the same thing as AVRO-782.  Any extra
>>>>> information related to that issue would be welcome.
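
For reference, one way to pass such a flag to the task JVMs under the classic
mapred API is the mapred.child.java.opts property (a sketch; the heap size is
only a placeholder for whatever options the job already sets):

    import org.apache.hadoop.mapred.JobConf;

    public class JobSetup {
      public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Disable loop predication in every task JVM.
        conf.set("mapred.child.java.opts", "-Xmx512m -XX:-UseLoopPredicate");
        // ... set input/output paths and submit the job as usual ...
      }
    }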
>>>> 
>>>> I will have to collect it. In the meantime, do you have any reasonable
>>>> explanation of the issue besides it being something like AVRO-782?
>>> 
>>> What is your key type (the map output schema, the first type argument of Pair)?
>>> Is your key a Utf8 or a String?  I don't have a reasonable explanation at
>>> this point; I haven't looked into it in depth with a good reproducible
>>> case.  I have my suspicions about how recycling of the key works, since Utf8
>>> is mutable and its backing byte[] can end up shared.
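
A standalone illustration of that sharing hazard (not code from Avro's mapred
internals):

    import org.apache.avro.util.Utf8;

    public class SharedBytesDemo {
      public static void main(String[] args) {
        byte[] buf = { 'k', 'e', 'y', '1' };
        Utf8 a = new Utf8(buf);  // wraps buf without copying it
        Utf8 b = new Utf8(buf);  // a second Utf8 sharing the same backing array
        buf[3] = '2';            // "recycle" the buffer for the next key
        // Both now print "key2": a key kept from an earlier record can change
        // silently when its backing bytes are reused.
        System.out.println(a + " " + b);
      }
    }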
>>> 
>>> 
>>> 
>>>> 
>>>> Thanks a lot,
>>>> Vyacheslav
>>>> 
>>>>> 
>>>>> Thanks!
>>>>> 
>>>>> -Scott
>>>>> 
>>>>> 
>>>>> 
>>>>> On 8/16/11 8:39 AM, "Vyacheslav Zholudev"
>>>>> <vyacheslav.zholudev@gmail.com>
>>>>> wrote:
>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> I have multiple Hadoop jobs that use the Avro mapred API.  Only in one
>>>>>> of the jobs do I see a visible mismatch between the number of map
>>>>>> output records and the number of reducer input records.
>>>>>> 
>>>>>> Has anybody encountered such behavior? Can anybody think of possible
>>>>>> explanations for this phenomenon?
>>>>>> 
>>>>>> Any pointers/thoughts are highly appreciated!
>>>>>> 
>>>>>> Best,
>>>>>> Vyacheslav
>>>>> 
>>>>> 
>>>> 
>>>> Best,
>>>> Vyacheslav
>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 

Best,
Vyacheslav



