avro-user mailing list archives

From Scott Carey <scottca...@apache.org>
Subject Re: Map output records/reducer input records mismatch
Date Wed, 17 Aug 2011 17:18:15 GMT
On 8/17/11 5:02 AM, "Vyacheslav Zholudev" <vyacheslav.zholudev@gmail.com>
wrote:

> btw,
> 
> I was thinking of trying it with Utf8 objects instead of Strings, and I wanted
> to reuse the same Utf8 object instead of creating a new one from a String on
> each map() call.
> Why doesn't the Utf8 class have a method for setting its bytes from a String
> object?

We could add that, but it won't help performance much in this case. The
benefit of reuse comes mostly from the underlying byte[], not from the Utf8
object itself, and the expensive part of starting from a String is the
conversion from its char[] to a byte[] (Utf8.getBytesFor()), which still
happens on every call.  It would probably be faster to use String directly
rather than wrap it with Utf8 each time.

Rather than a static method like the one below, I would propose adding an
instance method that does the same thing, something like

public void setValue(String val) {
   // gets bytes, replaces private byte array, replaces cached string; no
   // system array copy.
}

which would be much more efficient.
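
As a rough illustration of that proposal, here is a minimal sketch. It
assumes a hypothetical stand-alone class, ReusableUtf8, which is not Avro's
Utf8; the field and accessor names are invented for the example. It only
shows the shape of the idea: encode once, adopt the new array, keep the
String as the cached value.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a mutable UTF-8 holder; this is NOT Avro's Utf8.
final class ReusableUtf8 {
    private byte[] bytes = new byte[0]; // backing array, replaced on each set
    private int length;                 // number of valid bytes in 'bytes'
    private String cached;              // cached String form, if known

    // Encodes val once and adopts the resulting array directly,
    // so nothing is copied into a previously allocated buffer.
    void setValue(String val) {
        byte[] encoded = val.getBytes(StandardCharsets.UTF_8);
        this.bytes = encoded;
        this.length = encoded.length;
        this.cached = val;
    }

    byte[] getBytes() {
        return bytes;
    }

    int getByteLength() {
        return length;
    }

    @Override
    public String toString() {
        if (cached == null) {
            cached = new String(bytes, 0, length, StandardCharsets.UTF_8);
        }
        return cached;
    }

    public static void main(String[] args) {
        // One holder object reused across several values, as in repeated map() calls.
        ReusableUtf8 key = new ReusableUtf8();
        for (String s : new String[] {"alpha", "caf\u00e9"}) {
            key.setValue(s);
            System.out.println(key + " -> " + key.getByteLength() + " bytes");
        }
    }
}
```

Note that the char[] to byte[] encoding still happens on every setValue()
call; the sketch only avoids the extra copy into an existing buffer.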


> 
> I created the following code snippet:
> 
>     public static Utf8 reuseUtf8Object(Utf8 container, String strToReuse) {
>         byte[] strBytes = Utf8.getBytesFor(strToReuse);
>         container.setByteLength(strBytes.length);
>         System.arraycopy(strBytes, 0, container.getBytes(), 0, strBytes.length);
>         return container;
>     }
> 
> Would it be useful to encapsulate this code in the Utf8 class?
> 
> Best,
> Vyacheslav
> 
> On Aug 17, 2011, at 3:56 AM, Scott Carey wrote:
> 
>> On 8/16/11 3:56 PM, "Vyacheslav Zholudev" <vyacheslav.zholudev@gmail.com>
>> wrote:
>> 
>>> Hi, Scott,
>>> 
>>> thanks for your reply.
>>> 
>>>> What Avro version is this happening with? What JVM version?
>>> 
>>> We are using Avro 1.5.1 and Sun JDK 6, but the exact version I will have
>>> to look up.
>>> 
>>>> 
>>>> On a hunch, have you tried adding -XX:-UseLoopPredicate to the JVM args
>>>> if it is Sun and JRE 6u21 or later? (Some issues in loop predicates
>>>> affect Java 6 too, just not as many as the recent news on Java 7.)
>>>> 
>>>> Otherwise, it may well be the same thing as AVRO-782.  Any extra
>>>> information related to that issue would be welcome.
>>> 
>>> I will have to collect it. In the meantime, do you have any reasonable
>>> explanation for the issue besides it being something like AVRO-782?
>> 
>> What is your key type (map output schema, first type argument of Pair)?
>> Is your key a Utf8 or a String?  I don't have a reasonable explanation at
>> this point; I haven't looked into it in depth with a good reproducible
>> case.  I have my suspicions about how recycling of the key works, since
>> Utf8 is mutable and its backing byte[] can end up shared.
>> 
>> 
>> 
>>> 
>>> Thanks a lot,
>>> Vyacheslav
>>> 
>>>> 
>>>> Thanks!
>>>> 
>>>> -Scott
>>>> 
>>>> 
>>>> 
>>>> On 8/16/11 8:39 AM, "Vyacheslav Zholudev"
>>>> <vyacheslav.zholudev@gmail.com>
>>>> wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> I have multiple Hadoop jobs that use the Avro mapred API.
>>>>> In only one of the jobs do I see a visible mismatch between the number
>>>>> of map output records and the number of reducer input records.
>>>>> 
>>>>> Has anybody encountered such behavior? Can anybody think of possible
>>>>> explanations for this phenomenon?
>>>>> 
>>>>> Any pointers/thoughts are highly appreciated!
>>>>> 
>>>>> Best,
>>>>> Vyacheslav
>>>> 
>>>> 
>>> 
>>> Best,
>>> Vyacheslav
>>> 
>>> 
>>> 
>> 
>> 
> 


