avro-user mailing list archives

From ey-chih chow <eyc...@hotmail.com>
Subject RE: avro object reuse
Date Fri, 10 Jun 2011 19:07:51 GMT

We have many MR jobs running in production, but only one of them shows this kind of behavior.
Is there any specific condition under which corruption will occur?

From: scott@richrelevance.com
To: user@avro.apache.org
Date: Fri, 10 Jun 2011 11:11:55 -0700
Subject: Re: avro object reuse




Corruption can occur in I/O buses and RAM.  Does this tend to fail on the same nodes, or on
any node randomly?  Since it does not fail consistently, that makes me suspect some sort of
corruption even more.
I suggest turning on stack traces for fatal throwables.  This shouldn't hurt production performance
since they don't happen regularly and break the task anyway.
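For what it's worth, one way to capture those traces is simply to wrap the per-record work in the mapper.
This is only a sketch against the old mapred API; the class name, output types, and log message are made
up, not your actual job:

import java.io.IOException;
import org.apache.avro.mapred.AvroWrapper;
import org.apache.avro.util.Utf8;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical mapper that logs fatal throwables with a full stack trace
// before letting the task die.
public class LoggingMapper extends MapReduceBase
    implements Mapper<AvroWrapper<Utf8>, NullWritable, Text, Text> {

  private static final Log LOG = LogFactory.getLog(LoggingMapper.class);

  public void map(AvroWrapper<Utf8> key, NullWritable value,
                  OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    try {
      // ... the real per-record work goes here ...
      out.collect(new Text(key.datum().toString()), new Text(""));
    } catch (Throwable t) {
      LOG.fatal("map() failed on a record", t);  // writes the stack trace to the task log
      if (t instanceof IOException) throw (IOException) t;
      if (t instanceof Error) throw (Error) t;
      throw new RuntimeException(t);
    }
  }
}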
Of the heap dumps seen so far, the primary consumer is byte[], at no more than 300MB.
How large are your Java heaps?
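If it helps, the child heap and an on-OOM heap dump can be requested per job. This is a sketch for
the old JobConf API; the 512m figure and the /tmp path are placeholders, not recommendations:

import org.apache.hadoop.mapred.JobConf;

public class ChildHeapSettings {
  public static void main(String[] args) {
    JobConf conf = new JobConf(ChildHeapSettings.class);
    // Size the child JVM heap and dump it when an OutOfMemoryError is thrown,
    // so the histogram can be inspected offline with jmap or a heap analyzer.
    conf.set("mapred.child.java.opts",
        "-Xmx512m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp");
  }
}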
On 6/10/11 10:53 AM, "ey-chih chow" <eychih@hotmail.com> wrote:

Since this was in production, we did not turn on stack traces.  Also, it is highly unlikely
that any data was corrupted because, if one mapper failed due to running out of memory, the
system started another one and went through all the data.

From: scott@richrelevance.com
To: user@avro.apache.org
Date: Thu, 9 Jun 2011 17:43:02 -0700
Subject: Re: avro object reuse

If the exception is happening while decoding, it could be due to corrupt data.  Avro preallocates
a List to the size encoded in the data, and I've seen corrupted data cause attempted allocations
of arrays too large for the heap.
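Here is a contrived, self-contained illustration of that failure mode.  The byte values are made up,
and the DecoderFactory call may be named differently depending on your Avro version:

import java.util.ArrayList;
import java.util.List;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;

public class CorruptLengthDemo {
  public static void main(String[] args) throws Exception {
    // Stand-in for a damaged record: this zig-zag varint decodes to 2147483647.
    byte[] corrupt = { (byte) 0xFE, (byte) 0xFF, (byte) 0xFF,
                       (byte) 0xFF, (byte) 0x0F };
    BinaryDecoder in = DecoderFactory.get().binaryDecoder(corrupt, null);
    long count = in.readArrayStart();   // the count is taken straight from the stream
    System.out.println("claimed item count: " + count);
    // A reader that preallocates to the claimed size then does roughly this,
    // which throws OutOfMemoryError before any real data is read:
    List<Object> items = new ArrayList<Object>((int) count);
  }
}

With a count like that, the allocation fails long before the decoder discovers the data is malformed,
which is why the error shows up as an out-of-memory condition rather than a decode exception.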
On 6/9/11 4:58 PM, "Scott Carey" <scott@richrelevance.com> wrote:
What is the stack trace on the out of memory exception?

On 6/9/11 4:45 PM, "ey-chih chow" <eychih@hotmail.com> wrote:

We configure more than 100MB for MapReduce to do sorting.  The memory we allocate for other
work in the mapper is actually larger, but, for this job, we always get out-of-memory exceptions
and the job cannot complete.  We are trying to find out if there is a way to avoid this problem.
Ey-Chih Chow 

From: scott@richrelevance.com
To: user@avro.apache.org
Date: Thu, 9 Jun 2011 15:42:10 -0700
Subject: Re: avro object reuse

The most likely candidate for creating many instances of BufferAccessor and ByteArrayByteSource
is BinaryData.compare() and BinaryData.hashCode().  Each call will create one of each (hash)
or two of each (compare).  These are only 32 bytes per instance and quickly become garbage
that is easily cleaned up by the GC.  
The objects below take only 32 bytes each and 8MB in total.  On the other hand, the byte[]'s appear
to be about 24K each on average and are using 100MB.  Is this the size of your configured MapReduce
sort MB?
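For reference, this is roughly where those objects come from, assuming your intermediate keys are
compared with BinaryData during the sort.  A minimal sketch using BinaryData directly, with a made-up
one-field schema:

import org.apache.avro.Schema;
import org.apache.avro.io.BinaryData;

public class CompareSketch {
  public static void main(String[] args) {
    Schema schema = Schema.parse(
        "{\"type\":\"record\",\"name\":\"K\","
        + "\"fields\":[{\"name\":\"id\",\"type\":\"long\"}]}");
    byte[] a = { 0x02 };   // Avro binary encoding of the long 1
    byte[] b = { 0x04 };   // Avro binary encoding of the long 2
    // Each call builds short-lived decoder views over the two byte arrays;
    // those views are the BufferAccessor / ByteArrayByteSource instances
    // showing up in the histogram.
    int cmp = BinaryData.compare(a, 0, b, 0, schema);
    System.out.println(cmp);   // negative, since 1 < 2
  }
}

The sort phase can call the comparator a very large number of times, so seeing hundreds of thousands
of these instances between collections is expected; as noted above, they quickly become garbage.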
On 6/9/11 3:08 PM, "ey-chih chow" <eychih@hotmail.com> wrote:

We did more monitoring.  At one point, we got the following histogram via jmap.  The question
is why there are so many instances of BinaryDecoder$BufferAccessor and BinaryDecoder$ByteArrayByteSource.
How can we avoid this?  Thanks.

Object Histogram:

num       #instances    #bytes  Class description
--------------------------------------------------------------------------
1:              4199    100241168       byte[]
2:              272948  8734336 org.apache.avro.io.BinaryDecoder$BufferAccessor
3:              272945  8734240 org.apache.avro.io.BinaryDecoder$ByteArrayByteSource
4:              2093    5387976 int[]
5:              23762   2822864 * ConstMethodKlass
6:              23762   1904760 * MethodKlass
7:              39295   1688992 * SymbolKlass
8:              2127    1216976 * ConstantPoolKlass
9:              2127    882760  * InstanceKlassKlass
10:             1847    742936  * ConstantPoolCacheKlass
11:             9602    715608  char[]
12:             1072    299584  * MethodDataKlass
13:             9698    232752  java.lang.String
14:             2317    222432  java.lang.Class
15:             3288    204440  short[]
16:             3167    156664  * System ObjArray
17:             2401    57624   java.util.HashMap$Entry
18:             666     53280   java.lang.reflect.Method
19:             161     52808   * ObjArrayKlassKlass
20:             1808    43392   java.util.Hashtable$Entry


From: eychih@hotmail.com
To: user@avro.apache.org
Subject: RE: avro object reuse
Date: Wed, 1 Jun 2011 15:14:03 -0700




We use a lot of toString() calls on the Avro Utf8 object.  Will this cause Jackson calls?  Thanks.
Ey-Chih 

From: scott@richrelevance.com
To: user@avro.apache.org
Date: Wed, 1 Jun 2011 13:38:39 -0700
Subject: Re: avro object reuse

This is great info.
Jackson should only be used once when the file is opened, so this is confusing from that point
of view.  Is something else using Jackson or initializing an Avro JsonDecoder frequently?
 There are over 100000 Jackson DeserializationConfig objects.
Another place that parses the schema is in AvroSerialization.java.  Does the Hadoop getDeserializer()
API method get called once per job, or per record?  If this is called more than once per map
job, it might explain this.
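If it is being called per record, the fix is just to hold on to the parsed Schema.  A minimal sketch of
that kind of reuse (SCHEMA_JSON is a placeholder; Schema.parse is what runs Jackson under the hood):

import org.apache.avro.Schema;

public class SchemaHolder {
  // Placeholder schema; in the real job this would come from the file or the job config.
  private static final String SCHEMA_JSON =
      "{\"type\":\"record\",\"name\":\"Event\","
      + "\"fields\":[{\"name\":\"id\",\"type\":\"string\"}]}";

  // Parsed exactly once per JVM; Schema.parse() runs a Jackson parse on each call,
  // so calling it per record would produce the objects seen in the histogram below.
  private static final Schema SCHEMA = Schema.parse(SCHEMA_JSON);

  public static Schema get() {
    return SCHEMA;
  }
}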
In principle, Jackson is only used by a mapper during initialization.  The below indicates
that this may not be the case or that something outside of Avro is causing a lot of Jackson
JSON parsing. 
Are you using something that converts the Avro data to JSON form?  toString() on most Avro datum
objects will do a lot of work with Jackson, for example.  But the objects below are deserializer
objects, not serializer objects, so that is not likely the issue.
On 6/1/11 11:34 AM, "ey-chih chow" <eychih@hotmail.com> wrote:

We ran jmap on one of our mappers and found the top usage as follows:
num       #instances    #bytes  Class description
--------------------------------------------------------------------------
1:              24405   291733256       byte[]
2:              6056    40228984        int[]
3:              388799  19966776        char[]
4:              101779  16284640        org.codehaus.jackson.impl.ReaderBasedParser
5:              369623  11827936        java.lang.String
6:              111059  8769424         java.util.HashMap$Entry[]
7:              204083  8163320         org.codehaus.jackson.impl.JsonReadContext
8:              211374  6763968         java.util.HashMap$Entry
9:              102551  5742856         org.codehaus.jackson.util.TextBuffer
10:             105854  5080992         java.nio.HeapByteBuffer
11:             105821  5079408         java.nio.HeapCharBuffer
12:             104578  5019744         java.util.HashMap
13:             102551  4922448         org.codehaus.jackson.io.IOContext
14:             101782  4885536         org.codehaus.jackson.map.DeserializationConfig
15:             101783  4071320         org.codehaus.jackson.sym.CharsToNameCanonicalizer
16:             101779  4071160         org.codehaus.jackson.map.deser.StdDeserializationContext
17:             101779  4071160         java.io.StringReader
18:             101754  4070160         java.util.HashMap$KeyIterator
It looks like Jackson eats up a lot of memory.  Our mapper reads files in the Avro format.
Does Avro use Jackson a lot when reading Avro files?  Is there any way to improve this?
Thanks.
Ey-Chih Chow
From: scott@richrelevance.com
To: user@avro.apache.org
Date: Tue, 31 May 2011 18:26:23 -0700
Subject: Re: avro object reuse

All of those instances are short-lived.  If you are running out of memory, it's not likely
due to object reuse.  A lack of reuse tends to cost more CPU time in the garbage collector, but
does not cause out-of-memory conditions.  It can be hard to do on a cluster, but grabbing 'jmap -histo'
output from a JVM with larger-than-expected heap usage can often quickly identify the cause of
memory consumption issues.
I'm not sure if AvroUtf8InputFormat can safely re-use its instances of Utf8 or not.
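For what it's worth, here is a minimal sketch of what reuse could look like, assuming your Avro version
has Utf8.set(String); this is not what AvroUtf8InputFormat actually does:

import org.apache.avro.util.Utf8;

public class Utf8Reuse {
  // One Utf8 instance reused across records instead of allocating a new one per call.
  private final Utf8 reused = new Utf8();

  public Utf8 fill(String text) {
    return reused.set(text);   // overwrites the internal byte buffer in place
  }

  public static void main(String[] args) {
    Utf8Reuse r = new Utf8Reuse();
    System.out.println(r.fill("first record"));
    System.out.println(r.fill("second record"));  // same object, new contents
  }
}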

On 5/31/11 5:40 PM, "ey-chih chow" <eychih@hotmail.com> wrote:

I actually looked into the Avro code to find out how Avro does object reuse.  I looked at AvroUtf8InputFormat
and got the following question.  Why does a new Utf8 object have to be created each time the method
next(AvroWrapper<Utf8> key, NullWritable value) is called?  Will this eat up too much
memory when we call next(key, value) many times?  Since Utf8 is mutable, can we just create
one Utf8 object for all the calls to next(key, value)?  Will this save memory?  Thanks.
Ey-Chih Chow 

From: eychih@hotmail.com
To: user@avro.apache.org
Subject: avro object reuse
Date: Tue, 31 May 2011 10:38:39 -0700




Hi, 
We have several MapReduce jobs using Avro.  They take too much memory when running in production.
Can anybody suggest some object reuse techniques to cut down memory usage?  Thanks.
Ey-Chih Chow