hadoop-pig-dev mailing list archives

From Alan Gates <ga...@yahoo-inc.com>
Subject Re: A proposal for changing pig's memory management
Date Tue, 19 May 2009 23:31:50 GMT
We definitely do not want to follow the current design of keeping  
chararrays and bytearrays as separate objects.  It is that overhead of  
an object for each field that we are trying to avoid.

The reason for constraining a tuple to store its data in one
TupleBuffer is to limit the size of the Tuple object.  If the data can
span TupleBuffers, then a Tuple has to hold a TupleBuffer reference for
each field, almost doubling the size of a Tuple.
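To make the trade-off concrete, here is a minimal sketch (the class and field names are illustrative, not Pig's actual API) of a tuple whose scalar fields are packed into a single backing byte array, with one int offset per field instead of one Java object per field:

```java
import java.nio.ByteBuffer;

// Sketch of the single-buffer layout: one buffer reference for the whole
// tuple, plus a 4-byte offset per field.  (If fields could span buffers,
// each field would instead need its own buffer reference as well.)
public class PackedTuple {
    private final byte[] buffer;   // the one TupleBuffer holding all field bytes
    private final int[] offsets;   // start of each field within the buffer

    PackedTuple(byte[] buffer, int[] offsets) {
        this.buffer = buffer;
        this.offsets = offsets;
    }

    // Read field i as a 4-byte int straight out of the buffer.
    int getInt(int i) {
        return ByteBuffer.wrap(buffer, offsets[i], 4).getInt();
    }

    public static void main(String[] args) {
        // Pack two int fields (7 and 42) into one 8-byte buffer.
        byte[] buf = ByteBuffer.allocate(8).putInt(7).putInt(42).array();
        PackedTuple t = new PackedTuple(buf, new int[]{0, 4});
        System.out.println(t.getInt(0) + " " + t.getInt(1)); // 7 42
    }
}
```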

It is worth noting that a number of pig operators (sort, filter,
distinct) need not copy any data.  Some have to copy (foreach, though
in certain cases, such as simple projections, this could be optimized
away), and some could avoid copying if tuples could reference fields
across buffers (join, union, cross).

I think we should do some experimentation with this and see:

1) What is the memory saving from moving from objects to buffers, i.e.
whether this is worth doing at all?
2) What is the additional memory cost of storing a TupleBuffer
reference per field?
3) What is the performance penalty of copying data on joins, unions,
and crosses?
Once these are known it should be easier to make trade off decisions.
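Experiment (1) can be framed as back-of-envelope arithmetic before any measurement.  The sketch below compares the per-field footprint of a boxed Integer field against a packed 4-byte field; the overhead figures are typical 64-bit HotSpot numbers and are assumptions, not measurements:

```java
// Rough per-field cost comparison for experiment (1).  All overhead
// figures here are typical 64-bit HotSpot estimates, not measured values.
public class FieldCost {
    public static void main(String[] args) {
        // Object-per-field today: a boxed Integer (~16 bytes with header,
        // payload, and padding) plus the 8-byte reference held by the tuple.
        int boxedPerField = 16 + 8;    // 24 bytes
        // Packed layout: 4 data bytes in the buffer plus a 4-byte offset.
        int packedPerField = 4 + 4;    // 8 bytes
        System.out.println(boxedPerField + " vs " + packedPerField);
    }
}
```

By this estimate an int field shrinks roughly 3x, which is the kind of saving experiment (1) would confirm or refute.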

There is one other option.  I had said that it would be better to have
a few very large (on the order of 10M) buffers.  The reason I
considered that was that I didn't want so many buffers that managing
them became a burden on the system, and that we have to somehow handle
the case of chararray, bytearray, or map fields that won't fit in a
single TupleBuffer (presumably by storing those as an object instead
of in the buffer).  The larger we make the buffers, the less often we
encounter this issue.

Instead of using large buffers we could use smaller ones.  If we
capped the size of a buffer at 64K (2^16 bytes) and the number of
buffers a single tuple could reference at 64K, then a tuple could
still address 4G of memory while using only 4 bytes per field to
point to the data.  This way join operations could be done without
copying the data.  This option should be experimented with as well.
It may be that using smaller buffers is better, since the cost of
reading and writing them to disk will be less.
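The 4-byte field pointer this small-buffer scheme implies could be encoded as follows (an assumed encoding for illustration, not something specified in the proposal): 16 bits of buffer index plus 16 bits of offset, which together address 64K buffers of 64K bytes each, i.e. 4G:

```java
// Sketch of a 4-byte field pointer for the small-buffer scheme:
// high 16 bits select one of up to 65,536 buffers, low 16 bits give
// the offset within a 64K buffer.
public class FieldRef {
    static int encode(int bufferIndex, int offset) {
        return (bufferIndex << 16) | (offset & 0xFFFF);
    }
    static int bufferIndex(int ref) { return ref >>> 16; }
    static int offset(int ref)      { return ref & 0xFFFF; }

    public static void main(String[] args) {
        int ref = encode(513, 1027);
        System.out.println(bufferIndex(ref) + " " + offset(ref)); // 513 1027
    }
}
```

Because a join output tuple would hold only these 4-byte pointers, its fields could reference bytes in both input tuples' buffers without copying.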


On May 15, 2009, at 11:23 AM, Thejas Nair wrote:

> With a constraint that all scalar values in a tuple must fit into a
> single buffer, the values will always have to be copied whenever a
> tuple's contents need to be copied to a new tuple after a relational
> operation.
> The overhead of copying is not large for numeric types compared to the
> existing implementation, because we already copy the object references.
> But it can be a large overhead for chararray/bytearray values that are
> long enough.
> To avoid this performance penalty, we should not require these larger
> datatypes to be stored in the same buffer, and maybe follow the design
> in the current implementation for those, i.e. store them in Java
> objects.
> To prevent the bloating that occurs when 8-bit chars are stored in
> String objects, we can delay their conversion into String objects and
> store them like bytearray until some String operation needs to be
> done.  For any memory-intensive operations like join, we can store
> them again as bytearray.
> I assume that in the current design you would be doing something
> similar (treating chararray the same way as bytearray) until String
> operations need to be done.
> Thanks,
> Thejas
> On 5/14/09 5:33 PM, "Alan Gates" <gates@yahoo-inc.com> wrote:
>> http://wiki.apache.org/pig/PigMemory
>> Alan.
