hadoop-pig-dev mailing list archives

From Thejas Nair <te...@yahoo-inc.com>
Subject Re: A proposal for changing pig's memory management
Date Fri, 15 May 2009 18:23:11 GMT
With the constraint that all scalar values in a tuple must fit into a single
buffer, the values will always have to be copied whenever a tuple's contents
need to be copied to a new tuple after a relational operation.

The overhead of copying is not large for numeric types compared to the
existing implementation, because we already copy the object references. But
it can be a large overhead for chararray/bytearray values that are long
enough.
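
As a rough illustration of the cost difference described above, here is a
minimal Java sketch. The class and method names (CopyCostSketch,
copyByReference, copyByValue) are hypothetical and are not part of Pig; the
sketch only assumes that a tuple is either an array of object references or a
run of bytes in a single buffer.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch, not Pig code: contrasts the two ways of copying a tuple.
final class CopyCostSketch {

    // Object-per-field layout: only the references are copied, so the cost
    // does not depend on how long a chararray/bytearray field is.
    static Object[] copyByReference(Object[] fields) {
        return fields.clone();
    }

    // Single-buffer layout: every byte of every value is copied, so the cost
    // grows with the length of the chararray/bytearray fields.
    static ByteBuffer copyByValue(ByteBuffer src) {
        ByteBuffer dst = ByteBuffer.allocate(src.remaining());
        dst.put(src.duplicate()); // duplicate() leaves src's position untouched
        dst.flip();               // make dst readable from the start
        return dst;
    }
}
```

With the reference copy, a long bytearray field is shared between the old and
new tuple; with the single-buffer copy, its bytes are duplicated.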

To avoid this performance penalty, we should not require these larger
datatypes to be stored in the same buffer, and maybe follow the design of the
current implementation for those, i.e. store them in Java objects.
To prevent the bloating that occurs when single-byte (8-bit) chars are stored
in String objects, we can delay their conversion into String objects and store
them like bytearray until some String operation needs to be done. For any
memory-intensive operation like join, we can store them again as bytearray.
I assume that in the current design you would be doing something similar
(treating chararray the same way as bytearray) until String operations need
to be done.
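
The delayed conversion described above could look roughly like the following
Java sketch. LazyCharArray and its methods are hypothetical names, not
existing Pig classes, and the sketch assumes the raw bytes are UTF-8 encoded.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not Pig code: a chararray that stays in byte form
// until a String operation actually needs it.
final class LazyCharArray {
    private final byte[] utf8; // compact form, assumed to be UTF-8
    private String decoded;    // null until a String operation asks for it

    LazyCharArray(byte[] utf8) {
        this.utf8 = utf8;
    }

    // Decode on first use and cache the result for later String operations.
    String asString() {
        if (decoded == null) {
            decoded = new String(utf8, StandardCharsets.UTF_8);
        }
        return decoded;
    }

    // Drop the cached String before a memory-intensive operation such as a
    // join, so that only the byte form is retained.
    void compact() {
        decoded = null;
    }

    boolean isDecoded() {
        return decoded != null;
    }

    int byteLength() {
        return utf8.length;
    }
}
```

A caller would invoke asString() only when a String function is applied to the
field, and compact() before the tuple is held in memory for a join.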

Thanks,
Thejas




On 5/14/09 5:33 PM, "Alan Gates" <gates@yahoo-inc.com> wrote:

> http://wiki.apache.org/pig/PigMemory
> 
> Alan.

