hadoop-pig-dev mailing list archives

From "Alan Gates (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-793) Improving memory efficiency of Tuple implementation
Date Fri, 26 Jun 2009 16:11:07 GMT

    [ https://issues.apache.org/jira/browse/PIG-793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724594#action_12724594 ]

Alan Gates commented on PIG-793:

Using jmap, I've been toying around with our DefaultTuple implementation to see how much memory
it takes.  For a tuple with three elements (one int, one double, and one 20-character string), I see
it taking:

16 bytes for the Tuple object
24 bytes for the ArrayList<Object> in the tuple
~26 bytes for pointers in the ArrayList
16 bytes for the Integer
16 bytes for the Double
24 bytes for the String overhead
~52 bytes for the String data

Pointers in the ArrayList and character data in the String appear to be padded and vary somewhat
depending on how I run the experiments.
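
As a quick sanity check on the figures above, the following sketch simply adds them up and compares the total with the raw payload. The class and constant names are made up for illustration, the "~" values are approximate, and the payload figure assumes 2 bytes per character for the string data.

```java
// Illustrative only: sums the per-component sizes reported above.
// Names are hypothetical; the "~" figures are approximate measurements.
public class TupleFootprint {
    static final int TUPLE_OBJECT = 16;
    static final int ARRAY_LIST = 24;
    static final int ARRAY_LIST_POINTERS = 26; // approximate ("~26")
    static final int INTEGER = 16;
    static final int DOUBLE = 16;
    static final int STRING_OVERHEAD = 24;
    static final int STRING_DATA = 52;         // approximate ("~52")

    // Raw payload: 4-byte int + 8-byte double + 20 chars,
    // assuming 2 bytes per char.
    static final int PAYLOAD = 4 + 8 + 20 * 2;

    static int total() {
        return TUPLE_OBJECT + ARRAY_LIST + ARRAY_LIST_POINTERS
                + INTEGER + DOUBLE + STRING_OVERHEAD + STRING_DATA;
    }

    public static void main(String[] args) {
        System.out.println("approx. total bytes: " + total());
        System.out.println("raw payload bytes:   " + PAYLOAD);
    }
}
```

Under these assumptions the tuple occupies roughly three times the size of the data it actually carries.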

I played with changing the ArrayList<Object> in DefaultTuple to an Object[].  There
are two advantages: the 24 bytes of ArrayList shrink to 12 for the Object[], and because I wrote
it so that the Object[] is always exactly the right size, there is no padding cost.  The downside
is that append becomes a more expensive operation, because it grows the Object[] by
one every time.  However, after some investigation I believe that most places where we use append
can be changed to use set, thus alleviating this issue.  I'm working on a patch to change
this.  Once I have that done I'll report on how that changes memory usage as well as any performance
gains or losses.
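
To make the trade-off described above concrete, here is a minimal sketch of a tuple backed by an exactly sized Object[]. It is not the actual patch and not Pig's DefaultTuple; the class name ArrayTupleSketch and its methods are hypothetical.

```java
import java.util.Arrays;

// Hypothetical illustration, not Pig's DefaultTuple or the actual patch.
public class ArrayTupleSketch {
    // Always exactly as long as the number of fields, so no spare slots.
    private Object[] fields;

    public ArrayTupleSketch(int size) {
        fields = new Object[size];
    }

    public int size() {
        return fields.length;
    }

    public Object get(int i) {
        return fields[i];
    }

    // Cheap: writes into a slot that already exists.
    public void set(int i, Object value) {
        fields[i] = value;
    }

    // Expensive: allocates a new array one slot longer and copies
    // every existing field on each call.
    public void append(Object value) {
        fields = Arrays.copyOf(fields, fields.length + 1);
        fields[fields.length - 1] = value;
    }

    public static void main(String[] args) {
        // Preferred pattern when the field count is known up front.
        ArrayTupleSketch t = new ArrayTupleSketch(3);
        t.set(0, 42);
        t.set(1, 3.14);
        t.set(2, "a 20 character value");
        System.out.println("fields: " + t.size());
    }
}
```

When the number of fields is known in advance, allocating once and calling set avoids the per-call copy that append incurs.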

A related item I would like to look into is using Hadoop's Text instead of String to back
chararray.  Text takes 16 bytes of overhead + 36 bytes for string data to store 20 characters,
versus the 24 / 52 of String.  Obviously this would be a huge change and needs to have very
impressive results to be considered.  I'll play with it and report results here.
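
One likely reason for the smaller data size is encoding: Text keeps its contents as UTF-8 bytes, whereas a String backed by a char[] uses two bytes per character. The sketch below compares only the raw character data for a 20-character ASCII string using standard Java; it does not use Hadoop and does not measure object headers or padding. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: compares raw character-data sizes, not full
// object footprints. Names are hypothetical.
public class CharDataSize {
    // Bytes needed to hold the string as UTF-8 (the encoding Text uses).
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Bytes needed by a char[]-backed String: 2 bytes per char.
    static int utf16Bytes(String s) {
        return s.length() * 2;
    }

    public static void main(String[] args) {
        String s = "abcdefghijklmnopqrst"; // 20 ASCII characters
        System.out.println("UTF-8 bytes:  " + utf8Bytes(s));
        System.out.println("UTF-16 bytes: " + utf16Bytes(s));
    }
}
```

For ASCII-heavy data the UTF-8 form is half the size; strings with many non-ASCII characters would narrow or reverse that gap.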

> Improving memory efficiency of Tuple implementation
> ---------------------------------------------------
>                 Key: PIG-793
>                 URL: https://issues.apache.org/jira/browse/PIG-793
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Alan Gates
> Currently, our tuple is a real pig and uses a lot of extra memory. 
> There are several places where we can improve memory efficiency:
> (1) Laying out memory for the fields rather than using Java objects, since each object for a numeric field takes 16 bytes
> (2) For the cases where we know the schema, using Java arrays rather than ArrayList.
> There might be more.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
