pig-dev mailing list archives

From "Alan Gates (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-793) Improving memory efficiency of Tuple implementation
Date Sat, 27 Jun 2009 19:52:47 GMT

    [ https://issues.apache.org/jira/browse/PIG-793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724880#action_12724880 ]

Alan Gates commented on PIG-793:

The cost for storing data raw is:

16 bytes for the tuple object
12 bytes for the byte array object
12 bytes + 2 bytes/field for a short[] to hold offsets into the byte[]
Then, as you say above, the bytes for the data itself, plus 1 byte per field to store type and nullness.

So our example tuple would take ~85 bytes.
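
As a rough sketch of the arithmetic above: the class and method names here are hypothetical, the constants are just the figures quoted in this comment, and JVM alignment padding is ignored. The tuple in main() is made up for illustration and is not the example tuple from earlier in the thread.

```java
// Back-of-the-envelope estimate for a tuple stored as raw bytes.
// Not Pig code; the sizes are the ones quoted in the comment above.
final class RawTupleCost {
    static final int TUPLE_OBJECT = 16;       // the tuple object itself
    static final int BYTE_ARRAY_HEADER = 12;  // byte[] holding the data
    static final int SHORT_ARRAY_HEADER = 12; // short[] holding the offsets
    static final int OFFSET_PER_FIELD = 2;    // one short per field
    static final int TYPE_PER_FIELD = 1;      // type and nullness byte

    // fieldCount: number of fields; dataBytes: total size of the field payloads.
    static int estimateBytes(int fieldCount, int dataBytes) {
        return TUPLE_OBJECT + BYTE_ARRAY_HEADER + SHORT_ARRAY_HEADER
                + fieldCount * (OFFSET_PER_FIELD + TYPE_PER_FIELD)
                + dataBytes;
    }

    public static void main(String[] args) {
        // Hypothetical tuple: an int (4) + a long (8) + a 10-byte string = 22 data bytes.
        System.out.println(estimateBytes(3, 22)); // prints 71
    }
}
```

The fixed overhead is 40 bytes per tuple plus 3 bytes per field, so the data itself dominates once the tuple has more than a handful of fields.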

But in general, yes, you can do much better with raw bytes.  We experimented with this and
found that the cost of Tuple.get/set goes up 10x because of the need to turn the bytes into
objects.  In a typical query this added about 2x to the overall run time.  The solution
would be to rewrite all the Pig operators to work on byte data instead of objects.  This
is a large project, and it doesn't solve the problem for UDFs.  We could pay the performance
penalty for UDFs, or we could change the UDFs to take byte data.  Currently many of our users
are asking for the ability to write UDFs in Python or other scripting languages.  If we
instead go the other way and basically make them write C-style Java, I don't think that will
be popular.
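
To make the get/set cost concrete, here is a minimal sketch of a tuple backed by a byte[] plus a short[] of offsets, with one type byte per field. This is not Pig's implementation: the RawTuple class, the type tags, and the encoding are all assumptions chosen for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical raw-bytes tuple: each field is stored as [type byte][payload].
final class RawTuple {
    // Hypothetical type tags; the tag doubles as the null marker.
    static final byte NULL = 0, INT = 1, LONG = 2, STRING = 3;

    private final byte[] data;     // all fields, back to back
    private final short[] offsets; // start of each field (limits a tuple to 32 KB)

    RawTuple(Object... fields) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        offsets = new short[fields.length];
        for (int i = 0; i < fields.length; i++) {
            offsets[i] = (short) out.size();
            Object f = fields[i];
            byte type;
            byte[] payload;
            if (f == null) {
                type = NULL;
                payload = new byte[0];
            } else if (f instanceof Integer) {
                type = INT;
                payload = ByteBuffer.allocate(4).putInt((Integer) f).array();
            } else if (f instanceof Long) {
                type = LONG;
                payload = ByteBuffer.allocate(8).putLong((Long) f).array();
            } else if (f instanceof String) {
                type = STRING;
                payload = ((String) f).getBytes(StandardCharsets.UTF_8);
            } else {
                throw new IllegalArgumentException("unsupported type: " + f.getClass());
            }
            out.write(type);
            out.write(payload, 0, payload.length);
        }
        data = out.toByteArray();
    }

    // Every call decodes the bytes and allocates a fresh object.
    Object get(int i) {
        int start = offsets[i];
        int end = (i + 1 < offsets.length) ? offsets[i + 1] : data.length;
        switch (data[start]) {
            case INT:
                return ByteBuffer.wrap(data, start + 1, 4).getInt();
            case LONG:
                return ByteBuffer.wrap(data, start + 1, 8).getLong();
            case STRING:
                return new String(data, start + 1, end - start - 1, StandardCharsets.UTF_8);
            default:
                return null;
        }
    }
}
```

With an object-based tuple, get is a single array or list lookup. Here every get has to read the type byte, decode the payload, and build a new object, which is the kind of per-call work behind the slowdown described above. Operators or UDFs that read the bytes directly would avoid it, at the price of the rewrite.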

What we're playing with now (changing ArrayList<Object> to Object[] and String to Text)
should reap somewhere around 50% of the memory savings of going to fully raw data, for
around 10% of the work.  I'm not ruling out moving to storing everything in a byte[] in
the future.  But I want to see whether, for a little work now, we can get a decent amount
of improvement.
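
For the Object[] half of that change, a minimal sketch might look like the following. ArrayTuple is a hypothetical name, not the actual Pig class, and the String to Text half is left out because it depends on Hadoop's Text class.

```java
// Hypothetical fixed-size tuple backed directly by an Object[].
// Compared with ArrayList<Object>, this drops the extra list object
// (its header, size and modCount fields) and any spare capacity in
// the list's backing array.
final class ArrayTuple {
    private final Object[] fields;

    // The field count is fixed up front, e.g. when the schema is known.
    ArrayTuple(int size) {
        fields = new Object[size];
    }

    Object get(int i) {
        return fields[i];
    }

    void set(int i, Object value) {
        fields[i] = value;
    }

    int size() {
        return fields.length;
    }
}
```

The fields are still ordinary Java objects, so operators and UDFs keep working unchanged; only the container gets cheaper.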

> Improving memory efficiency of Tuple implementation
> ---------------------------------------------------
>                 Key: PIG-793
>                 URL: https://issues.apache.org/jira/browse/PIG-793
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Alan Gates
> Currently, our tuple is a real pig and uses a lot of extra memory.
> There are several places where we can improve memory efficiency:
> (1) Laying out memory for the fields rather than using Java objects, since each object
> for a numeric field takes 16 bytes
> (2) For the cases where we know the schema, using Java arrays rather than ArrayList.
> There might be more.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
