hadoop-pig-dev mailing list archives

From "Hong Tang (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-793) Improving memory efficiency of Tuple implementation
Date Fri, 01 May 2009 23:59:30 GMT

    [ https://issues.apache.org/jira/browse/PIG-793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12705188#action_12705188
] 

Hong Tang commented on PIG-793:
-------------------------------

Two ideas:

# When loading a tuple from serialized data, keep it as a byte array and only instantiate the
datums when get/set calls are made. This would help when we are only moving tuples from one
container to another, since the fields would never need to be deserialized.
{code}
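class LazyTuple implements Tuple {
  ArrayList<Object> fields; // null until the tuple has been deserialized
  DataByteArray lazyBytes; // serialized bytes of the tuple, e.g. in Avro format
  // Tuple methods omitted; get/set would deserialize lazyBytes into fields on first use.
}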
{code} 
# Improving DataByteArray. It could be changed to an interface (with get(), offset(), and
length()), with a DataByteArrayFactory creating instances in two ways:
## DataByteArrayFactory.createPrivate(byte[], offset, length), if we need to keep a private
copy of the buffer.
## DataByteArrayFactory.createShared(), if the input buffer can be shared with the data byte
array object. In this case, the contract would be that the caller will no longer access the
portion of the byte array from offset (inclusive) to offset+length (exclusive).
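
The first idea (lazy deserialization) could be sketched roughly as follows. This is only an illustration under stated assumptions: LazyTupleSketch, its toy wire format (a field count followed by UTF strings), and the rawBytes() accessor are all hypothetical, and the class deliberately does not implement Pig's Tuple interface or use Avro.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the proposed LazyTuple; it does not implement Pig's Tuple.
final class LazyTupleSketch {
    private byte[] lazyBytes;     // serialized form; dropped once a field is modified
    private List<String> fields;  // null until the first get/set call

    LazyTupleSketch(byte[] serialized) {
        this.lazyBytes = serialized;
    }

    // Toy wire format (an assumption): field count, then one UTF string per field.
    static byte[] serialize(List<String> values) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(values.size());
            for (String v : values) {
                out.writeUTF(v);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    boolean isMaterialized() {
        return fields != null;
    }

    // Decode the fields on first access only.
    private void materialize() {
        if (fields != null) {
            return;
        }
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(lazyBytes))) {
            int count = in.readInt();
            List<String> decoded = new ArrayList<>(count);
            for (int i = 0; i < count; i++) {
                decoded.add(in.readUTF());
            }
            fields = decoded;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    String get(int index) {
        materialize();
        return fields.get(index);
    }

    void set(int index, String value) {
        materialize();
        fields.set(index, value);
        lazyBytes = null; // the cached bytes are stale after a modification
    }

    // Moving an untouched tuple to another container needs no decoding.
    byte[] rawBytes() {
        return lazyBytes != null ? lazyBytes : serialize(fields);
    }

    public static void main(String[] args) {
        LazyTupleSketch t = new LazyTupleSketch(serialize(Arrays.asList("alice", "42")));
        System.out.println(t.isMaterialized()); // false: nothing decoded yet
        System.out.println(t.get(1));           // 42
        System.out.println(t.isMaterialized()); // true
    }
}
```

In this sketch a tuple that is only passed along via rawBytes() never allocates its field list; the cost is that a set() call invalidates the cached bytes, so a later rawBytes() has to re-serialize.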

There could be three different implementations of this:
- The current implementation will be used for createPrivate().
- An implementation for small buffers (offset/length can be represented in short/short).
- An implementation for large buffers (offset/length are int/int, and length is large enough).

Note that the change to DataByteArray would break the current semantics where the offset is
always 0, and length is always the length of the buffer.
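
A minimal sketch of what the interface and factory might look like is given below. All names here (ByteRange, ByteRangeFactory, and the three implementation classes) are hypothetical and are not part of Pig; the parameter list of createShared and the rule for choosing the short/short variant (both values fit in a short) are assumptions made for illustration only.

```java
import java.util.Arrays;

// Hypothetical interface with the get()/offset()/length() shape proposed above.
interface ByteRange {
    byte[] get();   // backing buffer, which may be larger than the range
    int offset();   // index of the first byte of the range
    int length();   // number of bytes in the range
}

// Mirrors the current semantics: offset is always 0 and length is the buffer length.
final class WholeArrayRange implements ByteRange {
    private final byte[] buf;
    WholeArrayRange(byte[] buf) { this.buf = buf; }
    public byte[] get() { return buf; }
    public int offset() { return 0; }
    public int length() { return buf.length; }
}

// Shared view whose offset and length both fit in a short.
final class SmallSharedRange implements ByteRange {
    private final byte[] buf;
    private final short offset;
    private final short length;
    SmallSharedRange(byte[] buf, short offset, short length) {
        this.buf = buf;
        this.offset = offset;
        this.length = length;
    }
    public byte[] get() { return buf; }
    public int offset() { return offset; }
    public int length() { return length; }
}

// Shared view with int offset and length.
final class LargeSharedRange implements ByteRange {
    private final byte[] buf;
    private final int offset;
    private final int length;
    LargeSharedRange(byte[] buf, int offset, int length) {
        this.buf = buf;
        this.offset = offset;
        this.length = length;
    }
    public byte[] get() { return buf; }
    public int offset() { return offset; }
    public int length() { return length; }
}

// Hypothetical factory; the method names follow the proposal, the rest is assumed.
final class ByteRangeFactory {
    private ByteRangeFactory() {}

    // Copies the range, so the caller may keep using its own buffer freely.
    static ByteRange createPrivate(byte[] buf, int offset, int length) {
        checkBounds(buf, offset, length);
        return new WholeArrayRange(Arrays.copyOfRange(buf, offset, offset + length));
    }

    // Wraps buf without copying; the caller must no longer access
    // the bytes in [offset, offset + length).
    static ByteRange createShared(byte[] buf, int offset, int length) {
        checkBounds(buf, offset, length);
        if (offset <= Short.MAX_VALUE && length <= Short.MAX_VALUE) {
            return new SmallSharedRange(buf, (short) offset, (short) length);
        }
        return new LargeSharedRange(buf, offset, length);
    }

    private static void checkBounds(byte[] buf, int offset, int length) {
        if (offset < 0 || length < 0 || offset > buf.length - length) {
            throw new IndexOutOfBoundsException("offset=" + offset + ", length=" + length);
        }
    }
}
```

For example, ByteRangeFactory.createShared(data, 2, 3) would return a view with offset() == 2 over the same array, while createPrivate(data, 2, 3) would return a copy with offset() == 0. Any existing code that assumes offset is 0 and length equals the buffer length would have to be reviewed before shared views could be introduced.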


> Improving memory efficiency of Tuple implementation
> ---------------------------------------------------
>
>                 Key: PIG-793
>                 URL: https://issues.apache.org/jira/browse/PIG-793
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>
> Currently, our tuple is a real pig and uses a lot of extra memory. 
> There are several places where we can improve memory efficiency:
> (1) Laying out memory for the fields rather than using Java objects, since each object for a numeric field takes 16 bytes
> (2) For the cases where we know the schema, using Java arrays rather than ArrayList.
> There might be more.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

