pig-dev mailing list archives

From "Thejas M Nair (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1875) Keep tuples serialized to limit spilling and speed it when it happens
Date Wed, 02 Mar 2011 21:05:37 GMT

    [ https://issues.apache.org/jira/browse/PIG-1875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13001640#comment-13001640 ]

Thejas M Nair commented on PIG-1875:

This idea is likely to speed up queries where Pig ends up spilling to disk today.

But this approach will have a larger memory footprint in cases where we would not have ended
up spilling to disk, if I assume that deserializing more than once is going to be very expensive.
Maybe this can be turned on for a stream once we see a need to spill. The first spill will
not end up using this approach if we do that. This is hopefully easy to do, but I haven't

For example, this approach is not going to be useful for the leftmost stream in a join; there it
will make sense to keep only the deserialized version in memory. For the other streams, when
we know we are likely to spill to disk, Pig can be more aggressive in destroying the deserialized
copy. The bag holding the tuple can be in charge of destroying the deserialized copy.
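
A rough sketch of what I mean, in plain Java. None of this is Pig's actual Tuple/DataBag
API: TupleCodec, SpillAwareBag and all method names are made up for illustration, tuples are
reduced to arrays of strings, and the "spill file" is just a list of byte arrays. The bag
starts in today's mode (deserialize on arrival), switches to keeping serialized bytes after
its first spill, and is the one that drops deserialized copies.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical codec: a tuple is a String[] written as a count followed by UTF strings.
final class TupleCodec {
    static byte[] serialize(String[] fields) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(fields.length);
            for (String f : fields) out.writeUTF(f);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static String[] deserialize(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            String[] fields = new String[in.readInt()];
            for (int i = 0; i < fields.length; i++) fields[i] = in.readUTF();
            return fields;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

// Hypothetical bag that only keeps tuples serialized once it has spilled at least once.
final class SpillAwareBag {
    private static final class Entry {
        byte[] bytes;     // serialized copy, null in the default (deserialized-only) mode
        String[] fields;  // deserialized copy, null until needed in serialized mode
    }

    private final List<Entry> entries = new ArrayList<>();
    private boolean keepSerialized;  // off until the first spill
    private int reserialized;        // tuples that had to be serialized again at spill time

    // Records arrive in serialized form, as they do off the reduce iterator.
    void add(byte[] record) {
        Entry e = new Entry();
        if (keepSerialized) e.bytes = record;
        else e.fields = TupleCodec.deserialize(record);
        entries.add(e);
    }

    String get(int tuple, int field) {
        Entry e = entries.get(tuple);
        if (e.fields == null) e.fields = TupleCodec.deserialize(e.bytes);
        return e.fields[field];
    }

    // The bag is in charge of destroying deserialized copies it can rebuild from bytes.
    void dropDeserializedCopies() {
        for (Entry e : entries) if (e.bytes != null) e.fields = null;
    }

    // Returns the spilled records; the first spill still pays for serialization.
    List<byte[]> spill() {
        List<byte[]> out = new ArrayList<>();
        for (Entry e : entries) {
            if (e.bytes != null) {
                out.add(e.bytes);
            } else {
                out.add(TupleCodec.serialize(e.fields));
                reserialized++;
            }
        }
        entries.clear();
        keepSerialized = true;
        return out;
    }

    int reserializedCount() { return reserialized; }

    int deserializedCount() {
        int n = 0;
        for (Entry e : entries) if (e.fields != null) n++;
        return n;
    }
}
```

In this sketch a stream that never spills holds exactly one copy of each tuple, same as
today. The leftmost stream of a join would simply never have keepSerialized turned on.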

> Keep tuples serialized to limit spilling and speed it when it happens
> ---------------------------------------------------------------------
>                 Key: PIG-1875
>                 URL: https://issues.apache.org/jira/browse/PIG-1875
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Alan Gates
>            Priority: Minor
>             Fix For: 0.10
>         Attachments: mrtuple.patch
> Currently Pig reads records off of the reduce iterator and immediately deserializes them
into Java objects.  This takes up much more memory than serialized versions, thus Pig spills
sooner than if it stored them in serialized form.  Also, if it does have to spill, it has
to serialize them again, and then again deserialize them after reading from the spill file.
> We should explore storing them in memory serialized when they are read off of the reduce


