pig-dev mailing list archives

From "Alan Gates (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1875) Keep tuples serialized to limit spilling and speed it when it happens
Date Tue, 01 Mar 2011 23:57:37 GMT

    [ https://issues.apache.org/jira/browse/PIG-1875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13001192#comment-13001192 ]

Alan Gates commented on PIG-1875:

Thoughts so far on a possible implementation:
As mentioned above, the tuple would stay in serialized form when read as part of the reduce
iterator.  It would deserialize itself the first time one of its methods was invoked.
 If the serialized tuple was large enough, it could even be compressed before
being written out by the map.

This can be done using a new implementation of tuple that wraps an existing implementation.
 If we call the new tuple MToRTuple and assume that the existing one is DefaultTuple, then
MToRTuple.write() would call DefaultTuple.write(), passing it an output stream connected to
a bytearray.  It would then write that bytearray to disk.  MToRTuple.readFields() would just
read the data from disk into a bytearray.  It would not decompress it.  Then the first time a
method was called to access the tuple, it would call DefaultTuple.readFields() on the bytearray
to deserialize the data.  From that point on it would pass calls through to the underlying
default tuple.
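
A rough sketch of the wrapper idea, to make the write/readFields flow concrete.  This is not Pig code: SimpleTuple is a hypothetical stand-in for DefaultTuple (just a list of strings), LazyTuple is a hypothetical stand-in for MToRTuple, and compression is left out.  It only assumes Writable-style write(DataOutput) and readFields(DataInput) methods.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for DefaultTuple: a list of strings with Writable-style methods.
class SimpleTuple {
    final List<String> fields = new ArrayList<>();

    void write(DataOutput out) throws IOException {
        out.writeInt(fields.size());
        for (String f : fields) out.writeUTF(f);
    }

    void readFields(DataInput in) throws IOException {
        fields.clear();
        int n = in.readInt();
        for (int i = 0; i < n; i++) fields.add(in.readUTF());
    }
}

// Hypothetical sketch of the MToRTuple idea: hold bytes, deserialize on first access.
class LazyTuple {
    private byte[] raw;         // serialized form, kept until first access
    private SimpleTuple inner;  // deserialized form, null until needed

    LazyTuple() {}
    LazyTuple(SimpleTuple t) { this.inner = t; }

    // Serialize the wrapped tuple into a byte array, then write length + bytes.
    // If the tuple was never deserialized, the raw bytes are written back as-is.
    void write(DataOutput out) throws IOException {
        byte[] bytes = raw;
        if (bytes == null) {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream data = new DataOutputStream(buf);
            inner.write(data);
            data.flush();
            bytes = buf.toByteArray();
        }
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    // Read only the raw bytes; do not deserialize yet.
    void readFields(DataInput in) throws IOException {
        raw = new byte[in.readInt()];
        in.readFully(raw);
        inner = null;
    }

    boolean isDeserialized() { return inner != null; }

    // The first accessor call triggers deserialization; later calls pass through.
    String get(int i) throws IOException {
        if (inner == null) {
            inner = new SimpleTuple();
            inner.readFields(new DataInputStream(new ByteArrayInputStream(raw)));
            raw = null; // drop the serialized copy once materialized
        }
        return inner.fields.get(i);
    }
}
```

One side effect of this shape: a tuple that was never touched can be spilled by writing its bytearray straight back out, with no serialization step at all.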

There are a few open questions to be answered here:  

# Will raw comparators work ok with compressed data?  We may have to not compress the keys,
but that's probably ok.  In my experiments using gzip the break even point for compression
was 150 bytes, and it didn't get useful until 200 bytes or so.  I'm guessing key lengths rarely
exceed 200 bytes.
# What is the CPU overhead of compressing data?  Will we be slowing down jobs that never need
to spill?
# Are there any significant advantages to this in cases where the combiner or Accumulator is
already in use?  Would we want to turn it on only if neither of these applies?
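
For anyone who wants to check the break-even point on their own data, something like the following could be used.  It is only a sketch: GzipBreakEven and its sample() payload are made up for illustration, and the repeating text it generates compresses far better than real records would, so the numbers it prints say nothing about the 150 byte figure above.  The fixed gzip header and trailer are the reason very small inputs come out larger than they went in.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

class GzipBreakEven {
    // Returns the gzip-compressed size in bytes of the given payload.
    static int gzipSize(byte[] data) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(data);
        }
        return buf.size();
    }

    // Hypothetical payload: repeating text that loosely resembles delimited fields.
    // Replace this with real serialized tuples to get meaningful numbers.
    static byte[] sample(int length) {
        byte[] pattern = "user42\tclick\t2011-03-01\t".getBytes(StandardCharsets.US_ASCII);
        byte[] out = new byte[length];
        for (int i = 0; i < length; i++) out[i] = pattern[i % pattern.length];
        return out;
    }

    public static void main(String[] args) throws IOException {
        // Print raw size next to compressed size for a range of payload lengths.
        for (int len : new int[] {16, 50, 100, 150, 200, 400, 1000}) {
            System.out.printf("raw=%d gzip=%d%n", len, gzipSize(sample(len)));
        }
    }
}
```

Wrapping the gzipSize() call in a timer would give a first rough answer to the CPU overhead question as well.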

To really make use of a feature like this, we would need to integrate it with a couple of
other changes:

# One place I see this being a huge win is for joins since we have to keep n - 1 streams from
the join in memory.  If we could change join to only deserialize the join key and keep the
value serialized/compressed, we may be able to significantly reduce the cases where join spills.
# To really make use of this we need to throw away the deserialized tuple as soon as we're
done with it.  We could do this by forcing the tuple to deserialize every time (way too slow),
or by creating a destructive iterator where the element is thrown away once it's read.  This
would work for the leftmost stream in a join (though not any other streams in an n-way join
with n > 2).  If we wanted to use this for aggregate UDFs we could also allow them to tell
us whether they could use a destructive iterator.
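
To illustrate the destructive iterator mentioned in the last item, here is a minimal sketch.  DestructiveIterator is a hypothetical name and this is not how Pig's bags are implemented; it assumes the iterator is handed sole ownership of a Deque holding the elements, so that once an element is returned nothing else references it and it can be garbage collected when the caller is done with it.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical sketch: an iterator that removes each element from its backing
// store as it is returned, so the store never retains already-read elements.
class DestructiveIterator<T> implements Iterator<T> {
    private final Deque<T> remaining; // assumed to be owned by this iterator alone

    DestructiveIterator(Deque<T> owned) {
        this.remaining = owned;
    }

    @Override
    public boolean hasNext() {
        return !remaining.isEmpty();
    }

    @Override
    public T next() {
        // pollFirst() removes the head, dropping the store's reference to it.
        T item = remaining.pollFirst();
        if (item == null) throw new NoSuchElementException();
        return item;
    }

    public static void main(String[] args) {
        Deque<String> bag = new ArrayDeque<>();
        bag.add("t1");
        bag.add("t2");
        Iterator<String> it = new DestructiveIterator<>(bag);
        while (it.hasNext()) {
            System.out.println(it.next() + " read, still held: " + bag.size());
        }
    }
}
```

The cost is that the data can be walked only once, which is why this would not work for the streams in a join that have to be re-read.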

> Keep tuples serialized to limit spilling and speed it when it happens
> ---------------------------------------------------------------------
>                 Key: PIG-1875
>                 URL: https://issues.apache.org/jira/browse/PIG-1875
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Alan Gates
>            Priority: Minor
> Currently Pig reads records off of the reduce iterator and immediately deserializes them
> into Java objects.  This takes up much more memory than serialized versions, thus Pig spills
> sooner than if it stored them in serialized form.  Also, if it does have to spill, it has
> to serialize them again, and then again deserialize them after reading from the spill file.
> We should explore storing them in memory serialized when they are read off of the reduce
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

