pig-dev mailing list archives

From "Alan Gates (JIRA)" <j...@apache.org>
Subject [jira] Updated: (PIG-1875) Keep tuples serialized to limit spilling and speed it when it happens
Date Wed, 02 Mar 2011 00:05:37 GMT

     [ https://issues.apache.org/jira/browse/PIG-1875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-1875:

    Attachment: mrtuple.patch

Here's a first pass at what MToRTuple might look like.  I've done some basic testing to confirm
that it works, but nothing comprehensive.

In test runs where I serialized 100k tuples, wrote them to disk, and read them back, I got
the following results.

Existing tuples:

time to write to disk:       81.93 sec
size on disk:                98M
time to read from disk:      12.62 sec
size in memory (after read): 238M

MToRTuple:

time to write to disk:       10.49 sec
size on disk:                58M
time to read from disk:      1.10 sec
size in memory (after read): 57M

So MToRTuple uses roughly 1/4 the memory (57M vs. 238M) and about 40% less disk space (58M vs. 98M),
with disk writes about 8x faster and disk reads about 11x faster.
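
To make the idea concrete, here is a minimal Java sketch of a tuple that stays serialized until a
field is actually read. It is not the code in mrtuple.patch: the class LazyTuple, its methods, and
the simple length-prefixed string encoding are all hypothetical, chosen only to illustrate why
spilling and reading back can become plain byte copies.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

/** Hypothetical sketch: a tuple of strings that is kept serialized in memory. */
final class LazyTuple {
    private final byte[] raw;  // serialized record, e.g. as read off the reduce iterator
    private String[] fields;   // parsed fields, filled in only on first access

    LazyTuple(byte[] raw) {
        this.raw = raw;
    }

    /** Assumed encoding for this sketch: an int field count, then each field via writeUTF. */
    static byte[] serialize(String... values) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeInt(values.length);
        for (String v : values) {
            out.writeUTF(v);
        }
        out.flush();
        return buf.toByteArray();
    }

    /** Deserializes on first access only; later calls reuse the parsed fields. */
    String get(int i) throws IOException {
        if (fields == null) {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(raw));
            String[] parsed = new String[in.readInt()];
            for (int k = 0; k < parsed.length; k++) {
                parsed[k] = in.readUTF();
            }
            fields = parsed;
        }
        return fields[i];
    }

    /** Spilling writes a length prefix and the raw bytes; nothing is re-serialized. */
    void spill(DataOutputStream out) throws IOException {
        out.writeInt(raw.length);
        out.write(raw);
    }

    /** Reading a spilled record back is a byte copy; no field objects are created. */
    static LazyTuple readSpilled(DataInputStream in) throws IOException {
        byte[] bytes = new byte[in.readInt()];
        in.readFully(bytes);
        return new LazyTuple(bytes);
    }

    boolean isDeserialized() {
        return fields != null;
    }

    public static void main(String[] args) throws IOException {
        LazyTuple t = new LazyTuple(serialize("alice", "42"));
        ByteArrayOutputStream spillFile = new ByteArrayOutputStream();
        t.spill(new DataOutputStream(spillFile));  // no field has been parsed at this point
        LazyTuple back = readSpilled(new DataInputStream(
                new ByteArrayInputStream(spillFile.toByteArray())));
        System.out.println(back.get(0) + " " + back.get(1));
    }
}
```

Note that get() in this sketch caches the parsed fields next to the raw bytes, so a tuple whose
fields have been read costs more memory than either form alone. A real implementation would have
to decide whether to keep both copies or drop one.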

> Keep tuples serialized to limit spilling and speed it when it happens
> ---------------------------------------------------------------------
>                 Key: PIG-1875
>                 URL: https://issues.apache.org/jira/browse/PIG-1875
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Alan Gates
>            Priority: Minor
>         Attachments: mrtuple.patch
> Currently Pig reads records off of the reduce iterator and immediately deserializes them
> into Java objects.  This takes up much more memory than the serialized form, so Pig spills
> sooner than it would if it stored them serialized.  Also, if it does have to spill, it has
> to serialize them again, and then deserialize them again after reading from the spill file.
> We should explore storing them in memory serialized when they are read off of the reduce


