hadoop-pig-dev mailing list archives

From "Alan Gates (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1037) better memory layout and spill for sorted and distinct bags
Date Mon, 26 Oct 2009 21:20:59 GMT

    [ https://issues.apache.org/jira/browse/PIG-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12770206#action_12770206 ]

Alan Gates commented on PIG-1037:


In InternalSortedBag.add, you are calculating the average size every time you add a tuple
for the first 100 tuples.  Rather than doing the calculation every time, wouldn't it be better
to wait until you get to 100 tuples and then calculate the average once?  This would miss the
case where you can store fewer than 100 tuples, but that seems unlikely.
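
The deferred calculation suggested above could look roughly like the sketch below.  The
names (DeferredSizeEstimator, SAMPLE_SIZE, add(long)) are hypothetical and not taken from
the patch, and the per-tuple size is assumed to be passed in as a plain long rather than
obtained from a Pig Tuple.

```java
/**
 * Hypothetical sketch, not the actual InternalSortedBag code: accumulate the
 * sizes of the first SAMPLE_SIZE tuples and divide once, instead of
 * recomputing the average on every add.
 */
class DeferredSizeEstimator {
    // Number of tuples to sample; 100 matches the figure discussed above.
    static final int SAMPLE_SIZE = 100;

    private long sampledBytes = 0;   // running total of sampled tuple sizes
    private int sampledCount = 0;    // tuples sampled so far
    private long avgTupleSize = -1;  // -1 until the sample is complete

    /** Records one tuple's size in bytes (assumed to be supplied by the caller). */
    void add(long tupleSizeInBytes) {
        if (sampledCount >= SAMPLE_SIZE) {
            return; // sample already complete, nothing more to compute
        }
        sampledBytes += tupleSizeInBytes;
        sampledCount++;
        if (sampledCount == SAMPLE_SIZE) {
            // Single division, performed once when the last sampled tuple arrives.
            avgTupleSize = sampledBytes / SAMPLE_SIZE;
        }
    }

    /** True once the average has been computed. */
    boolean hasEstimate() {
        return avgTupleSize >= 0;
    }

    /** Average tuple size in bytes, or -1 if fewer than SAMPLE_SIZE tuples were added. */
    long averageTupleSize() {
        return avgTupleSize;
    }
}
```

In this sketch there is no estimate until the 100th tuple arrives, which corresponds to the
case mentioned above where fewer than 100 tuples fit in memory.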

Some of the comments in InternalSortedBag that were copied over from the previous code, such
as those about dealing with spills in the midst of reading, are no longer true.  They should be
removed since they will cause confusion about how the code works.

I think the synchronized blocks in InternalSortedBag can be removed.  They were there before
because spills could be triggered by a separate thread.  Since that is no longer true, they
are unnecessary.  Removing them will eliminate a lock/unlock on every read of a record out of
the bag and should provide some speedup.
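
As an illustration of a read path without locking, here is a minimal sketch.  The class
SingleThreadedBag is hypothetical and is not the InternalSortedBag code.  It is only safe
under the assumption stated above, that no other thread can trigger a spill or otherwise
touch the bag while it is being read.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/**
 * Hypothetical sketch: a bag whose iterator returns records without taking a
 * lock on each read. Correct only when a single thread accesses the bag.
 */
class SingleThreadedBag<T> implements Iterable<T> {
    private final List<T> contents = new ArrayList<>();

    void add(T record) {
        contents.add(record); // no synchronized block: single-threaded access assumed
    }

    @Override
    public Iterator<T> iterator() {
        return new Iterator<T>() {
            private int pos = 0; // index of the next record to return

            @Override
            public boolean hasNext() {
                return pos < contents.size();
            }

            @Override
            public T next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                // Plain read, with no lock/unlock per record.
                return contents.get(pos++);
            }
        };
    }
}
```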

> better memory layout and spill for sorted and distinct bags
> -----------------------------------------------------------
>                 Key: PIG-1037
>                 URL: https://issues.apache.org/jira/browse/PIG-1037
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Ying He
>         Attachments: PIG-1037.patch, PIG-1037.patch2

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
