hadoop-pig-dev mailing list archives

From "Sriranjan Manjunath (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1102) Collect number of spills per job
Date Wed, 23 Dec 2009 20:38:29 GMT

    [ https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794217#action_12794217 ]

Sriranjan Manjunath commented on PIG-1102:

(3) refers to the case where we try to guess the number of records that fit into memory and
start spilling the other records. InternalCachedBag.java addresses this case:

+                if (cacheLimit != 0 && mContents.size() % cacheLimit == 0) {
+                    /* Increment the spill count */
+                    incSpillCount(PigCounters.PROACTIVE_SPILL_COUNT);
+                }

cacheLimit holds the number of records that can be held in memory, whereas mContents is the
collection of tuples that holds all the records. Here, I do not increment the counter for every
record. Instead, I count every nth record, n being the cacheLimit.

However, this does not increment the counter by the buffer size. Incrementing it by the buffer
size would give us a value that is approximately equal to the number of spilled records.
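
To make the difference between the two counting schemes concrete, here is a minimal standalone
sketch. It is not the Pig code: the class SpillCounterSketch and its fields spillEvents and
spilledRecordsEstimate are hypothetical names, and a plain list stands in for mContents. It
assumes, as described above, that the collection size keeps growing as records are added.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the counting logic discussed above; not Pig code. */
public class SpillCounterSketch {
    private final int cacheLimit;                            // records assumed to fit in memory
    private final List<Object> contents = new ArrayList<>(); // plays the role of mContents
    private long spillEvents = 0;             // +1 for every cacheLimit-th record (as in the patch)
    private long spilledRecordsEstimate = 0;  // +cacheLimit for every cacheLimit-th record

    public SpillCounterSketch(int cacheLimit) {
        this.cacheLimit = cacheLimit;
    }

    public void add(Object record) {
        contents.add(record);
        // Same guard as the patch: skip counting when the limit is 0,
        // and only count on every cacheLimit-th record.
        if (cacheLimit != 0 && contents.size() % cacheLimit == 0) {
            spillEvents++;
            spilledRecordsEstimate += cacheLimit;
        }
    }

    public long getSpillEvents() {
        return spillEvents;
    }

    public long getSpilledRecordsEstimate() {
        return spilledRecordsEstimate;
    }

    public static void main(String[] args) {
        SpillCounterSketch bag = new SpillCounterSketch(100);
        for (int i = 0; i < 1050; i++) {
            bag.add(i);
        }
        System.out.println(bag.getSpillEvents());            // prints 10
        System.out.println(bag.getSpilledRecordsEstimate()); // prints 1000
    }
}
```

With 1050 records and a limit of 100, the first counter reports 10 while the second reports 1000,
which is much closer to the record count. It is still only an approximation, since it is rounded
down to a multiple of cacheLimit and does not distinguish records that stayed in memory.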

> Collect number of spills per job
> --------------------------------
>                 Key: PIG-1102
>                 URL: https://issues.apache.org/jira/browse/PIG-1102
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Sriranjan Manjunath
>             Fix For: 0.7.0
>         Attachments: PIG_1102.patch, PIG_1102.patch.1
> Memory shortage is one of the main performance issues in Pig. Knowing when we spill to
the disk is useful for understanding query performance and also to see how certain changes
in Pig affect that.
> Other interesting stats to collect would be average CPU usage and max mem usage but I
am not sure if this information is easily retrievable.
> Using Hadoop counters for this would make sense.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
