pig-dev mailing list archives

From "Rohini Palaniswamy (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PIG-4449) Optimize the case of Order by + Limit in nested foreach
Date Fri, 19 May 2017 23:33:04 GMT

    [ https://issues.apache.org/jira/browse/PIG-4449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018164#comment-16018164 ]

Rohini Palaniswamy commented on PIG-4449:

PIG-5211 has addressed this problem. The current implementation in PIG-5211 adds each element
to a PriorityQueue and then removes one if the queue exceeds the size limit. Leaving this
JIRA open to track further optimizations that I had thought of for this issue.
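For reference, the add-then-evict pattern that PIG-5211 implements can be sketched as follows. This is a minimal standalone illustration, not the actual Pig code; the class and method names are made up:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: keep at most `limit` elements in a min-heap.
// After each add, if the heap has grown past the limit, evict the head
// (the smallest element), so the largest `limit` elements survive --
// matching a "order ... desc; limit ... $num" nested foreach.
public class BoundedTopN {
    public static List<Integer> topN(Iterable<Integer> input, int limit) {
        PriorityQueue<Integer> heap = new PriorityQueue<>(); // min-heap
        for (Integer t : input) {
            heap.add(t);
            if (heap.size() > limit) {
                heap.poll(); // evict the smallest once over the cap
            }
        }
        List<Integer> result = new ArrayList<>(heap);
        result.sort(Comparator.reverseOrder()); // descending output order
        return result;
    }

    public static void main(String[] args) {
        System.out.println(topN(List.of(5, 1, 9, 3, 7, 2), 3)); // [9, 7, 5]
    }
}
```

Note that every element is still inserted into the heap before the size check, which is the extra work the optimizations below try to avoid.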

1) Change to using https://google.github.io/guava/releases/11.0/api/docs/com/google/common/collect/TreeMultiset.html
. Internally it is a Map<E, AtomicInteger> which keeps a count of duplicate entries, which
should save space. It also allows peeking at the first and last entry, so after reaching the
limit we can check whether the new element is greater than the last entry (ascending sort)
or smaller than the first entry (descending sort) and avoid adding it in the first place.
2) Use https://docs.oracle.com/javase/7/docs/api/java/util/TreeSet.html when the sort is combined with a DISTINCT.
3) Add support for spill. We have seen cases where people do LIMIT 100000.
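To make optimization 1) concrete, here is a rough standalone sketch of a capped sorted bag with early rejection for an ascending sort. It uses a stdlib TreeMap<E, count> to stand in for Guava's TreeMultiset (the same element-to-duplicate-count layout the comment describes); all names here are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a limit-capped sorted bag (ascending sort).
// TreeMap<element, duplicate count> emulates Guava's TreeMultiset.
public class CappedSortedBag {
    private final TreeMap<Integer, Integer> counts = new TreeMap<>();
    private final int limit;
    private int size = 0;

    public CappedSortedBag(int limit) { this.limit = limit; }

    public void add(Integer e) {
        // Early rejection: once full, an element >= the current last
        // entry can never be among the smallest `limit`, so skip it
        // without touching the tree at all.
        if (size == limit && e.compareTo(counts.lastKey()) >= 0) {
            return;
        }
        counts.merge(e, 1, Integer::sum);
        size++;
        if (size > limit) { // evict one copy of the largest entry
            Map.Entry<Integer, Integer> last = counts.lastEntry();
            if (last.getValue() == 1) counts.pollLastEntry();
            else counts.put(last.getKey(), last.getValue() - 1);
            size--;
        }
    }

    public List<Integer> contents() {
        List<Integer> out = new ArrayList<>();
        counts.forEach((k, c) -> { for (int i = 0; i < c; i++) out.add(k); });
        return out;
    }
}
```

A descending sort would be symmetric: peek at the first entry instead of the last and reject new elements that are smaller.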

> Optimize the case of Order by + Limit in nested foreach
> -------------------------------------------------------
>                 Key: PIG-4449
>                 URL: https://issues.apache.org/jira/browse/PIG-4449
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Rohini Palaniswamy
>              Labels: Performance
> This is a very frequently used pattern:
> {code}
> grouped_data_set = group data_set by id;
> capped_data_set = foreach grouped_data_set
> {
>   ordered = order data_set by timestamp desc;
>   capped = limit ordered $num;
>   generate flatten(capped);
> };
> {code}
> But this performs very poorly, with lots of spills, when there are millions of rows for
> a key in the group by. This can easily be optimized by pushing the limit into the
> InternalSortedBag, maintaining only $num records at any time and avoiding memory pressure.

This message was sent by Atlassian JIRA
