pig-dev mailing list archives

From "Rohini Palaniswamy (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PIG-3395) Large filter expression makes Pig hang
Date Mon, 29 Jul 2013 20:05:50 GMT

    [ https://issues.apache.org/jira/browse/PIG-3395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13722907#comment-13722907 ]

Rohini Palaniswamy commented on PIG-3395:
-----------------------------------------

Before, the whole and/or tree would not be pushed down. The visit of the lhs and rhs is still
there, but I am not sure how the replace will behave, because it does not have the full context
and something partial might get pushed down. Can you modify your test case to include one of
those conditions, to test the behaviour when we have a cast or null check?
                
> Large filter expression makes Pig hang
> --------------------------------------
>
>                 Key: PIG-3395
>                 URL: https://issues.apache.org/jira/browse/PIG-3395
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>            Reporter: Cheolsoo Park
>            Assignee: Cheolsoo Park
>             Fix For: 0.12
>
>         Attachments: PIG-3395.patch, thread_dump.txt
>
>
> Currently, partition filter push down is quite costly. For example, if you have many nested or/and expressions, Pig hangs:
> {code}
> base = load '<partitioned table>' using MyStorage();
> filt = filter base by
> (dateint == 20130719 and batchid == 'merged_1' and hour IN (19,20,21,22,23))
> or
> (dateint == 20130720 and batchid == 'merged_1' and hour IN (0,1,2,3,4,5,6,7,8))
> or
> (dateint == 20130720 and batchid == 'merged_2' and hour == 7)
> or
> (dateint == 20130720 and batchid == 'merged_1' and hour IN (9,10,11,12,13,14,15,16,17,18,19,20,21,22,23))
> or
> (dateint == 20130721 and batchid == 'merged_1' and hour IN (0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23))
> or
> (dateint == 20130722 and batchid == 'merged_1' and hour IN (0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16));
> dump filt;
> {code}
> Note that the IN operator is converted to nested ORs by the Pig parser.
> Looking at the thread dump, I found that it creates almost 60 stack frames, which makes the JVM suffer. (I will attach the full stack trace.)
> {code}
> <repeated ...>
> at org.apache.pig.newplan.PColFilterExtractor.visit(PColFilterExtractor.java:504)
> at org.apache.pig.newplan.PColFilterExtractor.visit(PColFilterExtractor.java:237)
> at org.apache.pig.newplan.PColFilterExtractor.visit(PColFilterExtractor.java:504)
> at org.apache.pig.newplan.PColFilterExtractor.visit(PColFilterExtractor.java:214)
> at org.apache.pig.newplan.PColFilterExtractor.visit(PColFilterExtractor.java:504)
> at org.apache.pig.newplan.PColFilterExtractor.visit(PColFilterExtractor.java:211)
> at org.apache.pig.newplan.PColFilterExtractor.visit(PColFilterExtractor.java:108)
> {code}
> Although the filter expression can be simplified, it seems possible to make PColFilterExtractor more efficient.
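
To illustrate the note about IN being converted to nested ORs, here is a minimal Java sketch of how such an expansion makes the expression tree deeper with every extra value. It is not Pig's code: the class InExpansionSketch, the node types Equals and Or, and the left-deep shape of the OR chain are all assumptions made for illustration, and the sketch does not reproduce what PColFilterExtractor actually does.

```java
public class InExpansionSketch {
    /** Hypothetical expression node: either a leaf comparison or a binary OR. */
    interface Expr {}

    /** Leaf node standing in for `column == value`. */
    static final class Equals implements Expr {
        final String column;
        final int value;
        Equals(String column, int value) { this.column = column; this.value = value; }
    }

    /** Binary OR over two sub-expressions. */
    static final class Or implements Expr {
        final Expr lhs;
        final Expr rhs;
        Or(Expr lhs, Expr rhs) { this.lhs = lhs; this.rhs = rhs; }
    }

    /** Expands `column IN (v1, ..., vn)` into a left-deep chain of ORs (assumed shape). */
    static Expr expandIn(String column, int... values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("IN needs at least one value");
        }
        Expr result = new Equals(column, values[0]);
        for (int i = 1; i < values.length; i++) {
            result = new Or(result, new Equals(column, values[i]));
        }
        return result;
    }

    /** Number of nodes on the longest root-to-leaf path, computed recursively. */
    static int depth(Expr e) {
        if (e instanceof Or) {
            Or or = (Or) e;
            return 1 + Math.max(depth(or.lhs), depth(or.rhs));
        }
        return 1;
    }

    public static void main(String[] args) {
        int[] hours = new int[24];
        for (int h = 0; h < 24; h++) {
            hours[h] = h;
        }
        System.out.println("tree depth for 24 values: " + depth(expandIn("hour", hours)));
    }
}
```

Under these assumptions, an IN list of n values yields a tree of depth n, so a recursive visitor needs at least one nested call per value before it reaches the deepest leaf, and the surrounding and/or nesting adds further levels on top.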

