spark-reviews mailing list archives

From YuhuWang2002 <...@git.apache.org>
Subject [GitHub] spark issue #15297: [SPARK-9862]Handling data skew
Date Tue, 25 Oct 2016 06:10:06 GMT
Github user YuhuWang2002 commented on the issue:

    https://github.com/apache/spark/pull/15297
  
    I ran some performance tests comparing the skew join algorithm against the normal join (skew join not used).
    I generated two tables, with 1/5 data skew in table S and 1/10000 data skew in table R. Both
tables are skewed on the same key.
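    For readers who want to reproduce a similar setup, here is a minimal sketch of generating skewed join keys. It assumes "1/5 data skew" means that 1/5 of the rows share one hot key, which is an interpretation and not stated in the test description. The function name `skewed_keys`, the hot key value 0, and the scaled-down row counts are all hypothetical.

```python
import random


def skewed_keys(n_rows, skew_fraction, hot_key=0, seed=42):
    """Return n_rows join keys where skew_fraction of them equal hot_key.

    The remaining keys are distinct positive integers, so only hot_key repeats.
    """
    rng = random.Random(seed)
    n_hot = round(n_rows * skew_fraction)
    # Hot rows all share one key; every other row gets its own key.
    keys = [hot_key] * n_hot + list(range(1, n_rows - n_hot + 1))
    rng.shuffle(keys)
    return keys


# Scaled-down stand-ins for S (1/5 skew) and R (1/10000 skew),
# both skewed on the same key (0).
s_keys = skewed_keys(10_000, 1 / 5)
r_keys = skewed_keys(100_000, 1 / 10_000)
```

    With both tables skewed on the same key, the join output for that key grows with the product of the two hot-row counts, which is what makes the skewed partition expensive.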
    
    spark.sql.adaptive.skewjoin.threshold   6000000
    spark.sql.adaptive.shuffle.targetPostShuffleInputSize   5000000
    row counts: S 10000000 rows; R 100000000 rows
    sql:
    select count(*) from R,S where rid=sid and sname>'wang9' and rname > 'zhang9';
    
    skew algorithm : 167.695s
    normal algorithm: 303.922s
    
    R2_txt has 100000000 rows and no data skew.
    sql: select count(*) from R2_txt,S where rid=sid and sname>'wang' and rname > 'zhang9';
    skew algorithm : 38.717s
    normal algorithm: 114.21s
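
    As a quick check on the timings above, the speedup is the normal-algorithm time divided by the skew-algorithm time. The helper name `speedup` is only for illustration.

```python
def speedup(normal_seconds, skew_seconds):
    """Ratio of normal-algorithm time to skew-algorithm time."""
    return normal_seconds / skew_seconds


# Timings reported above, in seconds.
skewed_case = speedup(303.922, 167.695)   # R joined with S (both skewed)
unskewed_case = speedup(114.21, 38.717)   # R2_txt joined with S

print(f"R x S: {skewed_case:.2f}x, R2_txt x S: {unskewed_case:.2f}x")
```

    This works out to roughly 1.8x for the R and S query and roughly 2.9x for the R2_txt and S query.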


