spark-issues mailing list archives

From "Sean Owen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-5452) We are migrating Tera Data SQL to Spark SQL. Query is taking long time. Please have a look on this issue
Date Wed, 28 Jan 2015 10:59:37 GMT

    [ https://issues.apache.org/jira/browse/SPARK-5452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14294995#comment-14294995 ]

Sean Owen commented on SPARK-5452:
----------------------------------

[~Irfan123] I do not think this is suitable for JIRA, so I would not reopen it; however, let's leave it open for the moment. You haven't specified why you think this is exceptionally slow, or narrowed down which part is running slower than expected. Nobody else has your data set, so nobody can verify these details. That information would also make this a good question for user@. As it stands, it is not clear there is an issue in Spark itself, and that is what JIRA is for discussing.
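
To narrow down the slow part, here is a rough sketch of what one could run in the Spark shell (Spark 1.2-era API, matching the issue; the table and view names are taken from the script below, and the timing helper is hypothetical):

```scala
// Assumes the HiveContext `hc` and the temp tables from the issue are registered.
// Hypothetical helper to time each sub-query separately.
def timed[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(s"$label: ${(System.nanoTime() - start) / 1e9} s")
  result
}

// Inspect the query plan for expensive shuffles and joins
// (the query text here is abbreviated for illustration).
hc.sql("EXPLAIN EXTENDED SELECT LMS_ACCOUNT_ID FROM VW_HIVETMP2").collect().foreach(println)

// Force each intermediate result and time it, to find the slow stage.
timed("HIVETMP1") { hc.sql("SELECT COUNT(*) FROM VW_HIVETMP1").collect() }
timed("HIVETMP2") { hc.sql("SELECT COUNT(*) FROM VW_HIVETMP2").collect() }
```

Timings like these, plus the plan output, are exactly the kind of detail that would make a useful question on user@.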


> We are migrating Tera Data SQL to Spark SQL. Query is taking long time. Please have a look on this issue
> --------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-5452
>                 URL: https://issues.apache.org/jira/browse/SPARK-5452
>             Project: Spark
>          Issue Type: Test
>          Components: Spark Shell
>    Affects Versions: 1.2.0
>            Reporter: irfan
>              Labels: SparkSql
>
> Hi Team,
> We are migrating Teradata SQL to Spark SQL. Because of its complexity, we have split it into the 4 sub-queries below,
> and we are running them through a Hive context:
> ====================================================
> val HIVETMP1 = hc.sql("SELECT PARTY_ACCOUNT_ID AS PARTY_ACCOUNT_ID, LMS_ACCOUNT_ID AS LMS_ACCOUNT_ID FROM VW_PARTY_ACCOUNT WHERE PARTY_ACCOUNT_TYPE_CODE IN ('04') AND LMS_ACCOUNT_ID IS NOT NULL")
> HIVETMP1.registerTempTable("VW_HIVETMP1")
> val HIVETMP2 = hc.sql("SELECT PACCNT.LMS_ACCOUNT_ID AS LMS_ACCOUNT_ID, 'NULL' AS RANDOM_PARTY_ACCOUNT_ID, 'NULL' AS MOST_RECENT_SPEND_LA, STXN.PARTY_ACCOUNT_ID AS MAX_SPEND_12WKS_LA, STXN.MAX_SPEND_12WKS_LADATE AS MAX_SPEND_12WKS_LADATE FROM VW_HIVETMP1 AS PACCNT INNER JOIN (SELECT STXTMP.PARTY_ACCOUNT_ID AS PARTY_ACCOUNT_ID, SUM(CASE WHEN (CAST(STXTMP.TRANSACTION_DATE AS DATE) > DATE_SUB(CAST(CONCAT(SUBSTRING(SYSTMP.OPTION_VAL,1,4),'-',SUBSTRING(SYSTMP.OPTION_VAL,5,2),'-',SUBSTRING(SYSTMP.OPTION_VAL,7,2)) AS DATE), 84)) THEN STXTMP.TRANSACTION_VALUE ELSE 0.00 END) AS MAX_SPEND_12WKS_LADATE FROM VW_SHOPPING_TRANSACTION_TABLE AS STXTMP INNER JOIN SYSTEM_OPTION_TABLE AS SYSTMP ON STXTMP.FLAG == SYSTMP.FLAG AND SYSTMP.OPTION_NAME = 'RID' AND STXTMP.PARTY_ACCOUNT_TYPE_CODE IN ('04') GROUP BY STXTMP.PARTY_ACCOUNT_ID) AS STXN ON PACCNT.PARTY_ACCOUNT_ID = STXN.PARTY_ACCOUNT_ID WHERE STXN.MAX_SPEND_12WKS_LADATE IS NOT NULL")
> HIVETMP2.registerTempTable("VW_HIVETMP2")
> val HIVETMP3 = hc.sql("SELECT LMS_ACCOUNT_ID, MAX(MAX_SPEND_12WKS_LA) AS MAX_SPEND_12WKS_LA, 1 AS RANK FROM VW_HIVETMP2 GROUP BY LMS_ACCOUNT_ID")
> HIVETMP3.registerTempTable("VW_HIVETMP3")
> val HIVETMP4 = hc.sql("SELECT PACCNT.LMS_ACCOUNT_ID, 'NULL' AS RANDOM_PARTY_ACCOUNT_ID, 'NULL' AS MOST_RECENT_SPEND_LA, STXN.MAX_SPEND_12WKS_LA AS MAX_SPEND_12WKS_LA, 1 AS RANK1 FROM VW_HIVETMP2 AS PACCNT INNER JOIN VW_HIVETMP3 AS STXN ON PACCNT.LMS_ACCOUNT_ID = STXN.LMS_ACCOUNT_ID AND PACCNT.MAX_SPEND_12WKS_LA = STXN.MAX_SPEND_12WKS_LA")
> HIVETMP4.registerTempTable("WT03_ACCOUNT_BHVR3")
> HIVETMP4.saveAsTextFile("hdfs:/file/")
> ====================================================
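
One observation on the script above (a sketch, not a tested fix): VW_HIVETMP2 is read by both HIVETMP3 and HIVETMP4, so without caching, its whole lineage, including the join and aggregation over the 19.5 GB table, may be recomputed twice. Caching the temp table once could help:

```scala
// VW_HIVETMP2 feeds both HIVETMP3 and HIVETMP4; cache it so the expensive
// join + GROUP BY over the large table is computed only once.
HIVETMP2.registerTempTable("VW_HIVETMP2")
hc.cacheTable("VW_HIVETMP2")   // cacheTable is available on SQLContext/HiveContext in Spark 1.2

// ... define and run HIVETMP3 and HIVETMP4 as before ...

hc.uncacheTable("VW_HIVETMP2") // release the memory when done
```
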
> This query has two GROUP BY clauses which run on large files (19.5 GB), and it took 40 minutes to produce the final result. Are there any changes to the runtime environment or configuration settings in Spark which could improve the query performance?
> Below are our environment and configuration details:
> Environment details:
> Number of nodes: 4
> Memory:           62 GB RAM on each node
> Storage capacity: 9 TB on each node
> Total cores:      48
> Spark Configuration:
>  
> .set("spark.default.parallelism","64")
> .set("spark.driver.maxResultSize","2G")
> .set("spark.driver.memory","10g")
> .set("spark.rdd.compress","true")
> .set("spark.shuffle.spill.compress","true")
> .set("spark.shuffle.compress","true")
> .set("spark.shuffle.consolidateFiles","true/false")
> .set("spark.shuffle.spill","true/false") 
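>
> Two notes on the settings above, offered as general suggestions rather than verified fixes: `spark.shuffle.consolidateFiles` and `spark.shuffle.spill` each take a single boolean, so "true/false" is not a valid value; and for Spark SQL the number of post-shuffle partitions is controlled by `spark.sql.shuffle.partitions` (default 200), not `spark.default.parallelism`. A sketch (the values are illustrative, not tuned recommendations):

```scala
// Boolean-valued shuffle settings take one value, not "true/false".
val conf = new org.apache.spark.SparkConf()
  .set("spark.shuffle.consolidateFiles", "true")
  .set("spark.shuffle.spill", "true")

// For Spark SQL queries, shuffle parallelism is set separately:
hc.setConf("spark.sql.shuffle.partitions", "64") // default is 200
```
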
>  
> Data file sizes:
> SHOPPING_TRANSACTION 19.5 GB
> PARTY_ACCOUNT        1.4 GB
> SYSTEM_OPTIONS       11.6 KB
> Please help us resolve the above issue.
> Thanks,



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

