spark-issues mailing list archives

From "Kannan Subramanian (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SPARK-21577) Issue in handling too many aggregations
Date Sun, 30 Jul 2017 12:47:00 GMT
Kannan Subramanian created SPARK-21577:
------------------------------------------

             Summary: Issue in handling too many aggregations
                 Key: SPARK-21577
                 URL: https://issues.apache.org/jira/browse/SPARK-21577
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.6.0
         Environment: Cloudera CDH 1.8.3
Spark 1.6.0
            Reporter: Kannan Subramanian


My requirement: I am reading a table from Hive (size around 1.6 TB) and have to
perform more than 200 aggregation operations, mostly avg, sum, and stddev. The
Spark application's total execution time is more than 12 hours. To optimize the
code I tried shuffle partitioning, memory tuning, and so on, but it was not
helpful. Please note that I ran the same query in Hive on MapReduce, and the MR
job completed in only around 5 hours. Kindly let me know whether there is any
way to optimize this, or a more efficient way of handling multiple aggregation
operations.

    val inputDataDF = hiveContext.read.parquet("/inputparquetData")
    inputDataDF.groupBy("seq_no", "year", "month", "radius")
      .agg(count($"Dseq"), avg($"Emp"), avg($"Ntw"), avg($"Age"),
           avg($"DAll"), avg($"PAll"), avg($"DSum"), avg($"dol"),
           sum($"sl"), sum($"PA"), sum($"DS"), ...) // ... like 200 columns
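As a side note on writing such a query: rather than spelling out ~200 `agg`
calls by hand, the aggregation spec can be built programmatically, since
`DataFrame.agg` also accepts a `Map` of column name to aggregate-function
name. A minimal sketch, assuming hypothetical short column lists (the real
job would have ~200 entries, and `inputDataDF` comes from the snippet above):

```scala
// Sketch only: generate the aggregation spec instead of hand-writing it.
// The column lists are illustrative stand-ins for the ~200 real columns.
val avgCols = Seq("Emp", "Ntw", "Age", "DAll", "PAll", "DSum", "dol")
val sumCols = Seq("sl", "PA", "DS")

// Map each column to the aggregate function to apply to it.
val aggMap: Map[String, String] =
  (avgCols.map(_ -> "avg") ++ sumCols.map(_ -> "sum")).toMap

// With a live HiveContext, this would replace the long hand-written agg(...):
// inputDataDF.groupBy("seq_no", "year", "month", "radius").agg(aggMap)
```

This keeps the query in a single `groupBy(...).agg(...)` pass, which matters
here: all aggregates are computed in one shuffle rather than one job per
aggregation.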



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

