spark-user mailing list archives

From Andrew Ehrlich <and...@aehrlich.com>
Subject Re: the spark job is so slow - almost frozen
Date Wed, 20 Jul 2016 03:35:19 GMT
Try:

- filtering the data down as early as possible in the job and dropping columns you don’t need (see the sketch after this list)
- processing fewer partitions of the Hive tables at a time
- caching frequently accessed data, for example dimension tables, lookup tables, or other datasets that are read more than once
- using the Spark UI to identify the bottlenecked resource
- removing features or columns from the output data until the job runs, then adding them back one at a time
- creating a static dataset small enough to work with, then editing the query and retesting repeatedly until you cut the execution time by a significant fraction
- using the Spark UI or the spark shell to check for skew and make sure partitions are evenly distributed
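
A minimal spark-shell sketch of the filtering, caching, and skew-check points above. The table, column, and date values here are made up, and it assumes a Spark 1.6-style HiveContext:

    // In spark-shell, sqlContext is usually already a HiveContext; created explicitly here
    val hc = new org.apache.spark.sql.hive.HiveContext(sc)

    // Filter as early as possible and keep only the columns you need
    val facts = hc.table("db.fact_table")
      .filter("event_date >= '2016-07-01'")
      .select("id", "dim_id", "amount")

    // Cache a small, frequently joined dimension table
    val dims = hc.table("db.dim_table").select("dim_id", "name").cache()

    val joined = facts.join(dims, Seq("dim_id"))

    // Rough skew check: print the row count of each partition
    joined.rdd.mapPartitions(it => Iterator(it.size)).collect().foreach(println)

If the printed counts are wildly uneven, the join key is probably skewed and the largest partition will dominate the runtime.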

> On Jul 18, 2016, at 3:33 AM, Zhiliang Zhu <zchl.jump@yahoo.com.INVALID> wrote:
> 
> Thanks a lot for your reply.
> 
> In effect, here we tried running the SQL on Kettle, Hive, and Spark Hive (via HiveContext) respectively; in each case the job seems to freeze and never finishes.
> 
> From the 6 tables we need to read different columns for specific information, then do some simple calculation before output.
> The join operation is used the most in the SQL.
> 
> Best wishes! 
> 
> 
> 
> 
> On Monday, July 18, 2016 6:24 PM, Chanh Le <giaosudau@gmail.com> wrote:
> 
> 
> Hi,
> What about the network (bandwidth) between Hive and Spark?
> Did it run in Hive before you moved it to Spark?
> Because it's complex, you can use something like the EXPLAIN command to show what is going on.
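> For example, from the spark shell (a minimal sketch; the query here is made up):
> 
>     val df = sqlContext.sql("SELECT a.id, b.amount FROM t1 a JOIN t2 b ON a.id = b.id")
>     df.explain(true)   // prints the analyzed, optimized, and physical plans
> 
> or, directly in SQL, EXPLAIN EXTENDED followed by the query.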
> 
> 
> 
> 
>  
>> On Jul 18, 2016, at 5:20 PM, Zhiliang Zhu <zchl.jump@yahoo.com.INVALID> wrote:
>> 
>> The SQL logic in the program is very complex, so I will not describe the detailed code here.
>> 
>> 
>> On Monday, July 18, 2016 6:04 PM, Zhiliang Zhu <zchl.jump@yahoo.com.INVALID> wrote:
>> 
>> 
>> Hi All,  
>> 
>> Here we have one application: it needs to extract different columns from 6 Hive tables and then do some simple calculation; there are around 100,000 rows in each table.
>> Finally it needs to output another table or file (with a consistent set of columns).
>> 
>> However, after many days of trying, the Spark Hive job is unthinkably slow - sometimes almost frozen. There are 5 nodes in the Spark cluster.
>>  
>> Could anyone offer some help? Any idea or clue would also be good.
>> 
>> Thanks in advance~
>> 
>> Zhiliang 
>> 
>> 
> 
> 
> 

