hadoop-mapreduce-user mailing list archives

From Fei Hu <hufe...@gmail.com>
Subject Re: Spark application Runtime Measurement
Date Sun, 10 Jul 2016 15:37:50 GMT
Hi Mich,

Thank you for your detailed response. I have one more question.

In your case, the total duration of the individual jobs (from the earliest job 47 to the last job 58) equals the time you print out in your code.

But in my case, the total duration of all the individual jobs (17.8 seconds) is much less than the time between the start and the end of the application (120 seconds). After the second job, the application pauses for about 25 seconds before it continues with the final job, and after the final job it takes another 50 seconds for the application to end. Do you know what happens between the individual jobs?
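
In case it helps to pin down where the missing time goes, I am thinking of logging the job boundaries with a SparkListener, roughly like the sketch below (untested, and the names are just illustrative), so I can see whether the extra time falls between the jobs or before/after them:

import java.util.Date
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}

// Rough sketch: print each job's wall-clock start and end so that time not
// covered by any job (before the first job, between jobs, after the last job)
// shows up as gaps between the printed timestamps.
val jobStartTimes = scala.collection.mutable.Map[Int, Long]()
sc.addSparkListener(new SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    jobStartTimes(jobStart.jobId) = jobStart.time
    println(s"Job ${jobStart.jobId} started at ${new Date(jobStart.time)}")
  }
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = {
    val elapsedMs = jobEnd.time - jobStartTimes.getOrElse(jobEnd.jobId, jobEnd.time)
    println(s"Job ${jobEnd.jobId} ended at ${new Date(jobEnd.time)} ($elapsedMs ms)")
  }
})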

Thanks,
Fei



On Sun, Jul 10, 2016 at 1:58 AM, Mich Talebzadeh <mich.talebzadeh@gmail.com>
wrote:

> Hi,
>
> Ultimately, regardless of the timings of the individual components, what matters is the elapsed time from the start of the job to the end of the job. If I do a performance test, I run it three times and average the timings; to me, that average is the time taken between start and end.
>
> Example
>
>
> *println ("\nStarted at"); sqlContext.sql("SELECT
> FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')
> ").collect.foreach(println)*HiveContext.sql("use oraclehadoop")
> val s =
> HiveContext.table("sales").select("AMOUNT_SOLD","TIME_ID","CHANNEL_ID")
> val c = HiveContext.table("channels").select("CHANNEL_ID","CHANNEL_DESC")
> val t = HiveContext.table("times").select("TIME_ID","CALENDAR_MONTH_DESC")
> println ("\ncreating data set at"); sqlContext.sql("SELECT
> FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')
> ").collect.foreach(println)
> val rs =
> s.join(t,"time_id").join(c,"channel_id").groupBy("calendar_month_desc","channel_desc").agg(sum("amount_sold").as("TotalSales"))
> println ("\nfirst query at"); sqlContext.sql("SELECT
> FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')
> ").collect.foreach(println)
> val rs1 =
> rs.orderBy("calendar_month_desc","channel_desc").take(5).foreach(println)
> println ("\nsecond query at"); sqlContext.sql("SELECT
> FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')
> ").collect.foreach(println)
> val rs2
> =rs.groupBy("channel_desc").agg(max("TotalSales").as("SALES")).orderBy("SALES").sort(desc("SALES")).take(5).foreach(println)
> *println ("\nFinished at"); sqlContext.sql("SELECT
> FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')
> ").collect.foreach(println)*
>
> So here I look at the individual timings as well.
>
> Now the Spark UI breaks down the timings for each job and stage.
>
> As far as measurements are concerned, I have the start time as
>
>
> Started at
> [10/07/2016 06:05:55.55]
> res32: org.apache.spark.sql.DataFrame = [result: string]
> s: org.apache.spark.sql.DataFrame = [AMOUNT_SOLD: decimal(10,0), TIME_ID:
> timestamp, CHANNEL_ID: bigint]
> c: org.apache.spark.sql.DataFrame = [CHANNEL_ID: double, CHANNEL_DESC:
> string]
> t: org.apache.spark.sql.DataFrame = [TIME_ID: timestamp,
> CALENDAR_MONTH_DESC: string]
> creating data set at
> [10/07/2016 06:05:56.56]
> rs: org.apache.spark.sql.DataFrame = [calendar_month_desc: string,
> channel_desc: string, TotalSales: decimal(20,0)]
> first query at
> [10/07/2016 06:05:56.56]
> second query at
> [10/07/2016 06:17:18.18]
> Finished at
> [10/07/2016 06:33:35.35]
>
> So the job took 27 minutes 39 seconds (from 06:05:55.55 to 06:33:35.35), or roughly 1659 seconds, to finish.
>
> From Spark UI I have
>
> [image: Inline images 1]
>
> Starting at Job 47 and finishing at job 58 as below
>
> [image: Inline images 2]
>
>
> That adds up to 1623.1 seconds of job duration, but what matters to me is the start and end time, i.e. 2016/07/10 06:05:55 and 2016/07/10 06:33:35, which is what my measurements show, including the elapsed time between the start and end of the jobs.
>
> So from the UI, what matters is the start of the earliest job (47 here) and the end of the last job (58).
>
> I would take that as the indication, while also noting the individual job timings from the UI.
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 10 July 2016 at 04:57, Fei Hu <hufei68@gmail.com> wrote:
>
>> Dear all,
>>
>> I have a question about how to measure the runtime of a Spark application. Here is an example:
>>
>>
>>    - On the Spark UI, the total duration is 2.0 minutes = 120 seconds, as shown below:
>>
>> [image: Screen Shot 2016-07-09 at 11.45.44 PM.png]
>>
>>    - However, when I check the jobs launched by the application, the
>>    time is 13s + 0.8s + 4s = 17.8 seconds, which is much less than 120
>>    seconds. I am not sure which time I should choose to measure the
>>    performance of the Spark application.
>>
>> [image: Screen Shot 2016-07-09 at 11.48.26 PM.png]
>>
>>    - I also checked the event timeline, shown below. There is a big gap
>>    between the second job and the third job, and I do not know what
>>    happened during that gap.
>>
>> [image: Screen Shot 2016-07-09 at 11.53.29 PM.png]
>>
>> Could anyone help explain which time is the right one for measuring the
>> performance of a Spark application?
>>
>> Thanks in advance,
>> Fei
>>
>>
>
