spark-user mailing list archives

From Diana Carroll <dcarr...@cloudera.com>
Subject Re: non-lazy execution of sortByKey?
Date Mon, 07 Apr 2014 17:06:58 GMT
Aha!  Well I'm not crazy then, thanks.


On Mon, Apr 7, 2014 at 11:51 AM, Mark Hamstra <mark@clearstorydata.com> wrote:

>
> https://issues.apache.org/jira/browse/SPARK-1021?jql=text%20~%20%22sortByKey%22
>
>
> On Mon, Apr 7, 2014 at 8:42 AM, Diana Carroll <dcarroll@cloudera.com> wrote:
>
>> Until today, I was under the impression that *all* Spark transformations
>> were "lazy"...that is, they wouldn't actually execute until an *action*
>> such as count or take was performed.
>>
>> However, today I'm using the "sortByKey" transformation, which would
>> appear to execute immediately, rather than as a result of an action.  Am
>> I misunderstanding something, is this a bug, or is this a deliberate
>> difference between sortByKey and other transformations?
>>
>> Here's my test. I'm parsing a bunch of weblog files and I want to know
>> which users made the most requests.  So my code pulls out the 2nd field of
>> each line (the user ID), adds up the total number of hits for each user ID,
>> swaps user ID/hit count, and sorts by hit count.
>>
>> var userreqs =
>> sc.textFile("file:/home/training/training_materials/sparkdev/data/weblogs/*").
>>    map(_.split(" ")).
>>    map(words => (words(2),1)).
>>    reduceByKey(_ + _).
>>    map(pair => (pair._2,pair._1)).
>>    sortByKey(false)
>>
>> I thought nothing would actually happen here until I did
>> userreqs.take(10) but actually it did execute without the take(). It took
>> about a minute for it to complete and if I look at the web UI I see
>> completed execution of 3 stages:  (Why is sortByKey two stages?)
>>
>> [image: Inline image 2]
>>
>> Something else about this strikes me as odd, too.  If I follow this
>> command by userreqs.take(10), I think it executes the whole thing all over
>> again, but doesn't show all the stages: stage 3 is missing in the UI:
>> [image: Inline image 3]
>>
>>
>> Plus it seems to automatically be caching my results?  Because when I
>> execute "take(10)" repeatedly, subsequent executions are very fast, and
>> trigger only a single stage:
>>
>> [image: Inline image 4]
>>
>> And I confirmed it is caching because I tried deleting the underlying
>> files and the take() still worked.
>>
>> Anyone have any insight?
>>
>> Diana
>>
>
>

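For readers arriving from the archive, the behavior described in the quoted message can be reproduced with a sketch like the one below. It assumes a spark-shell session where `sc` is already defined, and it reuses the input path and field index from the original message; the intermediate name `counts` is introduced here for illustration only. On releases affected by SPARK-1021, sortByKey builds a RangePartitioner, and that partitioner samples the keys to pick its range boundaries, which is why a job starts before any action is called.

```scala
// Assumes spark-shell, where `sc` (the SparkContext) already exists.
// Path and field index are taken from the original message.
val counts = sc.textFile("file:/home/training/training_materials/sparkdev/data/weblogs/*").
  map(_.split(" ")).
  map(words => (words(2), 1)).
  reduceByKey(_ + _).
  map(pair => (pair._2, pair._1))   // still lazy: no job has run yet

// On versions affected by SPARK-1021, this line launches a job by itself:
// the RangePartitioner created here samples the keys to choose boundaries.
val userreqs = counts.sortByKey(false)

// The action that performs the actual sort and returns the top 10 pairs.
userreqs.take(10).foreach(println)
```

The fast repeated take(10) calls are most likely explained by shuffle output reuse rather than RDD caching: Spark keeps the map-side shuffle files on local disk, so later jobs over the same lineage can skip the stages that produced them.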