spark-issues mailing list archives

From "jay vyas (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-4040) calling count() on RDD's emitted from a DStream blocks forEachRDD progress.
Date Tue, 21 Oct 2014 20:54:33 GMT

     [ https://issues.apache.org/jira/browse/SPARK-4040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

jay vyas updated SPARK-4040:
----------------------------
    Description: 
CC [~rnowling] [~willbenton] 
 
It appears that in a DStream context, a call to {{MappedRDD.count()}} blocks progress and prevents emission of RDDs from a stream.

{noformat}
    tweetStream.foreachRDD((rdd, time) => {
      tweetStream.repartition(1)
      // val count = rdd.count()  // DON'T DO THIS!
      checks += 1
      if (checks > 20) {
        ssc.stop()
      }
    })
{noformat} 

The above code block should inevitably halt after 20 intervals of RDDs. However, if we *uncomment the call* to {{rdd.count()}}, it turns out that we get an *infinite stream which emits no RDDs*, and thus our program *runs forever* ({{ssc.stop()}} is unreachable), because *foreachRDD doesn't receive any more entries*.

I suspect this is actually because the foreach block never completes: {{count()}} winds up calling {{compute}}, which ultimately just reads from the stream.

I haven't put together a minimal reproducer or unit test yet, but I can work on one if more info is needed.

I guess this could be seen as an application bug, but I think Spark could be made smarter about throwing its hands up when blocking code is executed in a stream processor.
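Since a minimal reproducer may help triage, here is a rough, untested sketch of what one might look like. The {{queueStream}} source, batch interval, and object name below are illustrative assumptions of mine, not taken from the report above:

```scala
import scala.collection.mutable

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object Spark4040Repro {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("SPARK-4040-repro")
    val ssc = new StreamingContext(conf, Seconds(1))

    // A queue-backed DStream stands in for the Twitter source in the report.
    val queue = new mutable.Queue[RDD[Int]]()
    val stream = ssc.queueStream(queue)

    var checks = 0
    stream.foreachRDD { rdd =>
      // Uncommenting the next line is what reportedly stalls the stream:
      // val count = rdd.count()
      checks += 1
      if (checks > 20) ssc.stop()
    }

    // Enqueue more batches than the 20 we expect to consume before stopping.
    (1 to 30).foreach(_ => queue += ssc.sparkContext.makeRDD(1 to 10))
    ssc.start()
    ssc.awaitTermination()
  }
}
```

If the report is right, this should terminate after roughly 20 batches with the {{rdd.count()}} line commented out, and hang once it is uncommented.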

  was:
CC [~rnowling] [~willbenton] 
 
It appears that in a DStream context, a call to {{MappedRDD.count()}} blocks progress and prevents emission of RDDs from a stream.

{noformat}
    tweetStream.foreachRDD((rdd, time) => {
      tweetStream.repartition(1)
      // val count = rdd.count()  // DON'T DO THIS!
      checks += 1
      if (checks > 20) {
        ssc.stop()
      }
    })
{noformat} 

The above code block should inevitably halt after 20 intervals of RDDs. However, if we *uncomment the call* to {{rdd.count()}}, it turns out that we get an *infinite stream which emits no RDDs*, and thus our program runs forever ({{ssc.stop()}} is unreachable), because *foreachRDD doesn't proceed*.

I haven't put together a minimal reproducer or unit test yet, but I can work on one if more info is needed.



> calling count() on RDD's emitted from a DStream blocks forEachRDD progress.
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-4040
>                 URL: https://issues.apache.org/jira/browse/SPARK-4040
>             Project: Spark
>          Issue Type: Bug
>            Reporter: jay vyas
>
> CC [~rnowling] [~willbenton] 
>  
> It appears that in a DStream context, a call to {{MappedRDD.count()}} blocks progress and prevents emission of RDDs from a stream.
> {noformat}
>     tweetStream.foreachRDD((rdd, time) => {
>       tweetStream.repartition(1)
>       // val count = rdd.count()  // DON'T DO THIS!
>       checks += 1
>       if (checks > 20) {
>         ssc.stop()
>       }
>     })
> {noformat} 
> The above code block should inevitably halt after 20 intervals of RDDs. However, if we *uncomment the call* to {{rdd.count()}}, it turns out that we get an *infinite stream which emits no RDDs*, and thus our program *runs forever* ({{ssc.stop()}} is unreachable), because *foreachRDD doesn't receive any more entries*.
> I suspect this is actually because the foreach block never completes: {{count()}} winds up calling {{compute}}, which ultimately just reads from the stream.
> I haven't put together a minimal reproducer or unit test yet, but I can work on one if more info is needed.
> I guess this could be seen as an application bug, but I think Spark could be made smarter about throwing its hands up when blocking code is executed in a stream processor.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

