spark-reviews mailing list archives

From tdas <...@git.apache.org>
Subject [GitHub] spark pull request: [SPARK-5372][Streaming] Change the default sto...
Date Fri, 30 Jan 2015 02:33:21 GMT
Github user tdas commented on the pull request:

    https://github.com/apache/spark/pull/4167#issuecomment-72144338
  
    Well, it is generally considered outside the scope of Spark Streaming to use windows that large!
If you have to process data that is a day old, then you probably need a dedicated storage
system. Spark Streaming is not a storage system, so using it for long-term data storage is
using it outside its design space. Breaking default-case performance for those out-of-the-design-space
scenarios is not the right solution. Those should be handled by changing the storage level
directly. And users who need that sort of performance across such large windows obviously
need to learn a bit more about Spark Streaming. We can probably help them learn. Maybe add
some stuff to the programming guide?
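
    For reference, changing the storage level directly could look roughly like the sketch
below. It assumes the PySpark DStream API (`window` and `persist`); the helper name
`persist_large_window` and its parameters are made up for illustration and are not part of Spark.

```python
def persist_large_window(stream, window_seconds, slide_seconds, storage_level):
    """Window a DStream and set the windowed stream's storage level explicitly.

    Hypothetical helper: `stream` is assumed to behave like a PySpark DStream
    (it needs `window` and `persist`), and `storage_level` would normally be a
    pyspark StorageLevel such as StorageLevel.MEMORY_AND_DISK.
    """
    # Durations are in seconds, as in the PySpark DStream API.
    windowed = stream.window(window_seconds, slide_seconds)
    # Override the storage level for this one stream only, so the default
    # behaviour of every other windowed stream is left untouched.
    return windowed.persist(storage_level)
```

With PySpark this might be called as `persist_large_window(lines, 24 * 3600, 60, StorageLevel.MEMORY_AND_DISK)`,
where `StorageLevel` is imported from `pyspark` and `lines` is an existing DStream.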
    
    An alternative, more sophisticated solution is to detect when such spillover is continuously
happening and print a suggestion (log4j warning) saying "not enough memory to store the whole
window, consider using MEMORY_AND_DISK for the windowed stream". This is definitely trickier
to do, but it is a safer solution that does not involve a regression. 
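
    The detection idea could be sketched as below. This is only an illustration in plain
Python, not Spark code: the class `SpilloverMonitor`, the `consecutive_batches` threshold, and
the assumption that a per-batch count of blocks evicted from memory is available are all
hypothetical, and Python's `logging` stands in for log4j.

```python
import logging

logger = logging.getLogger("windowed_stream_monitor")

WARNING_TEXT = (
    "not enough memory to store the whole window, "
    "consider using MEMORY_AND_DISK for the windowed stream"
)


class SpilloverMonitor:
    """Warn once when eviction is seen in several consecutive batches."""

    def __init__(self, consecutive_batches=5):
        if consecutive_batches < 1:
            raise ValueError("consecutive_batches must be at least 1")
        self.consecutive_batches = consecutive_batches
        self._streak = 0       # batches in a row that evicted something
        self._warned = False   # True once the current streak has been reported

    def record_batch(self, evicted_blocks):
        """Record one batch; return True if this call emitted the warning."""
        if evicted_blocks > 0:
            self._streak += 1
        else:
            # A batch without eviction ends the streak and re-arms the warning.
            self._streak = 0
            self._warned = False
        if self._streak >= self.consecutive_batches and not self._warned:
            logger.warning(WARNING_TEXT)
            self._warned = True
            return True
        return False
```

Requiring a streak rather than a single eviction keeps a one-off memory blip from producing
a warning, and warning once per streak avoids flooding the log.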


