spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From David Webber <david.web...@gmail.com>
Subject RDD boundaries and triggering processing using tags in the data
Date Wed, 27 May 2015 19:52:57 GMT
Hi All,

I'm new to Spark and I'd like some help understanding if a particular use
case would be a good fit for Spark Streaming.

I have an imaginary stream of sensor data consisting of integers 1-10. 
Every time the sensor reads "10" I'd like to average all the numbers that
were received since the last "10"

example input: 10 5 8 4 6 2 1 2 8 8 8 1 6 9 1 3 10 1 3 10 ...
desired output: 4.8, 2.0

I'm confused about what happens if sensor readings fall into different RDDs.  

RDD1:  10 5 8 4 6 2 1 2 8 8 8
RDD2:  1 6 9 1 3 10 1 3 10
output: ???, 2.0

My imaginary sensor doesn't read at fixed time intervals, so breaking the
stream into RDDs by time interval won't ensure the data is packaged
properly.  Additionally, multiple sensors are writing to the same stream
(though I think flatMap can parse the origin stream into streams for
individual sensors, correct?).  

My best guess for processing goes like
1) flatMap() to break out individual sensor streams
2) Custom parser to accumulate input data until "10" is found, then create a
new output RDD for each sensor and data grouping
3) average the values from step 2

I would greatly appreciate pointers to some specific documentation or
examples if you have seen something like this before.

Thanks,
David



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDD-boundaries-and-triggering-processing-using-tags-in-the-data-tp23060.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Mime
View raw message