spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Corey Nolet <>
Subject Re: Streaming anomaly detection using ARIMA
Date Mon, 30 Mar 2015 13:30:56 GMT
Taking out the complexity of the ARIMA models to simplify things- I can't
seem to find a good way to represent even standard moving averages in spark
streaming. Perhaps it's my ignorance with the micro-batched style of the
DStreams API.

On Fri, Mar 27, 2015 at 9:13 PM, Corey Nolet <> wrote:

> I want to use ARIMA for a predictive model so that I can take time series
> data (metrics) and perform a light anomaly detection. The time series data
> is going to be bucketed to different time units (several minutes within
> several hours, several hours within several days, several days within
> several years.
> I want to do the algorithm in Spark Streaming. I'm used to "tuple at a
> time" streaming and I'm having a tad bit of trouble gaining insight into
> how exactly the windows are managed inside of DStreams.
> Let's say I have a simple dataset that is marked by a key/value tuple
> where the key is the name of the component who's metrics I want to run the
> algorithm against and the value is a metric (a value representing a sum for
> the time bucket. I want to create histograms of the time series data for
> each key in the windows in which they reside so I can use that histogram
> vector to generate my ARIMA prediction (actually, it seems like this
> doesn't just apply to ARIMA but could apply to any sliding average).
> I *think* my prediction code may look something like this:
> val predictionAverages = dstream
>   .groupByKeyAndWindow(60*60*24, 60*60*24)
>   .mapValues(applyARIMAFunction)
> That is, keep 24 hours worth of metrics in each window and use that for
> the ARIMA prediction. The part I'm struggling with is how to join together
> the actual values so that i can do my comparison against the prediction
> model.
> Let's say dstream contains the actual values. For any time  window, I
> should be able to take a previous set of windows and use model to compare
> against the current values.

View raw message