On the same note, a parallelizable form of AVF (Attribute Value
Frequency) looks promising for rapid outlier detection using
Hadoop. A paper titled "A Fast Parallel Outlier Detection for
Categorical Datasets Using MapReduce" gives more info.
I am looking for various techniques and tools that would enable me to
detect and score outliers on massive datasets that might be streaming.
I have just begun studying some techniques and have got some pointers
from you all.
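
To make the AVF idea above concrete, here is a minimal sketch of how I
understand it, not code from the paper: a record's score is the mean
frequency of its attribute values, so records made of rare values score
lowest. The function names and the toy data are my own, and the
per-partition counting only imitates what a Hadoop map/reduce pass
would do.

```python
from collections import Counter


def count_partition(records):
    """'Map' step: count (column, value) pairs within one data partition."""
    counts = Counter()
    for record in records:
        for column, value in enumerate(record):
            counts[(column, value)] += 1
    return counts


def merge_counts(partial_counts):
    """'Reduce' step: sum the per-partition counts into global counts."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)  # Counter.update adds counts together
    return total


def avf_scores(records, counts):
    """Second pass: score each record by the mean frequency of its values."""
    return [
        sum(counts[(column, value)] for column, value in enumerate(record))
        / len(record)
        for record in records
    ]


# Toy categorical data, split into two hypothetical partitions.
partition_a = [("red", "small"), ("red", "small")]
partition_b = [("red", "large"), ("blue", "tiny")]
all_records = partition_a + partition_b

global_counts = merge_counts(
    [count_partition(partition_a), count_partition(partition_b)]
)
scores = avf_scores(all_records, global_counts)

# The lowest-scoring record is the strongest outlier candidate.
outlier = min(zip(scores, all_records))[1]
print(scores, outlier)
```

Since the counting step only needs sums, each partition can be counted
independently and merged afterwards, which is what makes the method
look like a good fit for MapReduce.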
Thanks,
Srinivas.
On Wednesday, November 3, 2010, Srivathsan Srinivas
wrote:
> Thanks. I am reading a recent paper of Keogh's: "Time series shapelets:
> a novel technique that allows accurate, interpretable and fast
> classification", published in Springer's Data Mining and Knowledge
> Discovery, 18 June 2010.
>
> I am basically skimming several papers with different ideas to see
> what can be easily and efficiently parallelized using Hadoop...
>
> Thanks much for pointing to the presentation and the paper.
>
> Srinivas.
>
> On Wednesday, November 3, 2010, Federico Castanedo wrote:
>> Hi,
>>
>> 2010/11/1 Srivathsan Srinivas :
>>> Dear Ted,
>>>
>>> Thanks for pointing to the Dirichlet mixture model. I shall look into that.
>>>
>>> Basically, I am looking into the autocorrelation function, Control Charts,
>>> Moving Average, Population Stability, and Poisson regression (much of the
>>> data can be described as daily or hourly counts). I'd like to build a tool that
>>> would blend these approaches into a scorecard for proactive alerting on any
>>> outliers...
>>>
>>> For the above, I am interested in seeing how the time-series data can be
>>> broken into manageable segments and distributed to different machines in
>>> a Hadoop cluster.
>>>
>> I've never seen anything similar in Hadoop, but my suggestion for a
>> good algorithm for segmenting time series is:
>>
>> Sliding Window And Bottom-Up (SWAB) from Keogh et al. Here is the paper:
>>
>> http://www.cs.ucr.edu/~eamonn/icdm-01.pdf
>>
>> and here a presentation:
>> www-scf.usc.edu/~selinach/segmentation-slides.pd
>>
>>
>>> Thanks again,
>>> Sri.
>>>
>>>
>>> On Mon, Nov 1, 2010 at 10:21 AM, Ted Dunning wrote:
>>>
>>>> There is nothing explicit in Mahout for this, but you could use the
>>>> Dirichlet mixture model clustering to do this.
>>>>
>>>> The idea would be to express your different observed time series or short
>>>> segments of time sequences as mixture models and then find regions that
>>>> are not well described by this mixture model. Ideally, you would have a
>>>> Markov model underneath the mixture coefficients, but that is out of scope
>>>> for what Mahout does for you right off the bat. It wouldn't be too hard to
>>>> merge the HMM code and the DP clustering to get this, though.
>>>>
>>>> So the answer is no.
>>>>
>>>> But Mahout would be a decent substrate for building your own.
>>>>
>>>> On Mon, Nov 1, 2010 at 8:02 AM, Srivathsan Srinivas <
>>>> srivathsan.srinivas@gmail.com> wrote:
>>>>
>>>> > Hi,
>>>> > Any pointers to techniques/papers that detect outliers in time series
>>>> > of very large data sets using Mahout? I am interested in seeing what
>>>> > techniques are favorable for use in large-scale distributed systems using
>>>> > Hadoop/Mahout.
>>>> >
>>>> > Thanks,
>>>> > Sri.
>>>> >
>>>>
>>>
>>
>