mahout-user mailing list archives

From "Patterson, Josh" <jpatters...@tva.gov>
Subject RE: mahout examples
Date Mon, 30 Nov 2009 16:47:33 GMT
I believe Ted is correct; the distance function is the key with k-means.

Another suggestion that I've heard for comparing similarity between time series data is comparing
"energy", or something along the lines of CUSUM:

http://www.variation.com/cpa/help/hs108.htm
http://www.itl.nist.gov/div898/handbook/pmc/section3/pmc323.htm
http://www.public.iastate.edu/~vardeman/asqcourse/cusumcharts.pdf
http://www.minitab.com/uploadedFiles/Shared_Resources/Documents/Articles/cumulative_sum_charts_for_small_shifts.pdf
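Roughly, a one-sided CUSUM statistic accumulates deviations above a target level and resets when the series falls back, so a sustained shift shows up as a growing sum. A minimal sketch in Python (the target and slack values here are made up for illustration; see the links above for the real charting procedure):

```python
def cusum(samples, target, slack):
    """One-sided (upper) cumulative-sum statistic.

    Accumulates how far each sample sits above target + slack;
    the statistic decays toward zero when samples drop back below.
    A large running value signals a sustained upward shift.
    """
    s = 0.0
    out = []
    for x in samples:
        s = max(0.0, s + x - target - slack)
        out.append(s)
    return out

# A flat series stays at zero; a shifted tail accumulates.
flat = cusum([10, 10, 10, 10], target=10, slack=0.5)      # [0.0, 0.0, 0.0, 0.0]
shifted = cusum([10, 10, 12, 12, 12], target=10, slack=0.5)  # [0.0, 0.0, 1.5, 3.0, 4.5]
```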

Waveforms could be broken down into sequences (or even n-grams) of units of energy change,
which might simplify the comparison calculation somewhat. I've used a similar technique (task-response
thresholds) when working with Ant Colony mechanics:

http://jpatterson.floe.tv/index.php/2008/09/25/discovery-and-social-insects/
http://jpatterson.floe.tv/index.php/2009/07/19/tinyos-and-tinytermite/


in that enough change over a time period triggers a task-change / response function. Here
I think it's simply "more energy change than in the previous N samples triggers an output",
which ends up being a feature vector. I believe one could see an "energy pattern" derived
from a waveform, and that pattern could indicate an aspect of similarity. I think this would
alleviate the vertical shift problem; you could then focus on the horizontal shift issue, which
has been dealt with in IR by various forms of n-gram techniques, intersection of forward
and reverse b-trees for wildcard sequences / comparisons, etc. Of course I'm guessing there,
and this problem will involve a lot of trial and error in terms of basic building blocks for
timeseries comparison, but it's a starting point.
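That thresholding idea might be sketched roughly like this (plain Python; the function name, the 0/1 encoding, and the window size are my own assumptions for illustration, not an actual implementation):

```python
def energy_events(samples, window):
    """Emit 1 when the absolute change at a step exceeds every
    absolute change seen over the previous `window` steps, else 0.

    The resulting 0/1 sequence is a crude feature vector that is
    insensitive to vertical (amplitude) offset, since only deltas
    between consecutive samples are compared.
    """
    deltas = [abs(b - a) for a, b in zip(samples, samples[1:])]
    events = []
    for i, d in enumerate(deltas):
        prev = deltas[max(0, i - window):i]
        events.append(1 if prev and d > max(prev) else 0)
    return events

# A sudden jump after a quiet stretch registers as an event,
# and shifting the whole waveform vertically changes nothing.
print(energy_events([5, 5, 5, 9, 9], window=3))        # → [0, 0, 1, 0]
print(energy_events([105, 105, 105, 109, 109], window=3))  # → [0, 0, 1, 0]
```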

Josh Patterson
TVA




-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com] 
Sent: Monday, November 30, 2009 2:54 AM
To: mahout-user@lucene.apache.org
Subject: Re: mahout examples

N-squared isn't all that scalable.

But k-means doesn't require all n^2 distances to be computed.  It just
requires that distances to the centroids be computed.

That is generally vastly less effort, and since the number of clusters is
typically bounded by a constant, it makes k-means a nearly linear
algorithm.
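To make the cost argument concrete, here is a toy assignment step in plain Python (not Mahout code): each of the n points is compared only against the k centroids, so one iteration costs O(n·k) distance computations rather than the O(n²) a full pairwise similarity matrix would require.

```python
import math

def assign(points, centroids):
    """One k-means assignment step: label each point with the
    index of its nearest centroid. Only len(points) * len(centroids)
    distances are computed, never all pairwise point distances."""
    labels = []
    for p in points:
        dists = [math.dist(p, c) for c in centroids]
        labels.append(dists.index(min(dists)))
    return labels

points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
centroids = [(0.0, 0.0), (5.0, 5.0)]
print(assign(points, centroids))  # → [0, 0, 1]
```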

On Sun, Nov 29, 2009 at 11:07 PM, prasenjit mukherjee <
pmukherjee@quattrowireless.com> wrote:

> Thanks  for sharing the article.  The article focuses mainly on
> distance computation between sequences, which will help us in creating
> the self-similarity matrix.  And then you can probably apply any
> standard self-similarity based clustering techniques ( spectral
> clustering or k-means etc. ).
>
> Approach sounds okay, except that k-means requires the n×n matrix to
> be computed, which itself could be pretty huge.  But as long as you can
> distribute it ( which you apparently can ) over mapreduce/mahout it
> should be fine.
>
> -Prasen
>
> On Fri, Nov 27, 2009 at 9:47 PM, Patterson, Josh <jpatterson0@tva.gov>
> wrote:
> > Prasen,
> > I've been reviewing techniques and literature on data mining in time
> > series and I found another paper that you might be interested in from
> > the time series "search" domain that deals with similarity of time
> > series data:
> >
> > http://delab.csd.auth.gr/papers/PCI99am.pdf
> >
> > Sequences are transformed into a feature vector and Euclidean distances
> > between the feature vectors are then calculated. I'm still getting this
> > concept (plus other variations) and mahout in general "mapped out". I
> > read some on suffix trees and they look very similar to k-grams and
> > permutation indexes in Information Retrieval material. I'm still
> > digesting this time series problem (and its several sub problems) but I
> > thought I'd throw that paper out there and see what you thought.
> >
> > Josh Patterson
> > TVA
> >
> > -----Original Message-----
> > From: prasen.bea@gmail.com [mailto:prasen.bea@gmail.com] On Behalf Of
> > prasenjit mukherjee
> > Sent: Saturday, November 21, 2009 12:21 AM
> > To: mahout-user@lucene.apache.org
> > Subject: Re: mahout examples
> >
> > Hi Josh,
> >
> > I too am working on clustering time-series data, and am basically trying
> > to come up with a sequence clustering model. I would like to know how
> > you intend to use k-means to achieve that. Are you treating each
> > sequence as a point? Then what would be your vector representation
> > of a sequence and, more importantly, which metric ( distance
> > computation logic ) will you be using?
> >
> > BTW, I am thinking along the lines of STC ( suffix-tree based clustering
> > ).
> >
> > -Prasen
> >
> > On Sat, Nov 21, 2009 at 1:26 AM, Patterson, Josh <jpatterson0@tva.gov>
> > wrote:
> >> I think in terms of clustering time series data, the first step looks
> >> to be vectorizing the input cases with possibly the DenseVector class
> >> and feeding that to a basic KMeans implementation like
> >> KMeansDriver.java. Once we can get the basic k-means rolling with some
> >> known dataset we'll be able to iterate on that and move towards using
> >> more complex techniques and other grid timeseries data. Any
> >> suggestions or discussion are greatly appreciated,
> >>
> >> Josh Patterson
> >> TVA
> >
>



-- 
Ted Dunning, CTO
DeepDyve