I believe Ted is correct: the distance function is the key with k-means.
Another suggestion I've heard for comparing similarity between time series is to compare
"energy", or something along the lines of CUSUM:
http://www.variation.com/cpa/help/hs108.htm
http://www.itl.nist.gov/div898/handbook/pmc/section3/pmc323.htm
http://www.public.iastate.edu/~vardeman/asqcourse/cusumcharts.pdf
http://www.minitab.com/uploadedFiles/Shared_Resources/Documents/Articles/cumulative_sum_charts_for_small_shifts.pdf
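The charts in those links boil down to accumulating deviations from a target value until the running sum crosses a decision limit. A minimal one-sided sketch in Python, just to show the mechanics (the slack value k and decision limit h here are arbitrary tuning parameters, not anything from those references):

```python
def cusum(samples, target, k, h):
    """One-sided (upper) CUSUM: accumulate deviations above target + k;
    record the index wherever the running sum exceeds decision limit h."""
    s, alarms = 0.0, []
    for i, x in enumerate(samples):
        s = max(0.0, s + (x - target - k))
        if s > h:
            alarms.append(i)
            s = 0.0  # reset after signalling a shift
    return alarms
```

A small upward level shift in the input shows up as an alarm a few samples after the shift begins, which is the "small persistent change" behavior those pages describe.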
Waveforms could be broken down into sequences (or even n-grams) of units of energy change,
which might simplify the comparison calculation somewhat. I've used a similar technique (task
response thresholds) when working with Ant Colony mechanics:
http://jpatterson.floe.tv/index.php/2008/09/25/discoveryandsocialinsects/
http://jpatterson.floe.tv/index.php/2009/07/19/tinyosandtinytermite/
in that enough change over a time period triggers a task-change / response function. Here
I think it's simply "more energy change than the previous N samples triggers an output", which
ends up being a feature vector. I believe that one could see an "energy pattern" derived from
a waveform, and that could possibly indicate an aspect of similarity. I think this would alleviate
the vertical shift problem, and then you could focus on the horizontal shift issue, which has
been dealt with in IR by using various forms of n-gram techniques, intersection of forward
and reverse b-trees for wildcard sequences / comparisons, etc. Of course I'm guessing there,
but this problem will involve a lot of trial and error in terms of basic building blocks for
time-series comparison, and it's a starting point.
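To make that thresholding idea concrete, a rough Python sketch (the function name, window size, and "mean of previous N changes" baseline are my own choices, purely illustrative): emit a 1 whenever a sample's absolute change exceeds the mean absolute change of the previous N samples, giving a binary feature vector. Since it operates on differences, a constant vertical offset in the waveform doesn't change the output.

```python
def energy_events(samples, n=4):
    """Emit 1 where the absolute change at a sample exceeds the mean
    absolute change over the previous n samples, else 0."""
    deltas = [abs(b - a) for a, b in zip(samples, samples[1:])]
    events = []
    for i, d in enumerate(deltas):
        window = deltas[max(0, i - n):i]
        baseline = sum(window) / len(window) if window else 0.0
        events.append(1 if d > baseline else 0)
    return events
```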
Josh Patterson
TVA
-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com]
Sent: Monday, November 30, 2009 2:54 AM
To: mahout-user@lucene.apache.org
Subject: Re: mahout examples
N-squared isn't all that scalable.
But k-means doesn't require all n^2 distances to be computed. It just
requires that distances to the centroids be computed.
That is generally vastly less effort, and since the number of clusters is
typically bounded by a constant, it makes k-means into a nearly linear
algorithm.
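To illustrate the arithmetic: the assignment step needs only n*k distance computations per iteration, never the full n^2 pairwise matrix. A toy sketch of that step (not Mahout's actual implementation; names are illustrative):

```python
import math

def assign_to_centroids(points, centroids):
    """One k-means assignment step: computes len(points) * len(centroids)
    distances, versus n^2 for a full pairwise self-similarity matrix."""
    def dist(p, c):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, c)))
    return [min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            for p in points]
```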
On Sun, Nov 29, 2009 at 11:07 PM, prasenjit mukherjee <
pmukherjee@quattrowireless.com> wrote:
> Thanks for sharing the article. The article focuses mainly on
> distance computation between sequences, which will help us in creating
> the self-similarity matrix. And then you can probably apply any
> standard self-similarity-based clustering technique (spectral
> clustering, k-means, etc.).
>
> The approach sounds okay, except that k-means requires the n x n matrix to
> be computed, which itself could be pretty huge. But as long as you can
> distribute it (which you apparently can) over MapReduce/Mahout it
> should be fine.
>
> Prasen
>
> On Fri, Nov 27, 2009 at 9:47 PM, Patterson, Josh <jpatterson0@tva.gov>
> wrote:
> > Prasen,
> > I've been reviewing techniques and literature on data mining in time
> > series and I found another paper that you might be interested in from
> > the time series "search" domain that deals with similarity of time
> > series data:
> >
> > http://delab.csd.auth.gr/papers/PCI99am.pdf
> >
> > Sequences are transformed into a feature vector and Euclidean distances
> > between the feature vectors are then calculated. I'm still getting this
> > concept (plus other variations) and Mahout in general "mapped out". I
> > read some on suffix trees, and they look very similar to k-grams and
> > permutation indexes in Information Retrieval material. I'm still
> > digesting this time series problem (and its several sub problems) but I
> > thought I'd throw that paper out there and see what you thought.
> >
> > Josh Patterson
> > TVA
> >
> > -----Original Message-----
> > From: prasen.bea@gmail.com [mailto:prasen.bea@gmail.com] On Behalf Of
> > prasenjit mukherjee
> > Sent: Saturday, November 21, 2009 12:21 AM
> > To: mahout-user@lucene.apache.org
> > Subject: Re: mahout examples
> >
> > Hi Josh,
> >
> > I too am working on clustering time-series data, basically trying
> > to come up with a sequence clustering model. I'd like to know how
> > you intend to use k-means to achieve that. Are you treating each
> > sequence as a point? Then, what would be your vector representation
> > of a sequence, and, more importantly, which metric (distance
> > computation logic) will you be using?
> >
> > BTW, I am thinking along the lines of STC (suffix-tree-based
> > clustering).
> >
> > Prasen
> >
> > On Sat, Nov 21, 2009 at 1:26 AM, Patterson, Josh <jpatterson0@tva.gov>
> > wrote:
> >> I think in terms of clustering time-series data, the first step looks to
> >> be vectorizing the input cases, possibly with the DenseVector class, and
> >> feeding that to a basic k-means implementation like KMeansDriver.java.
> >> Once we can get basic k-means rolling with some known dataset, we'll
> >> be able to iterate on that and move towards using more complex
> >> techniques and other grid time-series data. Any suggestions or discussion
> >> is greatly appreciated,
> >>
> >> Josh Patterson
> >> TVA
> >
>

Ted Dunning, CTO
DeepDyve
