mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Beginner questions on clustering & M/R
Date Thu, 15 Jul 2010 16:23:36 GMT
Clustering of time series data is usually better done in an abstract,
relatively low-dimensional coordinate space based on some transform, such as a
locality-sensitive frequency transform.  Gabor transforms might be
appropriate.

You might be able to get away with something like an SVD of your daily
change data.
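
To make the SVD suggestion concrete, here is a minimal sketch in plain Python. It assumes one row of closing prices per stock, turns each row into daily fractional changes, and approximates the leading singular vectors by power iteration with deflation (a stand-in for a real SVD routine, chosen only to keep the example dependency-free). The ticker names, the prices, and the function names are all made up for illustration; nothing here comes from the mldata set or from Mahout.

```python
import math
import random

def daily_changes(prices):
    """Convert one price series into fractional day-over-day changes."""
    return [(b - a) / a for a, b in zip(prices, prices[1:])]

def top_singular_triplets(rows, k, iters=500, seed=0):
    """Approximate the k largest singular values and right singular vectors
    of `rows` by power iteration on A^T A, deflating after each component."""
    rng = random.Random(seed)
    n_cols = len(rows[0])
    work = [list(r) for r in rows]  # deflated working copy of the matrix
    found = []
    for _ in range(k):
        v = [rng.uniform(-1.0, 1.0) for _ in range(n_cols)]
        for _ in range(iters):
            # u = A v, then w = A^T u: one power-iteration step on A^T A
            u = [sum(a * b for a, b in zip(row, v)) for row in work]
            w = [sum(row[j] * u[i] for i, row in enumerate(work))
                 for j in range(n_cols)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break  # nothing left to extract from the deflated matrix
            v = [x / norm for x in w]
        norm_v = math.sqrt(sum(x * x for x in v))
        v = [x / norm_v for x in v]
        u = [sum(a * b for a, b in zip(row, v)) for row in work]
        sigma = math.sqrt(sum(x * x for x in u))  # |A v| is the singular value
        found.append((sigma, v))
        # Deflate: A <- A - (A v) v^T removes the component just found
        work = [[row[j] - u[i] * v[j] for j in range(n_cols)]
                for i, row in enumerate(work)]
    return found

def embed(rows, components):
    """Project each row onto the right singular vectors: one point per stock."""
    return [[sum(a * b for a, b in zip(row, v)) for _, v in components]
            for row in rows]

# Hypothetical closing prices, one row per stock (illustrative values only).
prices = {
    "BANK_A": [100.0, 102.0, 101.0, 105.0, 104.0],
    "BANK_B": [50.0, 51.0, 50.4, 52.5, 52.0],
    "UTIL_A": [30.0, 29.7, 30.3, 30.0, 30.6],
    "UTIL_B": [80.0, 79.0, 81.0, 80.2, 81.5],
}
changes = [daily_changes(p) for p in prices.values()]
components = top_singular_triplets(changes, k=2)
coords = embed(changes, components)
for name, point in zip(prices, coords):
    print(name, [round(x, 4) for x in point])
```

Each stock ends up as a 2-dimensional point, and those points (rather than the raw series) are what you would hand to a clustering algorithm. For data of any real size you would use a proper SVD implementation instead of this loop.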

On Thu, Jul 15, 2010 at 7:51 AM, Florent Empis <florent.empis@gmail.com> wrote:

> Hi,
>
> I want to learn more on clustering techniques. I have skimmed through
> Programming Collective Intelligence and Mahout in Action in the past but I
> don't have them on hand at the moment... :(
> I've seen Isabel Drost's mail about test data on http://mldata.org/about/
> I've had an idea of using http://mldata.org/repository/view/stockvalues/ for
> a pet project.
> My idea is as follows: can we see a common behaviour between companies'
> stock values?
> I would expect to end up with clusters of banking-sector shares, utilities
> shares, media, etc., and maybe some more unexpected clusters, who knows?
>
> My idea is basically:
> 1°)Transform the dataset from values to daily variation as percentage
> drop/raise (data is then normalized)
> 2°)Apply clustering technique(s)
>
> The issue may seem silly but as I understand it, clustering happens in a
> space of 2 (or more) dimensions.
> I know I have 2 dimensions: variation and time, but I can't wrap my head
> around the problem...
>
> I *think* that the K-Means example does exactly what I intend to do in my
> second step, is this correct?
> However, I can't grasp what the 2-dimensional display represents exactly:
> what are the x and y axes?
>
> Added question: I am fairly new to the M/R paradigm, but let's say I would
> like to do step 1 (data normalization) in a M/R fashion. Would the
> following be a good idea:
> My data is a matrix of k stock values S in n intervals of time.
> I call the first stock in the file, first and second period:
> S1,t & S1,t+1 ...
>
> Map Step: input: ((S1,t ... S1,t+n),... ,(Sk,t ... Sk,t+n) )
> output (( (S1,t;S1,t+1),...,(S1,t+n-1;S1,t+n)), ... ,(
> (Sk,t;Sk,t+1),...,(Sk,t+n-1;Sk,t+n)) )
> Reduce Step:
> ( (%S1,t+1.....%S1,t+n), ...,(%Sk,t+1.....%Sk,t+n))
>
> I apologize for my beginner's questions but.... everyone has to start
> somewhere :-)
>
> BR,
>
> Florent Empis
>
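
The map and reduce steps quoted above can be sketched locally as follows. This is only an illustration in plain Python, not Hadoop or Mahout code: `mapper`, `reducer`, and `run_local` are hypothetical names, `run_local` merely imitates the group-by-key shuffle a real framework would do, and the stock ids and prices are invented. The time index is carried in each record because a shuffle does not promise to deliver values in time order.

```python
from collections import defaultdict

def mapper(stock_id, series):
    """Emit one (key, value) record per pair of consecutive observations."""
    for t in range(1, len(series)):
        yield stock_id, (t, series[t - 1], series[t])

def reducer(stock_id, values):
    """Turn the pairs for one stock into an ordered list of percent changes."""
    ordered = sorted(values)  # restore time order using the index t
    return stock_id, [100.0 * (cur - prev) / prev for _, prev, cur in ordered]

def run_local(table):
    """Stand-in for the framework: group mapper output by key, then reduce."""
    grouped = defaultdict(list)
    for stock_id, series in table.items():
        for key, value in mapper(stock_id, series):
            grouped[key].append(value)
    return dict(reducer(key, vals) for key, vals in grouped.items())

# Hypothetical input: one price series per stock id.
table = {"S1": [100.0, 110.0, 99.0], "S2": [50.0, 50.0, 55.0]}
variations = run_local(table)
print(variations)
```

Since each stock's series is processed independently of the others, the percent changes could also be computed directly in the mapper, which would make this a map-only job with no reduce step at all.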
