Ted,
Good to see some suggested reading. Today I was also thinking that we could agree
on a canonical time series example data set to reference and test on, since it seems like everyone
has slight variations in what sort of "time series" data they are using. It would also be
easier for me to discuss techniques when using a known example dataset as a common reference
point. When talking about decomposition techniques it would also make things more
consistent, and anyone could work from there to adapt it to their specific needs,
add "secret sauce", etc.
Josh Patterson
TVA
-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com]
Sent: Sun 11/22/2009 1:58 PM
To: mahout-user@lucene.apache.org
Subject: Re: mahout examples
This has connections back to the theoretical literature on dynamical systems
where the term "symbolic dynamics" is used to refer to the investigation of
the sequences of symbols that can be produced by various complex systems
depending on the quantization of the state space. The important takeaway
from that world is that essentially all of the important dynamical
properties of a system can be described by using the appropriate
quantization. Of course, figuring out what the appropriate quantization is
can be a major problem, but for many systems, almost any nontrivial
quantization is useful.
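To make the quantization idea concrete, here is a minimal Python sketch that maps a numeric series onto a small symbol alphabet (the thresholds and symbols are made up for illustration; as noted above, choosing a good quantization is the hard part):

```python
def quantize(series, thresholds, symbols="abcd"):
    """Map each value to the symbol of the first threshold bin it falls into.

    thresholds must be sorted ascending; values above all thresholds get
    the last symbol.
    """
    out = []
    for x in series:
        for i, t in enumerate(thresholds):
            if x <= t:
                out.append(symbols[i])
                break
        else:
            out.append(symbols[len(thresholds)])
    return "".join(out)

series = [0.1, 0.9, 0.4, 0.7, 0.2, 0.95]
print(quantize(series, thresholds=[0.25, 0.5, 0.75]))  # -> "adbcad"
```

Once the series is a symbol string like this, ngram machinery applies to it directly.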
The use of long ngrams also has analogies in state-space embedding of
dynamical systems, where you don't have access to the full state of a system
but instead use repeated measurements to substitute for the full state.
Again, in many cases you can do as much with an embedded state space as you
could with the (unobservable) full state. A great example is the classic
dripping faucet problem (Shaw's paper isn't available, but this follow-up
is: http://metal.elte.hu/botond/pdf/CHA00059.pdf). One of the original
papers on the subject of state-space reconstruction is available online at
http://cos.cumt.edu.cn/jpkc/dxwl/zl/zl1/Physical%20Review%20Classics/statistical/132.pdf
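As a rough sketch of that embedding idea, here is a delay-coordinate (Takens-style) embedding in Python, where lagged repeated measurements stand in for the full state; the `dim` and `lag` values are arbitrary illustration choices:

```python
def delay_embed(series, dim, lag):
    """Build delay-coordinate vectors
    [x_t, x_{t+lag}, ..., x_{t+(dim-1)*lag}] from a scalar series."""
    n = len(series) - (dim - 1) * lag
    return [tuple(series[t + k * lag] for k in range(dim)) for t in range(n)]

x = [0, 1, 2, 3, 4, 5]
print(delay_embed(x, dim=3, lag=2))  # -> [(0, 2, 4), (1, 3, 5)]
```

Each tuple plays the role of a reconstructed state vector, much as an ngram window plays the role of local context in text.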
The point here is that ngram techniques for time series are really not at
all far-fetched.
On Sun, Nov 22, 2009 at 9:07 AM, Jake Mannix <jake.mannix@gmail.com> wrote:
> While I'll add the caveat that I haven't done this for temporal data
> (other than the case Ted is referring to, where text technically has some
> temporal nature to it by its sequentiality), doing this kind of thing with
> "significant ngrams" as Ted describes can let you keep arbitrarily
> higher-order correlations if you do it in a randomized SVD: instead of
> just keeping the interesting bigrams, keep *all* ngrams up to some fixed
> size (even as large as 5, say), then do a random projection on your
> bag-of-ngrams to map it down from the huge numUniqueSymbols^5 dimensional
> space (well, technically this probably overflows sizeof(long), so you are
> probably wrapping around mod some big prime close to 2^64, but collisions
> will still be rare and will just act as noise) down to some reasonable
> space still larger than you think is necessary (maybe 1k-10k), *then* do
> the SVD there.
>
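A rough sketch of that pipeline in Python/NumPy, using feature hashing as a cheap stand-in for the random projection step (the 1024-dimension target, the MD5 hash, and the toy symbol sequences are all made up for illustration):

```python
import hashlib
import numpy as np

D = 1024  # projected dimension; in practice something in the 1k-10k range


def ngrams(symbols, max_n=5):
    """Yield all ngrams up to length max_n from a symbol sequence."""
    for n in range(1, max_n + 1):
        for i in range(len(symbols) - n + 1):
            yield " ".join(symbols[i:i + n])


def hashed_vector(symbols, dim=D):
    """Hash every ngram into a fixed-size count vector (wrapping mod dim);
    collisions are rare relative to dim and just act as noise."""
    v = np.zeros(dim)
    for g in ngrams(symbols):
        h = int(hashlib.md5(g.encode()).hexdigest(), 16)
        v[h % dim] += 1.0
    return v


docs = [list("abcabd"), list("abcabc"), list("dddabc")]
X = np.array([hashed_vector(d) for d in docs])          # rows = sequences
U, s, Vt = np.linalg.svd(X, full_matrices=False)        # SVD in hashed space
print(U.shape, s.shape)  # low-rank factors over the 1024-dim hashed space
```

The SVD then operates on the small hashed space instead of the intractable numUniqueSymbols^5 one.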

Ted Dunning, CTO
DeepDyve
