mahout-user mailing list archives

From "Patterson, Josh" <>
Subject RE: mahout examples
Date Sun, 22 Nov 2009 21:39:02 GMT

Good to see some suggested reading. Today I was also thinking that we could agree
on a canonical time series example dataset to reference and test on, since it seems everyone
has slight variations in the sort of "time series" data they are using. It would also be
easier for me to discuss techniques with a known example dataset as a common reference
point. When talking about decomposition techniques it would make the discussion more
consistent as well, and anyone could then adapt it to their specific needs,
add "secret sauce", etc.

Josh Patterson

-----Original Message-----
From: Ted Dunning []
Sent: Sun 11/22/2009 1:58 PM
Subject: Re: mahout examples
This has connections back to the theoretical literature on dynamical systems
where the term "symbolic dynamics" is used to refer to the investigation of
the sequences of symbols that can be produced by various complex systems
depending on the quantization of the state space.  The important take-away
from that world is that essentially all of the important dynamical
properties of a system can be described by using the appropriate
quantization.  Of course, figuring out what the appropriate quantization is
can be a major problem, but for many systems, almost any non-trivial
quantization is useful.
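To make the idea concrete, here is a minimal sketch of quantizing a real-valued series into symbols and then counting the resulting n-grams (the "words" of the symbolic dynamics). The equal-width binning and the noisy sine wave are illustrative assumptions, not a recommendation for any particular dataset:

```python
import numpy as np
from collections import Counter

def quantize(series, n_bins=4):
    """Map a real-valued series to a symbol sequence via equal-width bins.

    The bin edges here are a deliberately simple quantization; for many
    systems almost any non-trivial choice is already informative.
    """
    edges = np.linspace(series.min(), series.max(), n_bins + 1)[1:-1]
    return np.digitize(series, edges)  # symbols in 0 .. n_bins-1

# A noisy sine wave as a stand-in for real measurement data.
t = np.linspace(0, 8 * np.pi, 400)
symbols = quantize(np.sin(t) + 0.1 * np.random.randn(400))

# Count symbol 3-grams; their distribution characterizes the dynamics.
trigrams = Counter(zip(symbols, symbols[1:], symbols[2:]))
```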

The use of long n-grams also has analogies in state-space embedding of
dynamical systems, where you don't have access to the full state of a system
but instead use repeated measurements to substitute for the full state.
Again, in many cases you can do as much with an embedded state-space as you
could with the (unobservable) full state.  A great example is the classic
dripping faucet problem (Shaw's paper isn't available, but this follow-up
is:  One of the original
papers on the subject of state-space reconstruction is available on-line at

The point here is that n-gram techniques for time series are really not at
all far-fetched.
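A delay embedding of this kind is easy to sketch: each row stacks the current measurement with lagged copies of itself, substituting repeated observations for the unobservable full state. The dimension and lag values below are arbitrary illustrative choices:

```python
import numpy as np

def delay_embed(x, dim=3, lag=1):
    """Takens-style delay embedding.

    Row i is (x[i], x[i+lag], ..., x[i+(dim-1)*lag]); the embedded points
    stand in for the full (unobserved) state of the system.
    """
    n = len(x) - (dim - 1) * lag
    return np.column_stack([x[i * lag: i * lag + n] for i in range(dim)])

x = np.sin(np.linspace(0, 20, 200))   # a single observable
emb = delay_embed(x, dim=3, lag=5)    # 190 embedded 3-D state points
```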

On Sun, Nov 22, 2009 at 9:07 AM, Jake Mannix <> wrote:

> While I'll add the caveat that I haven't done this for temporal data
> (other than the case Ted is referring to, where text technically has
> some temporal nature to it by its sequentiality), doing this kind of
> thing with "significant ngrams" as Ted describes can let you keep
> arbitrarily higher-order correlations if you do it with a randomized
> SVD: instead of just keeping the interesting bi-grams, keep *all*
> ngrams up to some fixed size (even as large as 5, say), then do a
> random projection on your bag-of-ngrams to map it down from the huge
> numUniqueSymbols^5 dimensional space (technically this probably
> overflows sizeof(long), so you are probably wrapping around mod some
> big prime close to 2^64, but collisions will still be rare and will
> just act as noise) down to some reasonable space still larger than you
> think is necessary (maybe 1k-10k), *then* do the SVD there.
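The pipeline Jake outlines can be sketched as follows. This is a toy illustration, not Mahout code: the hashed vector width, the projection dimension, and the use of Python's built-in `hash` in place of a mod-big-prime scheme are all assumptions made for brevity (hash collisions here play the same "noise" role he describes):

```python
import numpy as np

def hashed_ngram_vector(symbols, max_n=5, dim=4096):
    """Bag of all 1..max_n grams, hashed into a fixed-width vector.

    Collisions from the hashing are rare relative to the counts and
    just act as noise, as noted above.
    """
    vec = np.zeros(dim)
    for n in range(1, max_n + 1):
        for i in range(len(symbols) - n + 1):
            vec[hash(tuple(symbols[i:i + n])) % dim] += 1.0
    return vec

def random_project(vectors, out_dim=16, seed=0):
    """Gaussian random projection down to a modest dimension."""
    rng = np.random.default_rng(seed)
    r = rng.normal(size=(vectors.shape[1], out_dim)) / np.sqrt(out_dim)
    return vectors @ r

# Two toy symbol sequences standing in for quantized time series.
docs = [[0, 1, 2, 1, 0, 3] * 10, [3, 2, 1, 0] * 15]
X = np.vstack([hashed_ngram_vector(d) for d in docs])
Y = random_project(X)                     # project, *then*...
u, s, vt = np.linalg.svd(Y, full_matrices=False)  # ...do the SVD there
```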

Ted Dunning, CTO
