mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Modelling typed vectors?
Date Wed, 13 Oct 2010 06:07:39 GMT
On Tue, Oct 12, 2010 at 5:30 PM, Lance Norskog <goksron@gmail.com> wrote:

> This use case is doing Random Projection with "paired vectors". Look up
> 'semantic vectors' for an explanation.
>

Even so, I think that there is another way to do this by just keeping an id
on each vector.

In random projection, it is common to use a random matrix whose elements
can be regenerated at will.  This allows us to avoid actually transferring
the elements of the random matrix and makes it possible to use pieces of
the same random matrix in different places.
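As a sketch of that regeneration idea in Python (the hashing scheme here is made up for illustration; it is not Mahout's actual implementation):

```python
import numpy as np

def projection_element(seed, row, col):
    """Regenerate one element of the random projection matrix on demand
    from (seed, row, col) instead of storing or shipping the matrix.
    Hypothetical scheme: derive a per-element RNG from the indices."""
    rng = np.random.default_rng(hash((seed, row, col)) & 0x7FFFFFFF)
    return rng.standard_normal()

def project(vec, seed, out_dim):
    """Project a dense vector without ever materializing the full matrix."""
    n = len(vec)
    return np.array([
        sum(projection_element(seed, r, c) * vec[c] for c in range(n))
        for r in range(out_dim)
    ])

v = np.array([1.0, 2.0, 3.0])
p1 = project(v, seed=42, out_dim=2)
p2 = project(v, seed=42, out_dim=2)
# The same seed regenerates identical matrix elements anywhere,
# so two independent machines would compute the same projection.
assert np.allclose(p1, p2)
```

The point is that the seed, not the matrix, is the only thing that needs to travel between jobs.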

If you want others to comment on your detailed use case, it would help if
you could explain it more fully here.  I don't see any real
need for a payload in my understanding of paired random indexing.

My pipeline is three different M/R jobs in sequence with three different
> semantics for the output vectors. The payload has to be included in all
> three output sets. So I really do want a good vector I/O toolkit.
>

Copying this payload doesn't necessarily make sense as I pointed out
previously.  If it does, and you don't need to pass vector + payload through
normal Mahout code, then it is pretty trivial for you to devise your own
writable data structure as Sean suggested.
Your data structure can include a VectorWritable along with anything else
you like.


> p.s. If you understand the math of why 2 flat-distribution random numbers
> added create a pyramidal distribution, please write. I'm attempting to
> reverse this effect. goksron@gmail.com
>

This is a consequence of the central limit theorem.  The distribution of a
sum of random variables drawn independently from the same base distribution
with finite variance will tend to the normal distribution with variance
equal to the base variance multiplied by the number of elements being
summed.

The convergence is very quick.  In fact, the sum of 12 uniform [-0.5, 0.5]
deviates was often used in the dark ages
(aka the golden years) of computing as a way to quickly generate a unit
normal deviate.  The cumulative distribution of
such a sum is a piecewise 12th order polynomial that tracks the normal
distribution very closely.
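A quick sanity check of that old trick in Python (names here are illustrative, not from any library):

```python
import random

def approx_normal(rng):
    """Classic approximation: the sum of 12 Uniform[-0.5, 0.5] deviates
    is roughly N(0, 1), since each uniform has variance 1/12."""
    return sum(rng.uniform(-0.5, 0.5) for _ in range(12))

rng = random.Random(0)
samples = [approx_normal(rng) for _ in range(100_000)]

# Empirical mean and variance should land very near 0 and 1.
mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)
```

With 100,000 draws the empirical mean and variance sit within a few thousandths of the theoretical 0 and 1.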

I will put up a more detailed explanation on my blog where I can draw pretty
pictures and write mathematical notation, but the crux of the argument is
that if you are adding two uniform random variables x and y, then the
region of non-zero probability is the square [0,1] x [0,1].  For a given
value of x + y = z, there is a diagonal line within that square where that
value holds.  Where z <= 0 or z >= 2 that intersection vanishes, and for
0 < z < 2 the intersection varies in length.  The probability of the sum
having some particular value z is proportional to the length of that
intersection.  As you can imagine, the intersection varies in size linearly
and reaches a maximum where z = 1.
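The resulting triangular ("pyramidal") density is easy to confirm empirically. Here is a small Python check against the exact probabilities implied by the pdf f(z) = z on [0,1] and f(z) = 2 - z on [1,2]:

```python
import random

rng = random.Random(1)
n = 200_000
sums = [rng.random() + rng.random() for _ in range(n)]

# Exact triangular-pdf probabilities for three regions of [0, 2]:
#   P(z < 0.5)          = 0.125
#   P(0.75 <= z < 1.25) = 0.4375   (mass piles up near the peak z = 1)
#   P(z >= 1.5)         = 0.125
low = sum(1 for z in sums if z < 0.5) / n
mid = sum(1 for z in sums if 0.75 <= z < 1.25) / n
high = sum(1 for z in sums if z >= 1.5) / n
```

The empirical fractions match those exact values to well under a percent, and the symmetry of `low` and `high` reflects the symmetry of the triangle.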

For the sum of three random variables, we now have the intersection of a
cubical region with a plane and the probability is
proportional to the area of that intersection.  This takes on a more complex
form than with two variables, being composed of
regions with a quadratic form depending on whether we are near the ends or
the middle of the cube.

As to your question of how to get rid of the non-uniformity, it is almost
always a bad idea to try to eliminate this with random projections.

Much better is to simply use a normal distribution instead of a uniform
distribution.  There are several reasons for this.

First, the sum of two normally distributed variables is also normally
distributed since the normal distribution is the fixed
point for random variables under addition.  This means that you don't have
to worry about what the distribution of your sums
will be; you already know.
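A short Python check of that stability property (illustrative only):

```python
import random

rng = random.Random(2)
n = 100_000

# The sum of two independent N(0, 1) draws is exactly N(0, 2) --
# no distributional drift to track, unlike sums of uniforms.
sums = [rng.gauss(0, 1) + rng.gauss(0, 1) for _ in range(n)]

mean = sum(sums) / n
var = sum((x - mean) ** 2 for x in sums) / n
```

The sample mean stays near 0 and the sample variance near 2, exactly the parameters predicted by closure under addition.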

Second, if you are dealing with random projections, then the distribution
of the sum of products of random variables becomes
very important.  With the normal distribution, you can pretty easily
determine what this distribution is.  If you started with uniform
distributions, you would have a much harder time of it and have to resort to
approximation by normal distributions.

Some people think that random projections should be entirely composed of
positive values.  A better way to do this would be to
use a log-linear (soft-max) link function to project R^n into the positive
orthant.
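A minimal sketch of such a link function in Python (this is the standard soft-max, not any particular Mahout API):

```python
import math

def softmax(v):
    """Log-linear (soft-max) link: maps any vector in R^n into the
    positive orthant (onto the probability simplex, in fact)."""
    m = max(v)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in v]
    s = sum(exps)
    return [e / s for e in exps]

out = softmax([-3.0, 0.0, 2.5])
# Every component is strictly positive regardless of input signs,
# and the components sum to 1.
assert all(x > 0 for x in out)
```

Negative inputs are squashed toward (but never reach) zero, which is what makes this a clean way to force positivity.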
