mahout-user mailing list archives

From Lance Norskog <goks...@gmail.com>
Subject Re: Dimensional Reduction via Random Projection: investigations
Date Sun, 03 Jul 2011 21:23:39 GMT
I whipped up a MurmurHash-based Random and the Gaussian plot is much
cleaner. The MH version is exactly as fast as java.util.Random:
j.u.R makes a new int in every cycle, while MH makes a new long and so
does half as many cycles.
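
Roughly what I mean, as a minimal sketch (the class and field names here
are made up, and the mixer is the generic MurmurHash3 fmix64 finalizer
rather than the exact Mahout code):

import java.util.Random;

// Sketch of a Random whose bits come from the MurmurHash3 64-bit
// finalizer applied to a counter.  Each hash yields 64 bits, which
// serve two 32-bit next() calls - the "half as many cycles" above.
public class MurmurRandom extends Random {
  private long counter;        // hashed to produce each new 64-bit block
  private long buffer;         // most recent hashed block
  private boolean lowHalfPending;

  public MurmurRandom(long seed) {
    this.counter = seed;
  }

  // MurmurHash3 fmix64 finalizer (the avalanche step).
  private static long mix64(long k) {
    k ^= k >>> 33;
    k *= 0xff51afd7ed558ccdL;
    k ^= k >>> 33;
    k *= 0xc4ceb9fe1a85ec53L;
    k ^= k >>> 33;
    return k;
  }

  @Override
  protected int next(int bits) {
    int value;
    if (lowHalfPending) {
      value = (int) buffer;             // low half of the previous block
      lowHalfPending = false;
    } else {
      buffer = mix64(counter++);        // one hash, 64 fresh bits
      value = (int) (buffer >>> 32);    // high half now, low half next call
      lowHalfPending = true;
    }
    return value >>> (32 - bits);
  }
}

nextGaussian(), nextDouble(), etc. all route through next(), so plotting
samples from new MurmurRandom(42).nextGaussian() exercises exactly this path.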

On Sun, Jul 3, 2011 at 12:12 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
> I wasn't thinking when I typed that post.  An orthonormal projection always
> preserves distances since it is just a generalized reflection/rotation.
> Preserving all dot products (including each vector with itself) also implies
> distances are preserved because |x-y|_2^2 = x \cdot x - 2 x \cdot y + y \cdot y.
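
Spelled out (just restating the identity above; V, x, and y are notation for
this note, with V having orthonormal columns and x, y lying in its column
space, which is what the low-rank demo further down the thread arranges):

    (V'x) \cdot (V'y) = x' V V' y = x \cdot y
    |x - y|_2^2 = x \cdot x - 2 x \cdot y + y \cdot y

so a map that preserves every dot product, including each vector with itself,
preserves every pairwise Euclidean distance as well.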
>
> On Sat, Jul 2, 2011 at 10:15 PM, Lance Norskog <goksron@gmail.com> wrote:
>
>> All pairwise distances are preserved? There must be a qualifier on that
>> for pairwise-distance algorithms.
>>
>> On Sat, Jul 2, 2011 at 6:49 PM, Lance Norskog <goksron@gmail.com> wrote:
>> > Cool. The plots are fun. The way Gaussian spots project into spinning
>> > chains is very educational about entropy.
>> >
>> > For full Random Projection, a lame random number generator
>> > (java.util.Random) will produce a higher standard deviation than a
>> > high-quality MurmurHash-based one.
>> >
>> > On Fri, Jul 1, 2011 at 5:25 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>> >> Here is R code that demonstrates what I mean by stunning (aka 15
>> >> significant figures).  Note that I only check dot products here.  From
>> >> the fact that the final transform is orthonormal we know that all
>> >> distances are preserved.
>> >>
>> >> # make a big random matrix with rank 20
>> >>> a = matrix(rnorm(20000), ncol=20) %*% matrix(rnorm(20000), nrow=20);
>> >>> dim(a)
>> >> [1] 1000 1000
>> >> # random projection
>> >>> y = a %*% matrix(rnorm(30000), ncol=30);
>> >> # get basis for y
>> >>> q = qr.Q(qr(y))
>> >>> dim(q)
>> >> [1] 1000   30
>> >> # re-project b, do svd on result
>> >>> b = t(q) %*% a
>> >>> v = svd(b)$v
>> >>> d = svd(b)$d
>> >> # note how singular values drop like a stone at index 21
>> >>> plot(d)
>> >>> dim(v)
>> >> [1] 1000   30
>> >> # finish svd just for fun
>> >>> u = q %*% svd(b)$u
>> >>> dim(u)
>> >> [1] 1000   30
>> >> # u is orthogonal, right?
>> >>> diag(t(u)%*% u)
>> >>  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
>> >> # and u diag(d) v' reconstructs a very precisely, right?
>> >>> max(abs(a-u %*% diag(d) %*% t(v)))
>> >> [1] 1.293188e-12
>> >>
>> >> # now project a into the reduced dimensional space
>> >>> aa = a%*%v
>> >>> dim(aa)
>> >> [1] 1000   30
>> >> # check a few dot products
>> >>> sum(aa[1,] %*% aa[2,])
>> >> [1] 6835.152
>> >>> sum(a[1,] %*% a[2,])
>> >> [1] 6835.152
>> >>> sum(a[1,] %*% a[3,])
>> >> [1] 3337.248
>> >>> sum(aa[1,] %*% aa[3,])
>> >> [1] 3337.248
>> >>
>> >> # wow, that's close.  let's try a hundred dot products
>> >>> dot1 = rep(0,100);dot2 = rep(0,100)
>> >>> for (i in 1:100) {dot1[i] = sum(a[1,] * a[i,]); dot2[i] = sum(aa[1,] * aa[i,])}
>> >>
>> >> # how close to the same are those?
>> >>> max(abs(dot1-dot2))
>> >> # VERY
>> >> [1] 3.45608e-11
>> >>
>> >>
>> >>
>> >> On Fri, Jul 1, 2011 at 4:54 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>> >>
>> >>> Yes.  Been there.  Done that.
>> >>>
>> >>> The correlation is stunningly good.
>> >>>
>> >>>
>> >>> On Fri, Jul 1, 2011 at 4:22 PM, Lance Norskog <goksron@gmail.com> wrote:
>> >>>
>> >>>> Thanks!
>> >>>>
>> >>>> Is this true? - "Preserving pairwise distances" means the relative
>> >>>> distances. So the ratios of new distance to old distance should be
>> >>>> similar. The standard deviation of the ratios gives a rough-and-ready
>> >>>> measure of the fidelity of the reduction. The standard deviation of
>> >>>> simple RP should be highest, then this RP + orthogonalization, then
>> >>>> MDS.
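
As a formula, the proposed check (d, d', and r are notation for this note:
the original and reduced-space distances and their ratios):

    r_ij = d'(x_i, x_j) / d(x_i, x_j),    fidelity = sd({ r_ij : i < j })

A perfectly distance-preserving reduction gives every r_ij = 1 and a standard
deviation of 0; a larger spread means lower fidelity.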
>> >>>>
>> >>>> On Fri, Jul 1, 2011 at 11:03 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
>> >>>> > Lance,
>> >>>> >
>> >>>> > You would get better results from the random projection if you did
>> >>>> > the first part of the stochastic SVD.  Basically, you do the random
>> >>>> > projection:
>> >>>> >
>> >>>> >       Y = A \Omega
>> >>>> >
>> >>>> > where A is your original data, \Omega is the random matrix, and Y is
>> >>>> > the result.  Y will be tall and skinny.
>> >>>> >
>> >>>> > Then, find an orthogonal basis Q of Y:
>> >>>> >
>> >>>> >       Q R = Y
>> >>>> >
>> >>>> > This orthogonal basis will be very close to the orthogonal basis of
>> >>>> > A.  In fact, there are strong probabilistic guarantees on how good Q
>> >>>> > is as a basis of A.  Next, you project A using the transpose of Q:
>> >>>> >
>> >>>> >       B = Q' A
>> >>>> >
>> >>>> > This gives you a short fat matrix that is the projection of A into
>> >>>> > a lower-dimensional space.  Since this is a left projection, it
>> >>>> > isn't quite what you want in your work, but it is the standard way
>> >>>> > to phrase things.  The exact same thing can be done with left
>> >>>> > random projection:
>> >>>> >
>> >>>> >      Y = \Omega A
>> >>>> >      L Q = Y
>> >>>> >      B = A Q'
>> >>>> >
>> >>>> > In this form, B is tall and skinny as you would like, and Q' is
>> >>>> > essentially an orthogonal reformulation of the random projection.
>> >>>> > This projection is about as close as you are likely to get to
>> >>>> > something that exactly preserves distances.  As such, you should be
>> >>>> > able to use MDS on B to get exactly the same results as you want.
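
For bookkeeping, the shapes in that left-projected form (n, d, and k are
labels introduced here: n data points, d original dimensions, k the reduced
dimension):

    A in R^{n x d},   \Omega in R^{k x n},   Y = \Omega A in R^{k x d}
    Y = L Q   with   Q in R^{k x d},   Q Q' = I_k
    B = A Q' in R^{n x k}

so B keeps one row per data point and has only k columns, which is the
tall-and-skinny shape wanted above.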
>> >>>> >
>> >>>> > Additionally, if you start with the original form and do an SVD of
>> >>>> > B (which is fast), you will get a very good approximation of the
>> >>>> > prominent right singular vectors of A.  If you do that, the first
>> >>>> > few of these should be about as good as MDS for visualization
>> >>>> > purposes.
>> >>>> >
>> >>>> > On Fri, Jul 1, 2011 at 2:44 AM, Lance Norskog <goksron@gmail.com> wrote:
>> >>>> >
>> >>>> >> I did some testing and made a lot of pretty charts:
>> >>>> >>
>> >>>> >> http://ultrawhizbang.blogspot.com/
>> >>>> >>
>> >>>> >> If you want to get quick visualizations of your clusters, this is
>> >>>> >> a great place to start.
>> >>>> >>
>> >>>> >> --
>> >>>> >> Lance Norskog
>> >>>> >> goksron@gmail.com
>> >>>> >>
>> >>>> >
>> >>>>
>> >>>>
>> >>>>
>> >>>> --
>> >>>> Lance Norskog
>> >>>> goksron@gmail.com
>> >>>>
>> >>>
>> >>>
>> >>
>> >
>> >
>> >
>> > --
>> > Lance Norskog
>> > goksron@gmail.com
>> >
>>
>>
>>
>> --
>> Lance Norskog
>> goksron@gmail.com
>>
>



-- 
Lance Norskog
goksron@gmail.com
