mahout-user mailing list archives

From Paul Jones <paul_jone...@yahoo.co.uk>
Subject Re: LSI, cosine and others which use vectors
Date Wed, 24 Jun 2009 13:02:02 GMT
>>> inline




________________________________
From: Ted Dunning <ted.dunning@gmail.com>
To: mahout-user@lucene.apache.org
Sent: Wednesday, 24 June, 2009 7:37:38
Subject: Re: LSI, cosine and others which use vectors

On Tue, Jun 23, 2009 at 8:05 PM, Paul Jones <paul_jonez99@yahoo.co.uk> wrote:

> tks Ted, but if it's a live system and you have 10 million documents, then
> isn't the computation on the fly going to be a pain if you add, say, 1000
> docs per hour, which is why I was assuming that it's a batch
> process.
>

That really depends on what you are doing.  Mostly, it turns out to be vastly
better.  For instance, if you have 10^7 documents as you say, then you have
10^14 distances to think about.  In fact, most of those distances are large
enough that you can ignore them, so it is nice to have a (notional/virtual)
sparse similarity matrix instead.  The cost of computing the entire doc-doc
matrix is substantial and more than you can typically do on the fly, but you
almost never need the entire similarity matrix.  So it is cheaper in many
cases to just compute the part that you need on the fly.
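
A minimal sketch of computing just the part you need, assuming simple
in-memory indexes (postings: term -> doc ids, doc_vectors: doc -> term
weights; these names are illustrative, not Mahout APIs): given one
document, score only the documents that share at least one term with it.

     import math

     def similar_docs(query_doc, doc_vectors, postings):
         # Computes one row of the notional doc-doc similarity matrix,
         # touching only candidates that share a term with query_doc.
         q = doc_vectors[query_doc]
         candidates = set()
         for term in q:
             candidates.update(postings[term])
         candidates.discard(query_doc)

         def norm(v):
             return math.sqrt(sum(w * w for w in v.values()))

         sims = {}
         for d in candidates:
             v = doc_vectors[d]
             dot = sum(w * v.get(t, 0.0) for t, w in q.items())
             sims[d] = dot / (norm(q) * norm(v))
         return sims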

>>>> I don't follow how you would compute just part of the matrix.  I.e., if
you have created the term-doc matrix, surely the data would need to be
weighted across the whole set in order to give you relevance; I'm not sure
how you would slice out just the subset of the matrix you want to work with.

There are exceptions, notably in recommendations, where it pays to compute a
large number of item-item similarities, sparsify the result, and store it as
a binary vector.
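
A minimal sketch of that sparsification step, assuming each item's
similarity row is a plain dict (the names are illustrative, not Mahout's
API):

     def sparsify(sim_row, k=50):
         # Keep only the k strongest similarities for one item and record
         # just which items survived -- effectively a binary vector of
         # related / not-related flags.
         top = sorted(sim_row.items(), key=lambda kv: kv[1], reverse=True)[:k]
         return {item for item, sim in top}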

[snip]

Indeed.

If you look at your problem again, there are some interesting connections
between what you want to do and matrix algebra.  For instance, let's call
your document x term matrix of counts A.  Then the document-level
cooccurrence matrix is A' A.  It often helps to use inverse-document-
frequency scaled values instead of raw counts here, or to post-process the
raw counts to find anomalously large ones.  The next step, however, is not
just to find words that co-occur, but to find words that have similar
patterns of cooccurrence with each other.  If two words are interchangeable,
it is reasonable to expect that they wouldn't ever occur together, but you
would expect that they would co-occur with other words in just the same way.
You can find words that have similar cooccurrence patterns by comparing rows
(or columns) of the cooccurrence matrix with other rows of the same matrix.
Using a matrix product does this wholesale:

           (A' A)' (A' A) = A'A A'A
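
A tiny worked example of those two products, assuming numpy (my own
illustration, not from the original mail):

     import numpy as np

     # Toy doc x term count matrix A: 3 documents over 4 terms.
     A = np.array([[1, 1, 0, 0],
                   [0, 1, 1, 0],
                   [1, 0, 0, 1]])

     C = A.T @ A    # term-term cooccurrence: C[i, j] counts how often
                    # terms i and j appear in the same document
     C2 = C.T @ C   # compares rows of C with each other, i.e. finds terms
                    # with similar *patterns* of cooccurrence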

If you decompose A using singular values, you get A = U s V', so A' A =
V s^2 V' and A' A A' A = V s^4 V'.  What will happen here is that the
vectors with the largest few singular values will dominate, so you can
approximate this result by doing a term-level random walk.
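
Before the walk itself, a quick numerical check of that identity,
continuing the toy matrix above (again my own sketch, using numpy's SVD):

     U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U diag(s) V'
     C = A.T @ A
     print(np.allclose(C, Vt.T @ np.diag(s**2) @ Vt))       # True
     print(np.allclose(C @ C, Vt.T @ np.diag(s**4) @ Vt))   # True
     # Raising singular values to the 4th power makes the largest few
     # dominate, which is what the random walk below approximates.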

     # The original was pseudocode; this is a Python rendering.  The index
     # structures (docs_containing: term -> list of docs, words_in: doc ->
     # list of terms) are assumed inputs, e.g. derived from a Lucene index.
     import random
     from collections import Counter

     def cooccurrence_walk(terms, docs_containing, words_in, reps):
         counts = Counter()
         for _ in range(reps):                        # many repetitions
             t = random.choice(terms)                 # random term
             d = random.choice(docs_containing[t])    # random document containing t
             u1 = random.choice(words_in[d])          # random word in d
             d = random.choice(docs_containing[u1])   # random document containing u1
             u2 = random.choice(words_in[d])          # random word in d
             counts[(t, u2)] += 1                     # count([t, u2])
         return counts

Pairs that have lots of counts will be the related words you want.  Note how
it is very easy to implement this using a doc x term matrix, especially one
implemented using something like Lucene.
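
For instance, pulling the top pairs out of the walk's counts (continuing
the sketch above; most_common is standard on Python's Counter):

     counts = cooccurrence_walk(terms, docs_containing, words_in, 1000000)
     for (t, u2), n in counts.most_common(20):
         print(t, u2, n)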


>>>> Okay, aside from being confused by the matrix algebra :-), I am confused
by "easy to implement using a doc x term matrix", i.e. I'm not sure how a
doc-term matrix would work out the similarity between words; isn't it working
out the occurrence of words in a doc?  Maybe I am misunderstanding.  Let's say
I have a matrix built where the docs are the columns and the words are the
rows.  Now, my limited understanding from what I have read is that this
matrix can be represented as a number of vectors.  E.g., say we have one
document with 3 words; then the x/y/z axes represent each word and its
frequency of occurrence, and the point in space forming the vector depicts
those words in relation to that document.

And this can be expanded.  Now if we have 2 documents, with 2 more words, we
have another point.  The distance between the points shows how similar they
are, and hence how similar the documents are to each other.

So far so good, but I am unsure how this translates into showing how similar
the words themselves are, i.e. co-occurrence; would that not have to be a
term-term matrix?
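
A concrete version of the picture described above (my own toy illustration,
not from the thread):

     import math

     # Vocabulary: [cat, dog, fish].  Each document is a vector of term
     # frequencies, i.e. a point in term space.
     d1 = [2, 1, 0]
     d2 = [1, 1, 0]

     def norm(v):
         return math.sqrt(sum(x * x for x in v))

     dot = sum(a * b for a, b in zip(d1, d2))
     print(dot / (norm(d1) * norm(d2)))   # cosine similarity of the two docs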
[snip]


Tks again, 

Paul



      