On Tue, Jun 23, 2009 at 8:05 PM, Paul Jones <paul_jonez99@yahoo.co.uk> wrote:
> tks Ted, but if it's a live system and you have 10 million documents, then
> isn't the computation on the fly going to be a pain if you add, say, 1000
> docs per hour or whatever? Which is why I was assuming that it's a batch
> process.
>
That really depends on what you are doing. Mostly, it turns out vastly
better. For instance, if you have 10^7 documents as you say, then you have
10^14 distances to think about. In fact, most of those distances are large
enough that you can ignore them, so it is nice to have a (notional/virtual)
sparse similarity matrix instead. The cost of computing the entire doc-doc
matrix is substantial and more than you can typically do on the fly, but you
almost never need the entire similarity matrix. So in many cases it is
cheaper to compute just the part that you need on the fly.
There are exceptions, notably in recommendations, where it pays to compute a
large number of item-item similarities, sparsify the result and store it as
a binary vector.
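To make the "compute just the part you need" point concrete, here is a
minimal sketch (toy data, all names hypothetical) using an inverted index:
to score documents similar to one document, you only ever touch candidates
that share at least one term with it, so the vast majority of the 10^14
pairwise distances are never computed at all.

```python
from collections import defaultdict
import math

# Toy corpus: doc id -> term counts (hypothetical data)
docs = {
    0: {"apple": 2, "fruit": 1},
    1: {"apple": 1, "pie": 3},
    2: {"car": 2, "engine": 1},
}

# Inverted index: term -> set of doc ids containing it
index = defaultdict(set)
for d, terms in docs.items():
    for t in terms:
        index[t].add(d)

def norm(v):
    return math.sqrt(sum(x * x for x in v.values()))

def similar_docs(d):
    """Cosine similarities from doc d, computed on the fly.

    Only documents sharing at least one term with d are examined;
    everything else is implicitly at similarity zero."""
    candidates = set()
    for t in docs[d]:
        candidates |= index[t]
    candidates.discard(d)
    sims = {}
    for c in candidates:
        dot = sum(docs[d].get(t, 0) * w for t, w in docs[c].items())
        sims[c] = dot / (norm(docs[d]) * norm(docs[c]))
    return sims

# Doc 2 shares no terms with doc 0, so it is never even looked at.
```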
> Also I think I have worked out what I meant about the relationships between
> the words themselves. I think I was looking to build a term-term matrix
> instead of a term-doc matrix, whereby I have the frequency of occurrence of
> each word alongside each other word in a doc. (I guess an easy way to start
> is that the two words can co-occur anywhere in the doc.)
>
This is an excellent first step for word relationships.
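Concretely (toy data, numpy assumed available): if A is the doc x term count
matrix, then the term-term co-occurrence matrix you describe is just A' A.

```python
import numpy as np

# Toy doc x term count matrix A: 3 docs, 4 terms (hypothetical data)
# terms: ["cat", "dog", "pet", "car"]
A = np.array([
    [1, 1, 1, 0],   # doc mentioning cat, dog, pet
    [0, 1, 1, 0],   # doc mentioning dog, pet
    [0, 0, 0, 2],   # doc mentioning car twice
])

# Term-term co-occurrence: C[i, j] sums, over all docs, the product of
# the counts of term i and term j in that doc.
C = A.T @ A
```

Row i of C is then the co-occurrence profile of term i, which is exactly the
vector you would compare across terms.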
> If done, hopefully the 'distance' between the two vectors should give me a
> relative relationship. I realise there are lots of problems with this
> approach, i.e. I don't know how the words are related... I just know that
> they are.
>
Indeed.
If you look at your problem again, there are some interesting connections
between what you want to do and matrix algebra. For instance, let's call
your document x term matrix of counts A. Then the document-level
co-occurrence matrix is A' A. It often helps to use inverse document
frequency scaled values instead of counts here, or to post-process the raw
counts to find anomalously large ones. The next step, however, is to not
just find words that co-occur, but to find words that have similar patterns
of co-occurrence with each other. If two words are interchangeable, it is
reasonable to expect that they wouldn't ever occur together, but you would
expect that they would co-occur with other words in just the same way. You
can find words that have similar co-occurrence patterns by comparing rows
(or columns) of the co-occurrence matrix with other rows of the same matrix.
Using a matrix product does this wholesale:
(A' A)' (A' A) = A'A A'A
If you decompose A using singular values, you get A = U' s V, so A' A = V'
s^2 V and A' A A' A = V' s^4 V. What will happen here is that the vectors
with the largest few singular values will dominate, so you can approximate
this result by doing a term-level random walk.
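You can check the A'A A'A = V' s^4 V identity numerically on a toy matrix
(numpy's svd returns A = U diag(s) Vt, so its Vt plays the role of V in the
notation above):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((5, 4))              # toy doc x term matrix

# numpy convention: A = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

lhs = A.T @ A @ A.T @ A             # (A'A)'(A'A) = A'A A'A
rhs = Vt.T @ np.diag(s ** 4) @ Vt   # V' s^4 V in the email's notation

assert np.allclose(lhs, rhs)
```

The s^4 is why the top singular vectors dominate: raising the singular
values to the fourth power makes the gap between the largest and the rest
much steeper.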
while (many repetitions) {
    Term t = random term;
    Document d = random document containing t;
    Term u1 = random word in d;
    d = random document containing u1;
    Term u2 = random word in d;
    count([t, u2]);
}
Pairs that have lots of counts will be the related words you want. Note how
it is very easy to implement this using a doc x term matrix, especially one
implemented using something like Lucene.
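As a runnable sketch of that walk (toy in-memory corpus with uniform
sampling; a real implementation would use a Lucene index in place of the
`index` dict, and all names here are hypothetical):

```python
import random
from collections import Counter, defaultdict

# Toy corpus: each doc is a list of terms (hypothetical data)
docs = [
    ["cat", "pet", "fur"],
    ["dog", "pet", "fur"],
    ["cat", "dog", "pet"],
    ["car", "engine", "wheel"],
    ["car", "engine", "fuel"],
]

# term -> indexes of docs containing it (the role Lucene would play)
index = defaultdict(list)
for i, d in enumerate(docs):
    for t in set(d):
        index[t].append(i)

terms = sorted(index)
counts = Counter()
random.seed(0)

for _ in range(10000):
    t = random.choice(terms)
    d = docs[random.choice(index[t])]    # random doc containing t
    u1 = random.choice(d)                # random word in d
    d = docs[random.choice(index[u1])]   # random doc containing u1
    u2 = random.choice(d)                # random word in d
    counts[(t, u2)] += 1

# High-count pairs, e.g. ("cat", "dog"), are the related terms;
# ("cat", "engine") is unreachable in two hops and never appears.
```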
(Note that you probably want to preferentially select documents that contain
the word of interest in excess of expected rates, and you should
preferentially pick new words that occur anomalously often in the document.
Otherwise, this algorithm will focus in on common words.)
