mahout-user mailing list archives

From Paul Jones <>
Subject Re: LSI, cosine and others which use vectors
Date Wed, 24 Jun 2009 03:05:25 GMT
Thanks Ted, but if it's a live system with 10 million documents, isn't computing on the fly
going to be a pain if you add, say, 1000 docs per hour? That is why I was assuming that it's
a batch process.

Also, I think I have worked out what I meant about the relationships between the words themselves.
I was looking to build a term-term matrix instead of a term-doc matrix, holding the frequency
with which each word occurs alongside each other word in a doc. (I guess the easy way to start
is to let the two words co-occur anywhere in the doc.) If done, hopefully the 'distance'
between two word vectors should give me a relative measure of their relationship. I realise
there are lots of problems with this approach, i.e. I don't know how the words are related...
I just know that they are.
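
To make the term-term idea concrete, here is a minimal sketch of what I have in mind. It is
only an illustration: the toy documents, the function names (cooccurrence, cosine) and the
choice to count the number of docs in which two words both appear are my own assumptions,
not anything Mahout provides.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence(docs):
    """Build term-term rows: rows[a][b] = number of docs containing both a and b."""
    rows = {}
    for doc in docs:
        # Words may co-occur anywhere in the doc, so only the set of words matters.
        words = sorted(set(doc.lower().split()))
        for a, b in combinations(words, 2):
            rows.setdefault(a, Counter())[b] += 1
            rows.setdefault(b, Counter())[a] += 1
    return rows


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dict-like counters."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus, one string per document.
docs = ["cat sat mat", "cat sat hat", "dog ran park"]
rows = cooccurrence(docs)

sim_mat_hat = cosine(rows["mat"], rows["hat"])  # both appear alongside "cat" and "sat"
sim_cat_dog = cosine(rows["cat"], rows["dog"])  # share no neighbouring words
```

Words that keep the same company end up with similar rows, which is the "relative
relationship" I am after, but nothing here says what kind of relationship it is.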


From: Ted Dunning <>
Sent: Wednesday, 24 June, 2009 1:52:41
Subject: Re: LSI, cosine and others which use vectors

There are two kinds of changes here.

The first kind is when a single document changes.  That will change the
distances between that document and others, but it won't change the
distances between two other documents.  Most importantly, it won't change
the distance between queries and other documents.

The second kind of change is due to the first and is relatively
unavoidable.  When a document changes, almost inevitably the corpus word
frequencies will change as a result.  This changes the weightings applied to
particular terms in documents.  When you have many documents of which few
change, these changes will be small enough to ignore.
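
A rough back-of-the-envelope sketch of why the weighting shift is small. The weighting
scheme is not named in this thread, so the log(N / df) form of inverse document frequency
is an assumption, as are the figures of 50,000 documents containing the term and 10 new
ones per batch. The corpus size and the 1000-document batch are the numbers from the
question.

```python
import math


def idf(num_docs, docs_with_term):
    """Inverse document frequency in the simple log(N / df) form (an assumed variant)."""
    return math.log(num_docs / docs_with_term)


# Before the update: 10 million docs, a term appearing in 50,000 of them (hypothetical).
before = idf(10_000_000, 50_000)

# After one hour: 1000 new docs, 10 of which contain the term (hypothetical).
after = idf(10_001_000, 50_010)

# Fractional shift in the term's weight caused by the update.
relative_change = abs(after - before) / before
```

Under these assumptions the weight moves by a tiny fraction of its value, far less than
would reorder any but the closest similarity scores.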

In practice, you don't much care about what has changed because a live
system computes all similarities or distances on the fly based on the
current state.   If the similarities that you have not yet computed change,
you don't care.
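
As a sketch of what "on the fly" could look like, the toy index below keeps only term
counts and document frequencies up to date and computes similarities at query time from
whatever the current state is. Everything in it is hypothetical: the LiveIndex class, its
method names, and the log(1 + N / df) weighting are illustrative assumptions, and a real
system would use an inverted index instead of scanning every document.

```python
import math
from collections import Counter


class LiveIndex:
    """Toy in-memory index: nothing is precomputed except counts and document frequencies."""

    def __init__(self):
        self.docs = {}       # doc_id -> Counter of term counts
        self.df = Counter()  # term -> number of docs currently containing it

    def upsert(self, doc_id, text):
        """Add or replace one document, adjusting document frequencies incrementally."""
        old = self.docs.get(doc_id)
        if old is not None:
            self.df.subtract(old.keys())
        new = Counter(text.lower().split())
        self.docs[doc_id] = new
        self.df.update(new.keys())

    def _vector(self, counts):
        # Weights use the corpus statistics as they are right now.
        n = len(self.docs)
        return {t: c * math.log(1 + n / self.df[t])
                for t, c in counts.items() if self.df[t] > 0}

    @staticmethod
    def _cosine(u, v):
        dot = sum(w * v[t] for t, w in u.items() if t in v)
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        if norm_u == 0 or norm_v == 0:
            return 0.0
        return dot / (norm_u * norm_v)

    def query(self, text, top_k=3):
        """Score every document against the query at query time (brute force)."""
        q = self._vector(Counter(text.lower().split()))
        scored = [(self._cosine(q, self._vector(d)), doc_id)
                  for doc_id, d in self.docs.items()]
        return sorted(scored, reverse=True)[:top_k]


index = LiveIndex()
index.upsert("a", "apache mahout clustering")
index.upsert("b", "cooking pasta recipes")
first = index.query("mahout clustering")

# Document "b" is recrawled with new content; the next query simply sees the new state.
index.upsert("b", "mahout clustering vectors")
second = index.query("mahout vectors")
```

No stored similarity has to be invalidated when "b" changes, because none was stored.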

On Tue, Jun 23, 2009 at 5:01 PM, Paul Jones <> wrote:

> Yes, another question, I am going through a rapid learning curve...
> All these vector-based systems, which require you to build a term-doc matrix
> etc., are they of any use in a system where the data is changing? I.e. let's
> assume the docs are webpages, which are being crawled and hence updated.
> Surely if there is a vector diagram being formed, then the position of these
> vectors changes based on the changes (size, content) of the entire matrix,
> or am I missing something here?
> If the above is correct, then in an actual live project how is this done?
> Are distances worked out on a per-day type of basis, and the indexes then
> updated?
> Paul

Ted Dunning, CTO

111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
858-414-0013 (m)
408-773-0220 (fax)
