mahout-user mailing list archives

From Ted Dunning
Subject Re: LSI, cosine and others which use vectors
Date Wed, 24 Jun 2009 00:52:41 GMT
There are two kinds of changes here.

The first kind is when a single document changes.  That will change the
distances between that document and others, but it won't change the
distances between two other documents.  Most importantly, it won't change
the distance between queries and other documents.
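
As a minimal sketch of that first point, assuming plain term-count vectors
with no corpus-wide weighting and cosine as the similarity (the documents
and the `cosine` helper are made up for illustration):

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical toy corpus: raw counts only, no IDF or other global weights.
docs = {
    "a": Counter("apache mahout clusters documents".split()),
    "b": Counter("mahout builds term vectors".split()),
    "c": Counter("term vectors measure document distance".split()),
}
query = Counter("term vectors".split())

before_ab = cosine(docs["a"], docs["b"])
before_bc = cosine(docs["b"], docs["c"])
before_qb = cosine(query, docs["b"])

# Document "a" is re-crawled and its content changes.
docs["a"] = Counter("crawled page was rewritten about term vectors".split())

after_ab = cosine(docs["a"], docs["b"])
after_bc = cosine(docs["b"], docs["c"])
after_qb = cosine(query, docs["b"])

print("a-b changed:    ", before_ab != after_ab)
print("b-c unchanged:  ", before_bc == after_bc)
print("query-b unchanged:", before_qb == after_qb)
```

Only the pairs that involve the changed document move; everything else is
computed from vectors that were never touched.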

The second kind of change is a side effect of the first and is largely
unavoidable.  When a document changes, the corpus word frequencies will
almost inevitably change as a result.  This changes the weightings applied
to particular terms in documents.  When you have many documents, of which
few change, these changes will be small enough to ignore.
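
To put rough numbers on that, here is a small sketch.  It assumes the term
weighting is a plain IDF of the form log(N / df), which is only one common
choice and not necessarily what any particular system uses; the corpus
sizes are invented.

```python
import math

def idf(num_docs, doc_freq):
    """Assumed weighting: plain inverse document frequency, log(N / df)."""
    return math.log(num_docs / doc_freq)

def idf_shift(num_docs, doc_freq):
    """Absolute IDF change when one existing document starts using the term.

    The corpus size stays the same because the document changed rather
    than being added; only the term's document frequency goes up by one.
    """
    return abs(idf(num_docs, doc_freq + 1) - idf(num_docs, doc_freq))

# Tiny corpus: 10 documents, the term appeared in only 1 of them.
small_corpus_shift = idf_shift(10, 1)

# Large corpus: 100,000 documents, the term appeared in 1,000 of them.
large_corpus_shift = idf_shift(100_000, 1_000)

print(f"small corpus shift: {small_corpus_shift:.4f}")
print(f"large corpus shift: {large_corpus_shift:.6f}")
```

In the tiny corpus a single edit moves the weight by log(2); in the large
one the shift is under 0.001, which is the sense in which it can be ignored.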

In practice, you don't much care about what has changed, because a live
system computes all similarities or distances on the fly based on the
current state.  If similarities that you have not yet computed change,
you don't care.
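
A rough sketch of what "on the fly" can look like.  `LiveIndex` is a
hypothetical toy class, not a Mahout or Lucene API, and the TF-IDF form used
here (count * log(1 + N / df)) is an assumption picked for illustration.
Nothing pairwise is stored; each search uses whatever the corpus statistics
are at that moment.

```python
import math
from collections import Counter

class LiveIndex:
    """Toy in-memory index (hypothetical, for illustration only)."""

    def __init__(self):
        self.docs = {}             # doc id -> Counter of term counts
        self.doc_freq = Counter()  # term -> number of docs containing it

    def upsert(self, doc_id, text):
        """Add or replace a document and keep document frequencies current."""
        old = self.docs.get(doc_id)
        if old is not None:
            for term in old:
                self.doc_freq[term] -= 1
        new = Counter(text.lower().split())
        for term in new:
            self.doc_freq[term] += 1
        self.docs[doc_id] = new

    def _weights(self, counts):
        """Weight term counts using the corpus statistics as they are now."""
        n = len(self.docs)
        return {t: c * math.log(1 + n / self.doc_freq[t])
                for t, c in counts.items() if self.doc_freq[t] > 0}

    def search(self, text):
        """Rank every document by cosine similarity, computed at query time."""
        q = self._weights(Counter(text.lower().split()))
        q_norm = math.sqrt(sum(w * w for w in q.values()))
        results = []
        for doc_id, counts in self.docs.items():
            d = self._weights(counts)
            d_norm = math.sqrt(sum(w * w for w in d.values()))
            dot = sum(w * d[t] for t, w in q.items() if t in d)
            norm = q_norm * d_norm
            results.append((doc_id, dot / norm if norm else 0.0))
        return sorted(results, key=lambda r: r[1], reverse=True)

index = LiveIndex()
index.upsert("p1", "term vectors for web pages")
index.upsert("p2", "crawling web pages daily")
first = index.search("term vectors")

# The crawler fetches a new version of p2; only its entry is replaced.
index.upsert("p2", "term vectors change when pages change")
second = index.search("term vectors")

print(first)
print(second)
```

The update touches one document's counts and a handful of frequency
counters; there is no stored matrix of distances to rebuild.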

On Tue, Jun 23, 2009 at 5:01 PM, Paul Jones wrote:

> Yes, another question; I am going through a rapid learning curve...
> All these vector-based systems, which require you to build a term-doc etc.,
> are they of any use in a system where the data is changing, i.e. let's assume
> the docs are webpages, which are being crawled, and hence updated. Surely if
> there is a vector diagram being formed, then the position of these vectors
> changes based on the changes (size, content) of the entire matrix, or am I
> missing something here?
> If the above is correct, then in an actual live project how is this done?
> Are distances worked out on a per-day type of basis, and the indexes then
> updated?
> Paul

Ted Dunning, CTO

111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
858-414-0013 (m)
408-773-0220 (fax)
