mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Adding dimensions to an existing TF-IDF vector
Date Fri, 24 Jun 2011 17:03:28 GMT
It is quite possible.

If the new columns represent a relatively small contribution rather than a
wholesale change in the statistics of the corpus (which is almost always
true) then you can just add these columns and compute IDF weights for the
new terms based on the updated corpus statistics.  You don't need to update
the old IDF weights because the number of documents isn't going to change a
lot and the old terms probably occur in the new documents at about the same
rate anyway.
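
A minimal sketch of that idea, in plain Python rather than Mahout code. The smoothed IDF formula, the helper names (idf, extend_idf) and the toy document counts are all assumptions made for illustration; substitute whatever formula your existing vectors were built with.

```python
import math

def idf(num_docs, doc_freq):
    # One common smoothed IDF variant (an assumption, not necessarily
    # the formula used to build the existing vectors).
    return math.log(num_docs / (1.0 + doc_freq)) + 1.0

def extend_idf(old_idf, updated_doc_freq, num_docs):
    """Copy old_idf and add weights only for terms it does not contain."""
    extended = dict(old_idf)
    for term, df in updated_doc_freq.items():
        if term not in extended:
            extended[term] = idf(num_docs, df)
    return extended

# Weights computed on the original (hypothetical) corpus of 1000 documents.
old_idf = {"apple": idf(1000, 50), "pear": idf(1000, 10)}

# Document frequencies measured after adding 20 new documents.
updated_doc_freq = {"apple": 52, "pear": 10, "kiwi": 3}
new_idf = extend_idf(old_idf, updated_doc_freq, num_docs=1020)

# How far the old weight for "apple" would have moved had we recomputed it.
drift = abs(idf(1020, 52) - old_idf["apple"])
print(new_idf["kiwi"], drift)
```

With numbers like these the drift in the old weight is small, which is the reason for leaving the old weights alone.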

Of course, you do have to go back through and add the zero columns to the old
data.
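
For dense storage, that back-fill is just padding. A small sketch, where pad_with_zeros and the sample values are hypothetical:

```python
def pad_with_zeros(vector, new_size):
    """Return a copy of vector extended with zeros up to new_size columns."""
    if new_size < len(vector):
        raise ValueError("new_size is smaller than the existing vector")
    return list(vector) + [0.0] * (new_size - len(vector))

# Two old documents with 3 columns each, extended to 5 columns.
old_docs = [[0.5, 0.0, 1.2], [0.0, 0.7, 0.0]]
padded = [pad_with_zeros(v, 5) for v in old_docs]
print(padded)
```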

One work-around is to use really, really big vectors to start with and hope
that nobody ever accidentally fills in one of these vectors.  This is cool
with sparse vectors since zeros aren't stored, so all of the unused columns
have no impact.  New vectors can have new columns, but old ones need no
change since they effectively already have these columns.
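
To illustrate why the oversized sparse vector costs nothing, here is a sketch that uses a plain dict as the sparse vector. CARDINALITY, the index values and the dot helper are assumptions for the example, not Mahout's vector classes.

```python
# Nominal size is huge, but only non-zero entries are stored.
CARDINALITY = 2**31 - 1

def dot(a, b):
    """Dot product of two sparse vectors stored as {index: weight} dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return sum(w * b.get(i, 0.0) for i, w in a.items())

# An old document that only knows about the original columns.
old_doc = {0: 0.5, 2: 1.2}

# A new document that also uses a column added later.
new_doc = {2: 0.4, 1_000_000: 0.9}

assert all(0 <= i < CARDINALITY for i in list(old_doc) + list(new_doc))
similarity = dot(old_doc, new_doc)
print(similarity)
```

The old document needs no change: column 1,000,000 is implicitly zero in it, so it contributes nothing to the product.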

A second possible work-around is to use the hashed encoding.  This costs a
bit more for encoding, but it gives you static vector sizes.  For some
algorithms, this is a huge win (SGD for example where we need to allocate a
dense matrix).
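
A rough sketch of hashed encoding, assuming a single hash probe per term and CRC32 as the hash function. This is not Mahout's own encoder implementation; NUM_FEATURES and hashed_vector are hypothetical names.

```python
import zlib

NUM_FEATURES = 1024  # fixed vector size, chosen up front

def hashed_vector(term_weights, num_features=NUM_FEATURES):
    """Encode {term: weight} into a fixed-size dense vector by hashing terms."""
    vec = [0.0] * num_features
    for term, weight in term_weights.items():
        # CRC32 is deterministic across runs, unlike Python's built-in hash().
        index = zlib.crc32(term.encode("utf-8")) % num_features
        vec[index] += weight  # colliding terms share a slot
    return vec

old_doc = hashed_vector({"apple": 0.5, "pear": 1.2})
# A term never seen before still lands inside the same fixed-size vector.
new_doc = hashed_vector({"apple": 0.4, "kiwi": 0.9})
print(len(old_doc), len(new_doc))
```

The trade-off is that two terms can collide in one slot; a larger NUM_FEATURES makes that less likely.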


On Fri, Jun 24, 2011 at 8:52 AM, Mark <static.void.dev@gmail.com> wrote:

> Is it possible to add more dimensions to an existing TF-IDF vector?  If so
> how would it be possible to determine what appropriate weighting to give to
> these new fields to make sure it's not too much/too little?
>
