mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: mahout PLSI (with some lucene, thrown in)
Date Fri, 26 Jun 2009 17:44:04 GMT
Try looking at the random indexing literature.  Sparse binary context
vectors should give you pretty much what you need for the context
similarity.
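
As a rough illustration of what sparse binary context vectors can look like, here is a minimal Python sketch. It is not taken from any of the packages mentioned in this thread. The constants DIM, ACTIVE, WINDOW and KEEP, the hash-based seeding, and the Jaccard comparison are all assumptions chosen for the example.

```python
import hashlib
import random
from collections import Counter, defaultdict

DIM = 2000     # assumed size of the index space
ACTIVE = 8     # assumed number of non-zero positions per index vector
WINDOW = 2     # assumed context window (words on each side)
KEEP = 40      # assumed number of strongest components kept per word


def index_vector(term):
    """Sparse binary index vector for a term, stored as a set of positions."""
    digest = hashlib.sha256(term.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")  # same term, same positions
    return frozenset(random.Random(seed).sample(range(DIM), ACTIVE))


def context_vectors(sentences):
    """Accumulate neighbours' index vectors, then binarize per word."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - WINDOW), min(len(tokens), i + WINDOW + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word].update(index_vector(tokens[j]))
    # Keep only the positions that were hit most often (binary, sparse).
    return {w: frozenset(p for p, _ in c.most_common(KEEP))
            for w, c in counts.items()}


def jaccard(a, b):
    """Overlap of two binary vectors represented as position sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


sentences = [
    ["the", "car", "drives", "fast"],
    ["the", "automobile", "drives", "fast"],
    ["a", "banana", "tastes", "sweet"],
]
vectors = context_vectors(sentences)
```

Words that appear in the same contexts ("car" and "automobile" above) end up with overlapping position sets, so jaccard(vectors["car"], vectors["automobile"]) comes out higher than the score against "banana".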

You can encode your existing synonyms and learn cooccurrence-based synonyms
at the same time.  To let you query using any of these systems, you would
have to increase the size of your index, but unless you have a huge system,
that should be relatively easy.

The idea is that your lucene index would contain separate fields for:

a) the original words

b) the synonym sets for the original words

c) the non-zero context vector components

For a query, you can form three components that correspond to these three
fields and you can include or exclude these at will to find out what works
well.
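
The three-field layout and the switchable query components could be sketched roughly as follows. This is plain Python that only builds strings in Lucene's classic field:(terms) query syntax; it does not call Lucene. The field names (words, synonyms, context), the SYNONYM_SETS table, and the "c<position>" token encoding are hypothetical names picked for the example.

```python
# Hypothetical hand-built synonym table: word -> synonym-set id.
SYNONYM_SETS = {
    "car": "syn_vehicle",
    "automobile": "syn_vehicle",
}


def make_document(tokens, context_positions):
    """Build the three fields: words, synonym sets, context components."""
    return {
        "words": list(tokens),
        "synonyms": sorted({SYNONYM_SETS[t] for t in tokens
                            if t in SYNONYM_SETS}),
        # Each non-zero context vector position becomes one indexable token.
        "context": ["c%d" % p for p in sorted(context_positions)],
    }


def make_query(tokens, context_positions,
               use_words=True, use_synonyms=True, use_context=True):
    """Form one clause per field; each clause can be switched on or off."""
    fields = make_document(tokens, context_positions)
    enabled = {"words": use_words,
               "synonyms": use_synonyms,
               "context": use_context}
    clauses = []
    for name in ("words", "synonyms", "context"):
        if enabled[name] and fields[name]:
            clauses.append("%s:(%s)" % (name, " ".join(fields[name])))
    return " ".join(clauses)


full_query = make_query(["car"], {17, 3})
no_context = make_query(["car"], {17, 3}, use_context=False)
```

Dropping a clause (for example use_context=False) is one way to compare how much each field contributes to retrieval quality.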

http://www.sics.se/~mange/random_indexing.html
http://code.google.com/p/semanticvectors/
http://portal.acm.org/citation.cfm?id=146565.146569
http://www.d.umn.edu/~tpederse/Pubs/eacl2006-vector.pdf


On Fri, Jun 26, 2009 at 5:57 AM, Paul Jones <paul_jonez99@yahoo.co.uk> wrote:

> What I had in mind was to
> a) start with existing synonyms
>
> and then
>
> b) add to this system using various algos to determine word distance
>
> I have stayed away from solr, because from what I have read everyone seems
> to be pointing to it as an enterprise app, whereas I need something bigger.
> I am not sure if this is correct.
>
