mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: mahout PLSI (with some lucene, thrown in)
Date Wed, 24 Jun 2009 17:40:28 GMT
I just read the introduction paper and was pleased to see your reference to
Robert Hecht-Nielsen's work.

They omitted, however, a large body of work that predated their other
references by nearly a decade.  The algorithm presented is essentially
identical to so-called one-step learning that derived from early work at HNC
Software and was refined during my tenure as Chief Scientist at Aptex.  The
only important difference between Random Indexing and our earlier work
relates to the domain of the original vectors.  In our case, we mostly used
vectors sampled from a multi-dimensional unit normal distribution; in Random
Indexing, they use ternary or binary vectors.  We also experimented with
binary vectors, but the hardware of the time favored the continuous
representation so we focussed on that formulation.
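
A minimal sketch of the one-pass idea described above, not the HNC/Aptex or
Random Indexing implementation: each term gets a random index vector (dense
Gaussian or sparse ternary), and a term's context vector is the sum of the
index vectors of the terms it co-occurs with.  The toy corpus, the dimension
of 256, and all function names are assumptions made for illustration.

```python
import math
import random


def gaussian_index_vector(dim, rng):
    """Dense index vector sampled from a unit normal, scaled to unit length."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def ternary_index_vector(dim, nonzero, rng):
    """Sparse index vector: `nonzero` entries are +1 or -1, the rest are 0."""
    v = [0.0] * dim
    for i in rng.sample(range(dim), nonzero):
        v[i] = rng.choice((-1.0, 1.0))
    return v


def train_context_vectors(docs, index_vectors, dim):
    """Single pass: add the index vector of every co-occurring term."""
    context = {t: [0.0] * dim for t in index_vectors}
    for doc in docs:
        terms = set(doc)
        for t in terms:
            for u in terms:
                if u != t:
                    iv = index_vectors[u]
                    cv = context[t]
                    for i in range(dim):
                        cv[i] += iv[i]
    return context


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


rng = random.Random(42)
DIM = 256  # assumed dimension for the toy example
docs = (
    [["red", "paint", "colour"]] * 5
    + [["crimson", "paint", "colour"]] * 5
    + [["lucene", "index", "search"]] * 5
)
vocab = sorted({t for d in docs for t in d})
index_vectors = {t: gaussian_index_vector(DIM, rng) for t in vocab}
context = train_context_vectors(docs, index_vectors, DIM)

# "red" and "crimson" never share a document, but they share neighbours.
sim_related = cosine(context["red"], context["crimson"])
sim_unrelated = cosine(context["red"], context["lucene"])
print(sim_related, sim_unrelated)
```

Swapping gaussian_index_vector for ternary_index_vector in the dictionary
comprehension gives the Random Indexing flavour; the training pass itself is
unchanged.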

Also, the algorithm presented is essentially one iteration of a power method
extraction of singular vectors.  As presented, this algorithm cannot be used
with more than 2-3 iterations because it collapses onto the dominant
eigenvectors.  Lanczos gave an algorithm that avoids this at the cost of
higher complexity.  When used for a single iteration, sufficient information
from the secondary eigenvectors is retained in the form of the original
random initial conditions to avoid problems.  It should also be noted that
even without the context vector training, useful performance can be
obtained.  These considerations make it clear that random indexing and
context vector techniques should be considered as an alternative formulation
of LSA and other SVD systems.
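
The collapse described above can be illustrated with a small power-iteration
sketch.  The 4x4 symmetric matrix stands in for a co-occurrence matrix and is
an arbitrary assumption, as are the function names and the choice of six
random starting vectors.  After one multiplication the vectors still differ
from each other; after many they all line up with the dominant eigenvector.

```python
import itertools
import math
import random

# Toy symmetric "co-occurrence" matrix (assumed values, for illustration).
A = [
    [4.0, 1.0, 0.0, 0.0],
    [1.0, 3.0, 1.0, 0.0],
    [0.0, 1.0, 2.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]


def normalise(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def multiply(matrix, v):
    return [sum(a * x for a, x in zip(row, v)) for row in matrix]


def iterate(matrix, v, steps):
    """Apply `steps` rounds of multiply-then-normalise (power iteration)."""
    for _ in range(steps):
        v = normalise(multiply(matrix, v))
    return v


def mean_abs_cosine(vectors):
    """Average |cosine| over all pairs; inputs are assumed unit length."""
    pairs = list(itertools.combinations(vectors, 2))
    total = sum(abs(sum(x * y for x, y in zip(a, b))) for a, b in pairs)
    return total / len(pairs)


rng = random.Random(7)
starts = [normalise([rng.gauss(0.0, 1.0) for _ in range(4)]) for _ in range(6)]

overlap_after_1 = mean_abs_cosine([iterate(A, v, 1) for v in starts])
overlap_after_50 = mean_abs_cosine([iterate(A, v, 50) for v in starts])
print(overlap_after_1, overlap_after_50)
```

A mean absolute cosine close to 1 means the vectors have become
indistinguishable up to sign, which is why only a few iterations are usable
without something like Lanczos to keep the secondary directions apart.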

There are also close connections with Bayesian techniques such as LDA or
MDCA.  Buntine and Jakulin had an interesting article on that where they
presented an ontology of matrix decomposition techniques.  Random indexing
fits nicely as a sub-category of LSA.

In general, SVD-related techniques like Random Indexing can have slightly
better recall in some situations, but generally this difference is difficult
to detect.  The old MatchPlus system from HNC was competitive with the best
retrieval systems, but was never superior.

Here are some references that you may find interesting:

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.87.7893&rep=rep1&type=pdf
http://www.google.com/patents?hl=en&lr=&vid=USPAT5619709&id=4kkhAAAAEBAJ&oi=fnd&dq=William+Caid
http://www.google.com/patents?hl=en&lr=&vid=USPAT5794178&id=kZogAAAAEBAJ&oi=fnd&dq=William+Caid
http://www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6VC8-3YMFVB3-1B&_user=7971165&_rdoc=1&_fmt=&_orig=search&_sort=d&_docanchor=&view=c&_searchStrId=938841321&_rerunOrigin=scholar.google&_acct=C000050221&_version=1&_urlVersion=0&_userid=7971165&md5=0ac86651fa508bb9b4157b382f281177
http://portal.acm.org/citation.cfm?id=146565.146569
http://www.google.com/patents?hl=en&lr=&vid=USPATAPP10868538&id=L6yfAAAAEBAJ&oi=fnd&dq=William+Caid
http://www.google.com/patents?hl=en&lr=&vid=USPAT6134532&id=J2kGAAAAEBAJ&oi=fnd&dq=William+Caid
http://spiedl.aip.org/getabs/servlet/GetabsServlet?prog=normal&id=PSISDG002606000001000372000001&idtype=cvips&gifs=yes




On Wed, Jun 24, 2009 at 9:40 AM, Paul Jones <paul_jonez99@yahoo.co.uk> wrote:

> Had a look at it some time ago, but admittedly skimmed over it. Just read it
> again, looks good, allows dimension reduction with ease, and hence looks
> scalable.
>
> tks
>
> Paul
>
>
>
>
> ________________________________
> From: Grant Ingersoll <gsingers@apache.org>
> To: mahout-user@lucene.apache.org
> Sent: Wednesday, 24 June, 2009 12:34:46
> Subject: Re: mahout PLSI (with some lucene, thrown in)
>
> Random FYI: http://code.google.com/p/semanticvectors/ came up on the
> Lucene mailing list yesterday and it sounds interesting, plus BSD license...
>
> -Grant
>
> On Jun 23, 2009, at 7:56 PM, Paul Jones wrote:
>
> > Yup, I see that wordnet has also been "ported" to a lucene index, and
> > hence pulling the hyponyms works great.
> >
> > tks
> >
> > Paul
> >
> >
> >
> >
> > ________________________________
> > From: Tommy Chheng <tommy@peoplejar.com>
> > To: mahout-user@lucene.apache.org
> > Sent: Tuesday, 23 June, 2009 23:19:25
> > Subject: Re: mahout PLSI (with some lucene, thrown in)
> >
> > Have you looked at WordNet to get the hyponyms?
> >
> > Tommy
> >
> > On Jun 23, 2009, at 3:09 PM, Paul Jones wrote:
> >
> >> Okay, have seen the difficulty (apart from the maths :-)).
> >>
> >> I guess "similar" can mean many things, i.e. hyponyms, but words
> >> such as hot...cold are also "related". Hence, to solve my little problem,
> >> I am wondering if there is an easier way, i.e. to use existing hyponym
> >> relations (WordNet and the like), and/or, if they do not exist, then I
> >> guess using something similar to a "google distance measure" may help in
> >> "adding" new words to the system....
> >>
> >> Paul
> >>
> >>
> >>
> >>
> >> ________________________________
> >> From: Ted Dunning <ted.dunning@gmail.com>
> >> To: mahout-user@lucene.apache.org
> >> Sent: Tuesday, 23 June, 2009 18:00:12
> >> Subject: Re: mahout PLSI (with some lucene, thrown in)
> >>
> >> Yes.  This can be done.  It isn't necessarily real simple to do.
> >>
> >> See http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.56.7275 for an
> >> old (but still pretty good) example.
> >>
> >> On Tue, Jun 23, 2009 at 6:45 AM, Paul Jones <paul_jonez99@yahoo.co.uk> wrote:
> >>
> >>> Imagine we have crawled 100K webpages, and we have 100 pages which show
> >>> "red" and 100 which show "crimson" and then 100 which show both "red and
> >>> crimson". Is there a way to deduce that there may be an (albeit weak)
> >>> relationship between red AND crimson? Of course we can pre-seed this info,
> >>> which then gets weighted by actual results.
> >>>
> >>
> >>
> >>
> >
> >
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
> Solr/Lucene:
> http://www.lucidimagination.com/search
>
>
>
>



-- 
Ted Dunning, CTO
DeepDyve

111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
http://www.deepdyve.com
858-414-0013 (m)
408-773-0220 (fax)
