From: Ted Dunning <ted.dunning@gmail.com>
Date: Wed, 24 Jun 2009 10:40:28 -0700
Subject: Re: mahout PLSI (with some lucene, thrown in)
To: mahout-user@lucene.apache.org, lucene-user@lucene.apache.org

I just read the introduction paper and was pleased to see your reference to
Robert Hecht-Nielsen's work. They omitted, however, a large body of work that
predated their other references by nearly a decade. The algorithm presented is
essentially identical to the so-called one-step learning that derived from
early work at HNC Software and was refined during my tenure as Chief Scientist
at Aptex.

The only important difference between Random Indexing and our earlier work
relates to the domain of the original vectors. In our case, we mostly used
vectors sampled from a multi-dimensional unit normal distribution; in Random
Indexing, they use ternary or binary vectors. We also experimented with binary
vectors, but the hardware of the time favored the continuous representation,
so we focused on that formulation.
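For anyone who has not run into the technique, here is a minimal sketch of the
idea in Python/NumPy. It is purely illustrative (the dimensionality, window
size, and all names are made up, and it is not code from semanticvectors,
Mahout, or the old HNC system): every term gets a fixed random index vector,
and a term's context vector is simply the accumulated sum of the index vectors
of the terms it co-occurs with.

```python
import numpy as np
from collections import defaultdict

DIM = 256                               # reduced dimensionality, fixed up front
rng = np.random.default_rng(42)

# One fixed random "index" vector per term.  Gaussian here, as in the old
# HNC-style context vectors; Random Indexing would use sparse ternary vectors.
index_vectors = defaultdict(lambda: rng.standard_normal(DIM))

# Context vectors start at zero and accumulate co-occurring index vectors.
context_vectors = defaultdict(lambda: np.zeros(DIM))

def train(documents, window=2):
    """Single pass ("one-step learning"): add each neighbor's index vector."""
    for doc in documents:
        tokens = doc.lower().split()
        for i, term in enumerate(tokens):
            start, stop = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    context_vectors[term] += index_vectors[tokens[j]]

def similarity(a, b):
    """Cosine similarity between the context vectors of two terms."""
    va, vb = context_vectors[a], context_vectors[b]
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb) + 1e-12))

train(["the red car", "the crimson car", "red and crimson paint"])
print(similarity("red", "crimson"))     # terms sharing contexts score higher
```

Swapping the Gaussian index vectors for sparse ternary ones is essentially the
only change needed to get the Random Indexing formulation.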
Also, the algorithm presented is essentially one iteration of a power-method
extraction of singular vectors. As presented, this algorithm cannot be used
for more than 2-3 iterations because it collapses onto the dominant
eigenvectors. Lanczos gave an algorithm that avoids this at the cost of higher
complexity. When used for a single iteration, sufficient information from the
secondary eigenvectors is retained, in the form of the original random initial
conditions, to avoid problems.
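To make the collapse concrete, here is a tiny NumPy experiment. It is only a
sketch with a synthetic matrix and made-up sizes, not anyone's production
code: plant one strong latent direction in a noisy matrix, start from a few
random vectors, and repeated multiplication by A A^T drives every one of them
onto the dominant singular vector within a couple of iterations, while a
single multiplication leaves them spread across the secondary directions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "term x document" matrix with one strong planted topic.
u = rng.standard_normal(200)
u /= np.linalg.norm(u)
v = rng.standard_normal(1000)
v /= np.linalg.norm(v)
A = 200.0 * np.outer(u, v) + rng.standard_normal((200, 1000))

# A handful of random starting directions, as in a one-step random projection.
R = rng.standard_normal((200, 5))

def cosines_with_u(M):
    """|cosine| of each column of M against the planted dominant direction u."""
    return np.abs(u @ (M / np.linalg.norm(M, axis=0)))

print("start ", np.round(cosines_with_u(R), 3))
for it in range(1, 4):
    R = A @ (A.T @ R)                   # one power-method step: R <- (A A^T) R
    print("iter", it, np.round(cosines_with_u(R), 3))
# Within two or three iterations every column is essentially u: the random
# directions collapse onto the dominant singular vector, which is why this
# trick is only safe for a single pass.
```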
It should also be noted that even without the context vector training, useful
performance can be obtained. These considerations make it clear that random
indexing and context vector techniques should be considered as an alternative
formulation of LSA and other SVD systems.

There are also close connections with Bayesian techniques such as LDA or MDCA.
Buntine and Jakulin had an interesting article on that where they presented an
ontology of matrix decomposition techniques. Random indexing fits nicely as a
sub-category of LSA.

In general, SVD-related techniques like Random Indexing can have slightly
better recall in some situations, but generally this difference is difficult
to detect. The old MatchPlus system from HNC was competitive with the best
retrieval systems, but was never superior.

Here are some references that you may find interesting:

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.87.7893&rep=rep1&type=pdf
http://www.google.com/patents?hl=en&lr=&vid=USPAT5619709&id=4kkhAAAAEBAJ&oi=fnd&dq=William+Caid
http://www.google.com/patents?hl=en&lr=&vid=USPAT5794178&id=kZogAAAAEBAJ&oi=fnd&dq=William+Caid
http://www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6VC8-3YMFVB3-1B&_user=7971165&_rdoc=1&_fmt=&_orig=search&_sort=d&_docanchor=&view=c&_searchStrId=938841321&_rerunOrigin=scholar.google&_acct=C000050221&_version=1&_urlVersion=0&_userid=7971165&md5=0ac86651fa508bb9b4157b382f281177
http://portal.acm.org/citation.cfm?id=146565.146569
http://www.google.com/patents?hl=en&lr=&vid=USPATAPP10868538&id=L6yfAAAAEBAJ&oi=fnd&dq=William+Caid
http://www.google.com/patents?hl=en&lr=&vid=USPAT6134532&id=J2kGAAAAEBAJ&oi=fnd&dq=William+Caid
http://spiedl.aip.org/getabs/servlet/GetabsServlet?prog=normal&id=PSISDG002606000001000372000001&idtype=cvips&gifs=yes

On Wed, Jun 24, 2009 at 9:40 AM, Paul Jones wrote:

> Had a look at it sometime ago, but admittedly skimmed over it. Just read it
> again, looks good, allows dimension reduction with ease, and hence looks
> scalable.
>
> tks
>
> Paul
>
> ________________________________
> From: Grant Ingersoll
> To: mahout-user@lucene.apache.org
> Sent: Wednesday, 24 June, 2009 12:34:46
> Subject: Re: mahout PLSI (with some lucene, thrown in)
>
> Random FYI: http://code.google.com/p/semanticvectors/ came up on the
> Lucene mailing list yesterday and it sounds interesting, plus BSD license...
>
> -Grant
>
> On Jun 23, 2009, at 7:56 PM, Paul Jones wrote:
>
> > Yup, I see that wordnet has also been "ported" to a lucene index, and
> > hence pulling the hyponyms works great.
> >
> > tks
> >
> > Paul
> >
> > ________________________________
> > From: Tommy Chheng
> > To: mahout-user@lucene.apache.org
> > Sent: Tuesday, 23 June, 2009 23:19:25
> > Subject: Re: mahout PLSI (with some lucene, thrown in)
> >
> > Have you looked at WordNet to get the hyponyms?
> >
> > Tommy
> >
> > On Jun 23, 2009, at 3:09 PM, Paul Jones wrote:
> >
> >> Okay, have seen the difficulty (apart from the maths :-)).
> >>
> >> I guess "similar" can mean many things, i.e. hyponyms, but also words
> >> such as hot...cold are also "related", hence to solve my little problem
> >> I am wondering if there is an easier way, i.e. to use things like
> >> existing hyponym relations which exist (wordnet and the like), and/or
> >> if they do not then I guess using something similar to a "google
> >> distance measure" may help in "adding" new words to the system....
> >>
> >> Paul
> >>
> >> ________________________________
> >> From: Ted Dunning
> >> To: mahout-user@lucene.apache.org
> >> Sent: Tuesday, 23 June, 2009 18:00:12
> >> Subject: Re: mahout PLSI (with some lucene, thrown in)
> >>
> >> Yes. This can be done. It isn't necessarily real simple to do.
> >>
> >> See http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.56.7275 for
> >> an old (but still pretty good) example.
> >>
> >> On Tue, Jun 23, 2009 at 6:45 AM, Paul Jones wrote:
> >>
> >>> Imagine we have crawled 100K webpages, and we have 100 pages which show
> >>> "red" and 100 which show "crimson" and then 100 which show both "red
> >>> and crimson", is there a way to deduce that there may be an (albeit
> >>> weak) relationship between red AND crimson. Of course we can pre-seed
> >>> this info, which then gets weighted by actual results.
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
> Solr/Lucene:
> http://www.lucidimagination.com/search

--
Ted Dunning, CTO
DeepDyve
111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
http://www.deepdyve.com
858-414-0013 (m)
408-773-0220 (fax)