mahout-user mailing list archives

From Jake Mannix <jake.man...@gmail.com>
Subject Re: Generating a Document Similarity Matrix
Date Tue, 08 Jun 2010 23:33:14 GMT
Ah yes.  I would love for us to have an implementation of that pairwise
similarity code.  It would be useful for lots of things in Mahout, yes!

  -jake

On Tue, Jun 8, 2010 at 4:21 PM, Sebastian Schelter
<ssc.open@googlemail.com> wrote:

> I did not mean to say that you can use the item-item similarity code from
> CF for computing the document similarities. I just wanted to point out that
> these problems are closely related, and that the paper the CF code is based
> on deals with the computation of pairwise document similarities and could
> therefore be helpful.
>
> -sebastian
>
> 2010/6/9 Jake Mannix <jake.mannix@gmail.com>
>
> > The code in mahout CF is doing that?  I don't think that's right, we don't
> > do anything that fancy right now, do we Sean?
> >
> >  -jake
> >
> > On Tue, Jun 8, 2010 at 3:39 PM, Sebastian Schelter
> > <ssc.open@googlemail.com> wrote:
> >
> > > Hi Kris,
> > >
> > > actually the code to compute the item-to-item similarities in the
> > > collaborative filtering part of mahout (which at first glance seems to be
> > > a totally different problem from yours) is based on a paper that deals with
> > > computing the pairwise similarity of text documents in a very simple way.
> > > Maybe that could be helpful to you:
> > >
> > > Elsayed et al: Pairwise Document Similarity in Large Collections with
> > > MapReduce
> > >
> > > http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf
> > >
> > > -sebastian
> > >
> > >
> > > 2010/6/8 Kris Jack <mrkrisjack@gmail.com>
> > >
> > > > Hi everyone,
> > > >
> > > > I currently use lucene's moreLikeThis function through solr to find
> > > > documents that are related to one another.  A single call, however, takes
> > > > around 4 seconds to complete and I would like to reduce this.  I got to
> > > > thinking that I might be able to use Mahout to generate a document
> > > > similarity matrix offline that could then be looked up in real time for
> > > > serving.  Is this a reasonable use of Mahout?  If so, what functions will
> > > > generate a document similarity matrix?  Also, I would like to be able to
> > > > keep the text processing advantages provided through lucene so it would
> > > > help if I could still use my lucene index.  If not, then could you
> > > > recommend any alternative solutions please?
> > > >
> > > > Many thanks,
> > > > Kris
> > > >
> > >
> >
>
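
For readers following the thread, the idea in the Elsayed et al. paper cited above can be sketched roughly as follows: build an inverted index of term weights, let every postings list contribute one partial product per document pair, and sum those products per pair. This is only a single-process Python illustration, not Mahout code and not a MapReduce job. The function names, the input layout, and the final top-k lookup table (one possible way to serve precomputed similarities, as Kris asks about) are all assumptions made for the example. The term weights are taken as given, for instance tf-idf values exported from an existing index.

```python
from collections import defaultdict
from itertools import combinations


def build_postings(doc_weights):
    """Invert {doc_id: {term: weight}} into {term: [(doc_id, weight), ...]}."""
    postings = defaultdict(list)
    for doc_id, weights in doc_weights.items():
        for term, weight in weights.items():
            postings[term].append((doc_id, weight))
    return postings


def pairwise_similarity(doc_weights):
    """Sum per-term weight products over every document pair sharing a term."""
    sims = defaultdict(float)
    for entries in build_postings(doc_weights).values():
        # "Map" step: each postings list yields one partial product per pair.
        # Sorting by doc_id keeps the pair key in a consistent order.
        for (doc_a, w_a), (doc_b, w_b) in combinations(sorted(entries), 2):
            # "Reduce" step: partial products for the same pair are summed.
            sims[(doc_a, doc_b)] += w_a * w_b
    return dict(sims)


def top_k_table(sims, k):
    """Hypothetical serving structure: {doc_id: [(neighbour, score), ...]}."""
    neighbours = defaultdict(list)
    for (doc_a, doc_b), score in sims.items():
        neighbours[doc_a].append((doc_b, score))
        neighbours[doc_b].append((doc_a, score))
    return {
        doc: sorted(items, key=lambda item: -item[1])[:k]
        for doc, items in neighbours.items()
    }


if __name__ == "__main__":
    # Made-up weights purely for illustration.
    docs = {
        "d1": {"mahout": 1.0, "lucene": 2.0},
        "d2": {"mahout": 3.0},
        "d3": {"lucene": 0.5, "solr": 1.0},
    }
    table = top_k_table(pairwise_similarity(docs), k=1)
    print(table["d1"])
```

Only pairs that share at least one term ever get a score, so the work depends on the lengths of the postings lists rather than on the square of the collection size. Very common terms produce very long postings lists, so a real job would likely need to prune or cap them.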
