mahout-user mailing list archives

From Kris Jack <mrkrisj...@gmail.com>
Subject Re: Generating a Document Similarity Matrix
Date Wed, 09 Jun 2010 17:15:44 GMT
Hi Sebastian,

Thanks for the reference.  I had a look through the paper and it's certainly
very relevant to the problem that I'm trying to solve.  Do you think the CF
functionality could be co-opted to output such document similarities as it
stands or will it require modification?  If it can be used straight off, say
to give the top 25 most related documents for each document, then how would
you suggest that I go about this?

Thanks,
Kris
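
For reference, the core idea of the Elsayed et al. paper cited below (accumulate pairwise scores from an inverted index, one term at a time) can be sketched in a few lines of plain Python. This is only an illustrative, single-machine sketch, not Mahout's API and not the paper's MapReduce implementation. The function name top_k_similar, the dict-of-term-weights input format, and the unit-length normalisation (so that the score is a cosine similarity) are all assumptions made for the example.

```python
from collections import defaultdict
import heapq
import math


def top_k_similar(docs, k=25):
    """docs maps doc_id -> {term: weight}. Returns doc_id -> [(other_id, score), ...].

    Hypothetical helper for illustration only; not part of Mahout or Lucene.
    """
    # Assumption: scale each vector to unit length so the dot product is a cosine.
    normed = {}
    for doc_id, vec in docs.items():
        norm = math.sqrt(sum(w * w for w in vec.values()))
        normed[doc_id] = {t: w / norm for t, w in vec.items()} if norm else {}

    # Step 1: build an inverted index, term -> list of (doc_id, weight).
    postings = defaultdict(list)
    for doc_id, vec in normed.items():
        for term, weight in vec.items():
            postings[term].append((doc_id, weight))

    # Step 2: every pair of documents sharing a term contributes the product
    # of their weights for that term; summing over terms gives the dot product.
    scores = defaultdict(float)
    for plist in postings.values():
        for i in range(len(plist)):
            for j in range(i + 1, len(plist)):
                (a, wa), (b, wb) = plist[i], plist[j]
                scores[(a, b)] += wa * wb

    # Step 3: keep only the k highest-scoring neighbours for each document.
    neighbours = defaultdict(list)
    for (a, b), score in scores.items():
        neighbours[a].append((b, score))
        neighbours[b].append((a, score))
    return {d: heapq.nlargest(k, neighbours[d], key=lambda p: p[1]) for d in docs}


if __name__ == "__main__":
    example = {
        "d1": {"mahout": 1.0, "hadoop": 1.0},
        "d2": {"mahout": 1.0, "hadoop": 1.0},
        "d3": {"lucene": 1.0, "hadoop": 1.0},
    }
    for doc_id, related in top_k_similar(example, k=2).items():
        print(doc_id, related)
```

The pair loop in step 2 is quadratic in the length of each postings list, so very common terms dominate the cost; the paper addresses this by dropping the most frequent terms. The term weights themselves could be exported from an existing Lucene index, though how to do that is outside this sketch.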



2010/6/8 Sebastian Schelter <ssc.open@googlemail.com>

> Hi Kris,
>
> Actually, the code that computes the item-to-item similarities in the
> collaborative filtering part of Mahout (which at first glance seems to be
> a totally different problem from yours) is based on a paper that deals with
> computing the pairwise similarity of text documents in a very simple way.
> Maybe that could be helpful to you:
>
> Elsayed et al: Pairwise Document Similarity in Large Collections with
> MapReduce
>
> http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf
>
> -sebastian
>
>
> 2010/6/8 Kris Jack <mrkrisjack@gmail.com>
>
> > Hi everyone,
> >
> > I currently use lucene's moreLikeThis function through solr to find
> > documents that are related to one another.  A single call, however, takes
> > around 4 seconds to complete and I would like to reduce this.  I got to
> > thinking that I might be able to use Mahout to generate a document
> > similarity matrix offline that could then be looked-up in real time for
> > serving.  Is this a reasonable use of Mahout?  If so, what functions will
> > generate a document similarity matrix?  Also, I would like to be able to
> > keep the text processing advantages provided through lucene so it would
> > help if I could still use my lucene index.  If not, then could you
> > recommend any alternative solutions please?
> >
> > Many thanks,
> > Kris
> >
>



-- 
Dr Kris Jack,
http://www.mendeley.com/profiles/kris-jack/
