mahout-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Kris Jack <mrkrisj...@gmail.com>
Subject Re: Generating a Document Similarity Matrix
Date Tue, 08 Jun 2010 15:11:47 GMT
Hi Olivier,

Thanks for your suggestions.  I have over 10 million documents and they have
quite a lot of meta-data associated with them including rather large text
fields.  It is possible to tweak the moreLikeThis function from solr.  I
have tried changing the parameters (http://wiki.apache.org/solr/MoreLikeThis)
but am not managing to get results in under 300ms without sacrificing the
quality of the results too much.

I suspect that there would be gains to be made from reducing the
dimensionality of the feature vectors before indexing with lucene so I may
give that a try.  I'll keep you posted if I come up with other solutions.

Thanks,
Kris



2010/6/8 Olivier Grisel <olivier.grisel@ensta.org>

> 2010/6/8 Kris Jack <mrkrisjack@gmail.com>:
> > Hi everyone,
> >
> > I currently use lucene's moreLikeThis function through solr to find
> > documents that are related to one another.  A single call, however, takes
> > around 4 seconds to complete and I would like to reduce this.  I got to
> > thinking that I might be able to use Mahout to generate a document
> > similarity matrix offline that could then be looked-up in real time for
> > serving.  Is this a reasonable use of Mahout?  If so, what functions will
> > generate a document similarity matrix?  Also, I would like to be able to
> > keep the text processing advantages provided through lucene so it would
> help
> > if I could still use my lucene index.  If not, then could you recommend
> any
> > alternative solutions please?
>
> How many documents do you have in your index? Have you tried to tweak
> the MoreLikeThis parameters ? (I don't know if it's possible using the
> solr interface, I use it directly using the lucene java API)
>
> For instance you can trade off recall for speed by decreasing the
> number of terms to use in the query and trade recall for precision and
> speed by increasing the percentage of terms that should match.
>
> You could also use Mahout implementation of SVD to build low
> dimensional semantic vectors representing your documents (a.k.a.
> Latent Semantic Indexing) and then index those transformed frequency
> vectors in a dedicated lucene index (or document field provided you
> name the resulting terms with something that does not match real life
> terms present in other). However using standard SVD will probably
> result in dense (as opposed to sparse) low dimensional semantic
> vectors. I don't think lucene's lookup performance is good with dense
> frequency vectors even though the number of terms is greatly reduced
> by SVD. Hence it would probably be better to either threshold the top
> 100 absolute values of each semantic vectors before indexing (probably
> the simpler solution) or using a sparsifying penalty contrained
> variant of SVD / LSI. You should have a look at the literature on
> sparse coding or sparse dictionary learning, Sparse-PCA and more
> generally L1 penalty regression methods such as the Lasso and LARS. I
> don't know about any library  for sparse semantic coding of document
> that works automatically with lucene. Probably some non trivial coding
> is needed there.
>
> Another alternative is finding low dimensional (64 or 32 components)
> dense codes and then binary thresholding then and store integer code
> in the DB or the lucene index and then build smart exact match queries
> to find all document lying in the hamming ball of size 1 or 2 of the
> reference document's binary code. But I think this approach while
> promising for web scale document collections is even more experimental
> and requires very good code low dim encoders (I don't think linear
> models such as SVD are good enough for reducing sparse 10e6 components
> vectors to dense 64 components vectors, non linear encoders such as
> Stacked Restricted Boltzmann Machines are probably a better choice).
>
> In any case let us know about your results, I am really interested on
> practical yet scalable solutions to this problem.
>
> --
> Olivier
> http://twitter.com/ogrisel - http://github.com/ogrisel
>



-- 
Dr Kris Jack,
http://www.mendeley.com/profiles/kris-jack/

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message