mahout-user mailing list archives

From David Starina <david.star...@gmail.com>
Subject Re: Document similarity
Date Tue, 23 Feb 2016 21:01:52 GMT
Guys, one more question ... are there any incremental methods to do this?
I don't want to run the whole job again every time a new document is added.
In the case of LDA ... I guess the best way is to infer the topics of the
new document using the topics from the previous LDA run, and then every
once in a while recalculate the topics over all documents, including the
new ones?
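One way to sketch that "fold the new document in against the old topics" idea is a small fixed-topics inference step: keep the topic-word matrix from the previous LDA run frozen and estimate only the new document's topic mixture. The topic-word matrix and document below are made-up toy values, not the output of a real LDA run:

```python
import numpy as np

def infer_topic_mixture(word_counts, topic_word, iters=50):
    """Infer a new document's topic mixture theta against a FIXED
    topic-word matrix: a simple EM folding-in step, usable when a
    document arrives after the main LDA job has already run."""
    n_topics = topic_word.shape[0]
    theta = np.full(n_topics, 1.0 / n_topics)
    for _ in range(iters):
        # E-step: responsibility of each topic for each word
        phi = theta[:, None] * topic_word              # shape (K, V)
        phi /= phi.sum(axis=0, keepdims=True) + 1e-12
        # M-step: re-estimate the document's topic proportions
        theta = (phi * word_counts).sum(axis=1)
        theta /= theta.sum()
    return theta

# Hypothetical 2-topic, 4-word model: topic 0 favors words 0-1,
# topic 1 favors words 2-3.
topic_word = np.array([[0.45, 0.45, 0.05, 0.05],
                       [0.05, 0.05, 0.45, 0.45]])
new_doc = np.array([5.0, 4.0, 0.0, 1.0])   # mostly topic-0 words
theta = infer_topic_mixture(new_doc, topic_word)
```

Periodically rerunning full LDA over the grown corpus, as the question suggests, then replaces the frozen topics with fresh ones.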

On Sun, Feb 14, 2016 at 10:02 PM, Pat Ferrel <pat@occamsmachete.com> wrote:

> Something we are working on for purely content-based similarity is using a
> KNN engine (search engine) but creating features from word2vec and an NER
> (Named Entity Recognizer).
>
> Putting the generated features into fields of a doc can really help with
> similarity because w2v and NER create semantic features. You can also try
> n-grams or skip-grams. These features are not very helpful for search, but
> for similarity they work well.
>
> The query to the KNN engine is a document, each field mapped to the
> corresponding field of the index. The result is the k nearest neighbors to
> the query doc.
>
>
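A minimal sketch of the doc-as-query KNN idea above, with hand-made stand-in "word2vec" vectors (a real setup would use a trained model and a search-engine index rather than in-memory cosine scoring):

```python
import numpy as np

# Hypothetical pre-trained word vectors standing in for word2vec output.
word_vecs = {
    "search": np.array([0.9, 0.1, 0.0]),
    "engine": np.array([0.8, 0.2, 0.1]),
    "topic":  np.array([0.1, 0.9, 0.2]),
    "model":  np.array([0.2, 0.8, 0.1]),
}

def doc_vector(tokens):
    """Average the word vectors of a document's tokens: one simple way
    to turn w2v features into a document-level feature."""
    vecs = [word_vecs[t] for t in tokens if t in word_vecs]
    return np.mean(vecs, axis=0)

def knn(query_tokens, corpus, k=2):
    """Return the k nearest docs to the query doc by cosine similarity."""
    q = doc_vector(query_tokens)
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    scored = [(name, cos(q, doc_vector(toks))) for name, toks in corpus.items()]
    return sorted(scored, key=lambda x: -x[1])[:k]

corpus = {
    "d1": ["search", "engine"],
    "d2": ["topic", "model"],
    "d3": ["engine", "model"],
}
neighbors = knn(["search", "engine"], corpus, k=2)
```

In the search-engine version, the same averaged (or per-field) features would be indexed, and the query document's features mapped field-to-field, as described above.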
> > On Feb 14, 2016, at 11:05 AM, David Starina <david.starina@gmail.com>
> > wrote:
> >
> > Charles, thank you, I will check that out.
> >
> > Ted, I am looking for semantic similarity. Unfortunately, I do not have
> > any data on the usage of the documents (if by usage you mean user
> > behavior).
> >
> > On Sun, Feb 14, 2016 at 4:04 PM, Ted Dunning <ted.dunning@gmail.com>
> > wrote:
> >
> >> Did you want textual similarity?
> >>
> >> Or semantic similarity?
> >>
> >> The actual semantics of a message can be opaque from the content, but
> >> clear from the usage.
> >>
> >>
> >>
> >> On Sun, Feb 14, 2016 at 5:29 AM, Charles Earl <charlescearl@me.com>
> >> wrote:
> >>
> >>> David,
> >>> LDA or LSI can work quite nicely for similarity (YMMV of course,
> >>> depending on the characterization of your documents).
> >>> You basically use the dot product of the square roots of the vectors
> >>> for LDA -- a search for Hellinger or Bhattacharyya distance will lead
> >>> you to a good similarity or distance measure.
> >>> As I recall, Spark does provide an LDA implementation. Gensim provides
> >>> an API for doing LDA similarity out of the box. Vowpal Wabbit is also
> >>> worth looking at, particularly for a large dataset.
> >>> Hope this is useful.
> >>> Cheers
> >>>
> >>> Sent from my iPhone
> >>>
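The "dot product of the square roots" suggestion above can be sketched directly: the Bhattacharyya coefficient is the dot product of the square-rooted distributions, and the Hellinger distance follows from it. The topic distributions here are hypothetical:

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete distributions.
    The Bhattacharyya coefficient BC = sum(sqrt(p * q)) is the dot
    product of the square-rooted vectors; Hellinger distance is
    sqrt(1 - BC), which lies in [0, 1]."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    bc = np.sum(np.sqrt(p * q))          # Bhattacharyya coefficient
    return float(np.sqrt(max(0.0, 1.0 - bc)))

# Hypothetical doc-topic distributions from an LDA run.
p = [0.7, 0.2, 0.1]
q = [0.6, 0.3, 0.1]
d_close = hellinger(p, q)                # similar mixtures, small distance
d_far   = hellinger(p, [0.1, 0.2, 0.7])  # different mixtures, larger distance
```

Smaller distances mean more similar documents, so the N nearest documents are the N smallest Hellinger distances.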
> >>>> On Feb 14, 2016, at 8:14 AM, David Starina <david.starina@gmail.com>
> >>>> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> I need to build a system to determine the N (e.g. 10) most similar
> >>>> documents to a given document. I have some (theoretical) knowledge of
> >>>> Mahout algorithms, but not enough to build the system. Can you give
> >>>> me some suggestions?
> >>>>
> >>>> At first I was researching Latent Semantic Analysis for the task, but
> >>>> since Mahout doesn't support it, I started researching some other
> >>>> options. I got a hint that instead of LSA, you can use LDA (Latent
> >>>> Dirichlet allocation) in Mahout to achieve similar and even better
> >>>> results.
> >>>>
> >>>> However -- and this is where I got confused -- LDA is a clustering
> >>>> algorithm. What I need is not to cluster the documents into N
> >>>> clusters, but to get a matrix (similar to TF-IDF) from which I can
> >>>> calculate some sort of distance between any two documents, so I can
> >>>> find the N most similar documents for any given document.
> >>>>
> >>>> How do I achieve that? My idea was (still mostly theoretical, since I
> >>>> have some problems with running the LDA algorithm) to extract some
> >>>> number of topics with LDA -- but not to cluster the documents with the
> >>>> help of these topics, rather to get a matrix with documents as one
> >>>> dimension and topics as the other dimension. I was guessing I could
> >>>> then use this matrix as an input to a row-similarity algorithm.
> >>>>
> >>>> Is this the correct concept? Or am I missing something?
> >>>>
> >>>> And, since LDA is not supported on Spark/Samsara, how could I achieve
> >>>> similar results on Spark?
> >>>>
> >>>>
> >>>> Thanks in advance,
> >>>> David
> >>>
> >>
>
>
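The doc-topic matrix plus row-similarity idea from the original question can be sketched as cosine similarity over the rows of a document-topic matrix, the same way a row-similarity job treats rows of a TF-IDF matrix. The matrix here is hypothetical:

```python
import numpy as np

# Hypothetical document-topic matrix from an LDA run: one row per
# document, one column per topic.
doc_topic = np.array([
    [0.8, 0.1, 0.1],   # doc 0
    [0.7, 0.2, 0.1],   # doc 1 (close to doc 0)
    [0.1, 0.1, 0.8],   # doc 2
])

def top_n_similar(matrix, doc_id, n=1):
    """Row similarity: the n rows most similar to row doc_id by
    cosine similarity, excluding the document itself."""
    norms = np.linalg.norm(matrix, axis=1)
    sims = matrix @ matrix[doc_id] / (norms * norms[doc_id])
    order = [int(i) for i in np.argsort(-sims) if i != doc_id]
    return order[:n]

similar_to_0 = top_n_similar(doc_topic, 0, n=2)
```

On Spark the same shape of computation is what a row-similarity job over the doc-topic matrix would produce at scale.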
