mahout-user mailing list archives

From Jake Mannix <jake.man...@gmail.com>
Subject Re: Using mahout to cluster terms in Lucene
Date Wed, 30 Sep 2009 08:43:09 GMT
It's not necessarily the case that if the nearest point to pointA in a
collection of points is pointB, then the nearest point to pointB is pointA,
right?  Even in one dimension, if your three points are {0, 1, 1.1}, the
nearest point to 0 is 1, but the nearest point to 1 is 1.1.
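
To make the one-dimensional example concrete, here is a minimal Python sketch. The helper name `nearest` is made up for illustration and is not part of Mahout or Lucene; it just does a brute-force scan using absolute difference as the distance.

```python
def nearest(point, points):
    """Return the member of `points` closest to `point`, excluding `point` itself."""
    others = [p for p in points if p != point]
    return min(others, key=lambda p: abs(p - point))


points = [0.0, 1.0, 1.1]

nn_of_0 = nearest(0.0, points)  # nearest neighbour of 0
nn_of_1 = nearest(1.0, points)  # nearest neighbour of 1

# The distance itself is symmetric, but the "nearest neighbour" relation is not:
# 0 picks 1, while 1 picks 1.1 rather than 0.
print(nn_of_0, nn_of_1)
```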

I'm not sure if this invalidates your desire to have some sort of
conceptual hierarchy in your clustering, but just because metrics
are symmetric doesn't mean that iterating nearest(nearest(...(A)...))
repeats quickly (it doesn't even need to converge).
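
As a rough illustration of iterating the nearest-neighbour step, the sketch below follows the chain starting from 0 on the same three points. The names `nearest` and `chain` are hypothetical helpers, and this only shows what happens for this one small point set; it says nothing about the general case.

```python
def nearest(point, points):
    """Return the member of `points` closest to `point`, excluding `point` itself."""
    others = [p for p in points if p != point]
    return min(others, key=lambda p: abs(p - point))


def chain(start, points, steps):
    """Apply `nearest` repeatedly, recording every point visited."""
    path = [start]
    for _ in range(steps):
        path.append(nearest(path[-1], points))
    return path


path = chain(0.0, [0.0, 1.0, 1.1], 4)
print(path)  # [0.0, 1.0, 1.1, 1.0, 1.1]
```

For these points the chain never comes back to 0; it alternates between 1 and 1.1 instead.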

  -jake

On Wed, Sep 30, 2009 at 12:42 AM, Shashikant Kore <shashikant@gmail.com> wrote:

> Ted,
>
> Some time back I had thought about this idea. But, I sensed one
> potential problem with this approach. The resulting co-occurrence will
> be bi-directional. For documents this property is fine, but for terms,
> it may not be desirable in some cases.
>
> For example, if "Roger Federer" is the keyword, the co-occurring terms
> will be "Tennis", "Grand slam", "Wimbledon", etc. But, for "Tennis",
> the list of top co-occurring terms may not include "Roger Federer."
>
> Is there a way to identify the directional relationship among terms?
>
> Of course, this was just a thought and no real code was written to
> verify the assertion.
>
> --shashi
>
> On Wed, Sep 30, 2009 at 2:43 AM, Ted Dunning <ted.dunning@gmail.com>
> wrote:
> > Another way to do this through the back door is to transpose the document
> > set so that you have a list of documents for each term.  Index this and
> > cluster it just as if it were normal documents and you will have a form of
> > term clustering.
> >
> > On Tue, Sep 29, 2009 at 1:05 PM, Grant Ingersoll <gsingers@apache.org> wrote:
> >
> >> The LDA implementation kind of clusters on terms to generate topics.  It
> >> sounds like you want some co-occurrence analysis; I'm not sure that the
> >> clustering algorithms are best for that, but perhaps others have insight.
> >> I could imagine doing this with HBase or Pig and just keeping a matrix
> >> where each cell kept track of the number of times both terms appear in a
> >> document (or even within some window in a document).
> >>
> >>
> >>
> >> On Sep 29, 2009, at 8:57 AM, Ole-Martin Mørk wrote:
> >>
> >>> Hi.
> >>> I have been using org.apache.mahout.utils.vectors.lucene.Driver
> >>> and org.apache.mahout.clustering.kmeans.KMeansDriver to cluster documents
> >>> in our Lucene index and it works great! I am wondering though, is it
> >>> possible to use Mahout to cluster terms?
> >>>
> >>> I want to cluster terms that often appear in the same documents.
> >>>
> >>> Thank you.
> >>>
> >>> --
> >>> Ole-Martin Mørk
> >>> http://twitter.com/olemartin
> >>> http://flickr.com/olemartin
> >>>
> >>
> >> --------------------------
> >> Grant Ingersoll
> >> http://www.lucidimagination.com/
> >>
> >> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
> >> Solr/Lucene:
> >> http://www.lucidimagination.com/search
> >>
> >>
> >
> >
> > --
> > Ted Dunning, CTO
> > DeepDyve
> >
>
