mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: AW: Incremental clustering
Date Thu, 12 May 2011 20:14:12 GMT
Using whatever you used originally would be best.  A map-reduce program will
be slow for small batches, of course.  I don't know if seq2sparse has an
efficient sequential mode.
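
For a handful of new documents, the sequential route can be sketched in a few lines. This is only an illustration of reusing an existing dictionary, not Mahout's API and not seq2sparse's exact weighting. The whitespace tokenizer, the in-memory `dictionary` and `doc_freq` maps, and the TF-IDF formula are all assumptions; in practice they would have to match whatever the original vectorization run produced.

```python
import math
from collections import Counter


def vectorize(text, dictionary, doc_freq, num_docs):
    """Return a sparse TF-IDF vector {term_index: weight} for one document.

    Terms missing from the existing dictionary are dropped, so the vector
    stays in the same space as the vectors used for the original clustering.
    """
    tokens = text.lower().split()  # assumption: simple whitespace tokenizer
    counts = Counter(t for t in tokens if t in dictionary)
    vector = {}
    for term, tf in counts.items():
        # Assumed smoothed IDF; the original run may have used another formula.
        idf = math.log(num_docs / (1 + doc_freq[term])) + 1.0
        vector[dictionary[term]] = tf * idf
    return vector


# Hypothetical dictionary and document frequencies from the original run.
dictionary = {"mahout": 0, "cluster": 1, "crawler": 2}
doc_freq = {"mahout": 10, "cluster": 40, "crawler": 5}

new_doc_vector = vectorize(
    "Mahout cluster cluster unknownterm", dictionary, doc_freq, num_docs=100
)
```

Dropping unknown terms keeps new vectors comparable with the old centroids, at the cost of ignoring new vocabulary until the next full run.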

On Thu, May 12, 2011 at 11:18 AM, Frank Scholten <frank@frankscholten.nl> wrote:

> What do you recommend for vectorizing the new docs? Run seq2sparse on
> a batch of them? Seems there's no code at the moment for quickly
> vectorizing a few new documents based on the existing dictionary.
>
> Frank
>
> On Thu, May 12, 2011 at 12:32 PM, Grant Ingersoll <gsingers@apache.org>
> wrote:
> > From what I've seen, using Mahout's existing clustering methods, I think
> > most people set up some schedule whereby they cluster the whole collection
> > on a regular basis, and all docs that come in the meantime are simply
> > assigned to the closest cluster until the next whole-collection iteration
> > is completed.  There are, of course, other variants one could do, such as
> > kicking off the whole clustering when some threshold of number of docs is
> > reached.
> >
> > There are other clustering methods, as Benson alluded to, that may better
> > support incremental approaches.
> >
> > On May 12, 2011, at 4:53 AM, David Saile wrote:
> >
> >> I am still stuck at this problem.
> >>
> >> Can anyone give me a heads-up on how existing systems handle this?
> >> If a collection of documents is modified, is the clustering recomputed
> >> from scratch each time?
> >> Or is there in fact any incremental way to handle an evolving set of
> >> documents?
> >>
> >> I would really appreciate any hint!
> >>
> >> Thanks,
> >> David
> >>
> >>
> >> On 09.05.2011 at 12:45, Ulrich Poppendieck wrote:
> >>
> >>> Not an answer, but a follow-up question:
> >>> I would be interested in the very same thing, but with the possibility
> >>> to assign new sites to existing clusters OR to new ones.
> >>>
> >>> Thanks in advance,
> >>> Ulrich
> >>>
> >>> -----Original Message-----
> >>> From: David Saile [mailto:david@uni-koblenz.de]
> >>> Sent: Monday, 9 May 2011 11:53
> >>> To: user@mahout.apache.org
> >>> Subject: Incremental clustering
> >>>
> >>> Hi list,
> >>>
> >>> I am completely new to Mahout, so please forgive me if the answer to my
> >>> question is too obvious.
> >>>
> >>> For a case study, I am working on a simple incremental web crawler
> >>> (much like Nutch) and I want to include a very simple indexing step that
> >>> incorporates clustering of documents.
> >>>
> >>> I was hoping to use some kind of incremental clustering algorithm, in
> >>> order to make use of the incremental way the crawler is supposed to work
> >>> (i.e. continuously adding and updating websites).
> >>>
> >>> Is there some way to achieve the following:
> >>>      1) initial clustering of the first web-crawl
> >>>      2) assigning new sites to existing clusters
> >>>      3) possibly moving modified sites between clusters
> >>>
> >>> I would really appreciate any help!
> >>>
> >>> Thanks,
> >>> David
> >>
> >
> > --------------------------
> > Grant Ingersoll
> > http://www.lucidimagination.com/
> >
> > Search the Lucene ecosystem docs using Solr/Lucene:
> > http://www.lucidimagination.com/search
> >
> >
>

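The scheme Grant describes in the quoted text (assign incoming docs to the closest existing cluster between full re-clusterings) can be sketched as follows. This is a minimal illustration rather than Mahout code. The cosine distance, the `max_distance` cutoff, and the centroid format (sparse dicts keyed by term index) are assumptions. Returning `None` for a document that is too far from every centroid is one hypothetical way to handle Ulrich's "existing clusters OR new ones" case.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two sparse vectors given as {index: weight}."""
    dot = sum(w * b.get(i, 0.0) for i, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def assign(vector, centroids, max_distance=0.7):
    """Return (cluster_id, distance) for the nearest centroid.

    If even the nearest centroid is farther than max_distance (an assumed,
    tunable cutoff), return (None, distance) so the caller can hold the
    document for a new cluster or for the next full clustering run.
    """
    best_id, best_dist = None, float("inf")
    for cluster_id, centroid in centroids.items():
        dist = cosine_distance(vector, centroid)
        if dist < best_dist:
            best_id, best_dist = cluster_id, dist
    if best_dist > max_distance:
        return None, best_dist
    return best_id, best_dist


# Hypothetical centroids left over from the last full clustering run.
centroids = {"A": {0: 1.0, 1: 0.2}, "B": {2: 1.0}}

near_a = assign({0: 2.0}, centroids)
near_b = assign({2: 3.0}, centroids)
unmatched = assign({5: 1.0}, centroids)
```

Counting how many documents come back unmatched is one possible trigger for the threshold-based re-clustering variant mentioned in the thread. A modified site can be handled by re-vectorizing it and calling `assign` again.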