mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: What are the best settings for my clustering task
Date Thu, 03 Oct 2013 02:38:14 GMT
Yes.  That will work.

The sketch will then contain about 10,000 x log N centroids.  If N = 10^9,
log N (base 2) is approximately 30, so the sketch will have about 300,000
weighted centroids in it.  The final clustering will have to process these
centroids to produce the desired 5,000 clusters.  Since 300,000 is a
relatively small number of data points, this clustering step should proceed
relatively quickly.
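
The arithmetic above can be checked with a short sketch.  It assumes the
logarithm is base 2 (the only common base for which log 10^9 comes out near
30); the function name sketch_size is hypothetical and is not part of Mahout.

```python
import math


def sketch_size(k_upper_bound: int, n_points: int) -> int:
    """Rough size of the sketch: k * log2(N) weighted centroids.

    Assumption: base-2 logarithm, inferred from "log N is about 30 for N = 10^9".
    This is only the estimate from the message, not Mahout's actual sizing code.
    """
    return round(k_upper_bound * math.log2(n_points))


if __name__ == "__main__":
    n = 10**9
    print(f"log2(N) = {math.log2(n):.1f}")
    print(f"estimated sketch size = {sketch_size(10_000, n):,}")
```

With an upper bound of 10,000 and N = 10^9 the estimate lands just under
300,000, which matches the figure quoted above.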



On Wed, Oct 2, 2013 at 10:21 AM, Jens Bonerz <jbonerz@googlemail.com> wrote:

> Thanks for your elaborate answer.
>
> So if the upper bound on the final number of clusters is unknown in the
> beginning, what would happen if I define a very high number that is
> guaranteed to be greater than the estimated number of clusters?
> For example, if I set it to 10,000 clusters when an estimate of 5,000 is
> likely, will that work?
>
>
> 2013/10/2 Ted Dunning <ted.dunning@gmail.com>
>
> > The way that the new streaming k-means works is that there is a first
> > sketch pass which only requires an upper bound on the final number of
> > clusters you will want.  It adaptively creates more or fewer clusters
> > depending on the data and your bound.  This sketch is guaranteed to be
> > computed within at most one map-reduce pass.  There is a threaded version
> > that runs (fast) on a single machine.  The threaded version is liable to
> > be faster than the map-reduce version for moderate or smaller data sizes.
> >
> > That sketch can then be used to do all kinds of things that rely on
> > Euclidean distance and still get results within a small factor of the
> > same algorithm applied to all of the data.  Typically this second phase
> > is a ball k-means algorithm, but it could easily be a dp-means algorithm
> > [1] if you want a variable number of clusters.  Indeed, you could run
> > many dp-means passes with different values of lambda on the same sketch.
> > Note that the sketch is small enough that in-memory clustering is
> > entirely viable and is very fast.
> >
> > For the problem you describe, however, you probably don't need the sketch
> > approach at all and can probably apply ball k-means or dp-means directly.
> > Running many k-means clusterings with differing values of k should be
> > entirely feasible as well with such data sizes.
> >
> > [1] http://www.cs.berkeley.edu/~jordan/papers/kulis-jordan-icml12.pdf
> >
> >
> >
> >
> > On Wed, Oct 2, 2013 at 9:11 AM, Jens Bonerz <jbonerz@googlemail.com>
> > wrote:
> >
> > > Isn't the streaming k-means just a different approach to crunch through
> > > the data? In other words, the result of streaming k-means should be
> > > comparable to using k-means in multiple chained map reduce cycles?
> > >
> > > I just read a paper about the k-means clustering and its underlying
> > > algorithm.
> > >
> > > According to that paper, k-means relies on a predefined number of
> > > clusters as an input parameter.
> > >
> > > Link: http://books.nips.cc/papers/files/nips22/NIPS2009_1085.pdf
> > >
> > > In my current scenario however, the number of clusters is unknown at
> > > the beginning.
> > >
> > > Maybe k-means is just not the right algorithm for clustering similar
> > > products based on their short description text? What else could I use?
> > >
> > >
> > >
> > >
> > > 2013/10/1 Ted Dunning <ted.dunning@gmail.com>
> > >
> > > > At such small sizes, I would guess that the sequential version of the
> > > > streaming k-means or ball k-means would be better options.
> > > >
> > > >
> > > >
> > > > On Mon, Sep 30, 2013 at 2:14 PM, mercutio7979
> > > > <jbonerz@googlemail.com> wrote:
> > > >
> > > > > Hello all,
> > > > >
> > > > > I am currently trying to create clusters from a group of 50,000
> > > > > strings that contain product descriptions (around 70-100 characters
> > > > > in length each).
> > > > >
> > > > > That group of 50,000 consists of roughly 5,000 individual products
> > > > > and ten varying product descriptions per product. The product
> > > > > descriptions are already prepared for clustering and contain a
> > > > > normalized brand name, product model number, etc.
> > > > >
> > > > > What would be a good approach to maximise the number of clusters
> > > > > found? (The best possible value would be 5,000 clusters with 10
> > > > > products each.)
> > > > >
> > > > > I adapted the reuters cluster script to read in my data and managed
> > > > > to create a first set of clusters. However, I have not managed to
> > > > > maximise the cluster count.
> > > > >
> > > > > The question is: what do I need to tweak with regard to the
> > > > > available mahout settings, so the clusters are created as precisely
> > > > > as possible?
> > > > >
> > > > > Many regards!
> > > > > Jens
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > View this message in context:
> > > > > http://lucene.472066.n3.nabble.com/What-are-the-best-settings-for-my-clustering-task-tp4092807.html
> > > > > Sent from the Mahout User List mailing list archive at Nabble.com.
>
>
>
