Subject: Re: Updating clusters
From: Grant Ingersoll <gsingers@apache.org>
Date: Fri, 16 Jul 2010 12:06:42 -0400
To: user@mahout.apache.org


On Jul 16, 2010, at 11:27 AM, Asif Rahman wrote:

> Can anyone provide some advice on how to update an existing clustering with
> new data points.  Our data set is approximately 1mm newspaper headlines over
> the course of a month.  I'm able to get a high quality clustering using the
> existing mahout tasks (I'm just using canopy in this instance)

[OT] Care to share more (since you've already said you are using it)?
https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout

> but I'd like
> to update the clusters on an hourly basis.  Given the hardware that is
> available to me, I won't be able to run the clustering to completion over
> the entire data set every hour.  Are there any methods for completing such a
> task?

How many new docs are you talking about in that hour?  I'm sure others
can add here, but AIUI, people in this situation often calculate the
clusters once and then, for new docs arriving in some time period, just
see which existing cluster each new document is closest to and add it
there.  Then, offline or "later", they recluster the whole set.  So, for
instance, nightly or every 6 hours or whatever you can afford, you do
the whole job, and in between you just do the lighter-weight
calculation.  I imagine there are also ways of detecting when a new
cluster is needed or when quality has dropped too much, so that could be
used to trigger a new full run, too.
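
To make the in-between step concrete, here is a rough sketch in plain
Python.  It is not Mahout code and does not use any Mahout API.  The
function names, the choice of cosine distance, and both thresholds
(max_distance, max_unassigned_fraction) are assumptions made up for
illustration; in practice you would use whatever distance measure and
centroids your full clustering run produced.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat an all-zero vector as maximally dissimilar
    return 1.0 - dot / (norm_a * norm_b)


def assign_to_nearest(doc, centroids, max_distance):
    """Find the closest centroid from the last full run.

    Returns (cluster_index, distance). cluster_index is None when even
    the closest centroid is farther than max_distance (hypothetical
    threshold), i.e. the document does not fit any existing cluster.
    """
    best_index, best_distance = None, float("inf")
    for i, centroid in enumerate(centroids):
        d = cosine_distance(doc, centroid)
        if d < best_distance:
            best_index, best_distance = i, d
    if best_distance > max_distance:
        return None, best_distance
    return best_index, best_distance


def needs_full_recluster(assignments, max_unassigned_fraction):
    """Crude quality trigger (an assumption, not an established rule):
    ask for a full run once too many new docs fit no existing cluster."""
    if not assignments:
        return False
    unassigned = sum(1 for index, _ in assignments if index is None)
    return unassigned / len(assignments) > max_unassigned_fraction


# Toy usage: two centroids from the last full run, three new documents.
centroids = [[1.0, 0.0], [0.0, 1.0]]
new_docs = [[0.9, 0.1], [0.1, 0.8], [-1.0, -1.0]]
assignments = [assign_to_nearest(d, centroids, 0.5) for d in new_docs]
recluster = needs_full_recluster(assignments, 0.25)
```

The hourly pass is then just one distance computation per new doc per
centroid, which is far cheaper than reclustering the whole month, and
the unassigned fraction is one simple signal for scheduling the full
job earlier than planned.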

>
> Since I'm not a mahout or linear algebra expert at this point, ideally the
> solution would involve a combination of the existing mahout tasks.  That
> being said, I'd be appreciative of any and all advice.
>
> Thanks,
>
> Asif
>
>
> --
> Asif Rahman
> Lead Engineer - NewsCred
> asif@newscred.com
> http://platform.newscred.com