Subject: Re: Updating clusters
From: Grant Ingersoll
Date: Fri, 16 Jul 2010 12:06:42 -0400
To: user@mahout.apache.org
Message-Id: <381EE0FA-899E-44F6-A36A-CE28BC30323B@apache.org>

On Jul 16, 2010, at 11:27 AM, Asif Rahman wrote:

> Can anyone provide some advice on how to update an existing clustering with
> new data points.
> Our data set is approximately 1mm newspaper headlines over
> the course of a month. I'm able to get a high quality clustering using the
> existing mahout tasks (I'm just using canopy in this instance)

[OT] Care to share more (since you've already said you are using it)?
https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout

> but I'd like
> to update the clusters on an hourly basis. Given the hardware that is
> available to me, I won't be able to run the clustering to completion over
> the entire data set every hour. Are there any methods for completing such a
> task?

How many new docs are you talking about in that hour? I'm sure others can
add here, but AIUI, people in this situation often calculate the clusters
once and then, for new docs arriving in some time period, just see which
existing cluster each new document is closest to and add it there. Then,
offline or "later", they recluster the whole set. So, for instance, perhaps
nightly or every 6 hours or whatever you can afford, you do the whole job,
and in between you just do the lighter-weight calculation. I imagine there
are ways of calculating when a new cluster is needed or when quality has
dropped too much, so perhaps that could be used to trigger a new full run,
too.

> Since I'm not a mahout or linear algebra expert at this point, ideally the
> solution would involve a combination of the existing mahout tasks. That
> being said, I'd be appreciative of any and all advice.
>
> Thanks,
>
> Asif
>
> --
> Asif Rahman
> Lead Engineer - NewsCred
> asif@newscred.com
> http://platform.newscred.com
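
For illustration, the in-between step described in the reply (put each new
doc in its closest existing cluster, and use the share of poorly fitting
docs as a trigger for a full run) might look roughly like the sketch below.
This is plain Python, not Mahout code: the function names, the cosine
distance measure, the dense-vector representation, and both thresholds are
assumptions chosen only to make the idea concrete.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat an all-zero vector as unrelated to everything
    return 1.0 - dot / (norm_a * norm_b)


def assign_to_nearest(doc, centroids, max_distance):
    """Return (index of the closest centroid, distance).

    The index is None when even the closest centroid is farther away
    than max_distance, i.e. the doc does not fit any existing cluster.
    """
    best_index, best_distance = None, float("inf")
    for i, centroid in enumerate(centroids):
        d = cosine_distance(doc, centroid)
        if d < best_distance:
            best_index, best_distance = i, d
    if best_distance > max_distance:
        return None, best_distance
    return best_index, best_distance


def hourly_update(new_docs, centroids, max_distance=0.5, recluster_fraction=0.2):
    """Assign new docs to existing clusters without reclustering.

    Returns (assignments, needs_full_run). needs_full_run is True when the
    fraction of docs that fit no cluster exceeds recluster_fraction
    (a hypothetical quality trigger, not something Mahout provides).
    """
    assignments = [assign_to_nearest(doc, centroids, max_distance)[0]
                   for doc in new_docs]
    unassigned = sum(1 for a in assignments if a is None)
    needs_full_run = bool(new_docs) and unassigned / len(new_docs) > recluster_fraction
    return assignments, needs_full_run
```

The centroids would come from the last full clustering run, and the new docs
would need to be vectorized the same way as the original set. The cheap
assignment pass runs hourly; the expensive full job runs nightly or whenever
needs_full_run comes back True.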