From: Andrzej Bialecki
Date: Thu, 08 May 2008 17:27:30 +0200
To: mahout-user@lucene.apache.org
Subject: Re: Clustering Demo
Message-ID: <48231BE2.6020208@getopt.org>
In-Reply-To: <4861FF56-52FA-41B9-B02A-B73B245D8936@apache.org>

Grant Ingersoll wrote:
> Anyone have any sample code or demo of running the clustering over a
> large collection of documents that they could share? Mainly looking for
> an example of taking some corpus, converting it into the appropriate
> Mahout representation and then running either the k-means or the canopy
> clustering on it.

It would be way cool to do this with the industry standard 20 newsgroups
corpus - there have been many experiments and evaluations of this
corpus, so it's good as a baseline.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com