From: Karl Wettin
Date: Thu, 08 May 2008 17:56:29 +0200
To: mahout-user@lucene.apache.org
Subject: Re: Clustering Demo

Grant Ingersoll wrote:
> Anyone have any sample code or demo of running the clustering over a
> large collection of documents that they could share? Mainly looking for
> an example of taking some corpus, converting it into the appropriate
> Mahout representation and then running either the k-means or the canopy
> clustering on it.

There is the rule-based data set generation in MAHOUT-43.

http://www.datasetgenerator.com

Push a few buttons and you have an insane amount of OK test data
according to your specifications. That is what I have been using.

I also have a contact at a company that produces news article data for
indexing. The data is nicely organized, and they have previously offered
to look into committer access to it for local tests.

I have a number of data sets, but I'm not certain who owns them.
For instance, I've been gathering real estate data for Sweden for some
time, as the sites I was using to find an apartment did not work the way
I wanted them to :)


karl