From: nfantone
To: mahout-user@lucene.apache.org
Subject: Re: Clustering from DB
Date: Wed, 1 Jul 2009 10:37:38 -0300
Message-ID: <37ffc8080907010637v483ec7d6k8de9e746eda69dec@mail.gmail.com>

Ok, so I managed to write a VectorIterable implementation to draw data from my database. Now I'm in the process of understanding the output files that kMeans (with a Canopy input) produces. Someone please correct me if I'm mistaken.

At first, my thought was that there were as many "cluster-i" directories as clusters detected from the dataset by the algorithm(s), until I printed out the content of the "part-00000" file in them. It seems to store a cluster ID followed by a Cluster on each line. Are those all the actual clusters detected? If so, what's the reason behind the directory naming and its consecutive numbering? Does every "part-00000", in different "cluster-i" directories, hold different clusters?

And what about the "points" directory? I can tell it follows a register format. What's that value supposed to represent? The ID of the cluster it belongs to, perhaps?

There really ought to be documentation about this somewhere. I don't know if I need some kind of permission, but I'm offering to write it and upload it to the Mahout wiki, or wherever it should go, once I've finished my project.

Thanks in advance.

On Fri, Jun 26, 2009 at 1:54 PM, Sean Owen wrote:
> All of Mahout is generally Hadoop/HDFS based. Taste is a bit of
> exception since it has a core that is independent of Hadoop and can
> use data from files, databases, etc. It also happens to have some
> clustering logic. So you can use, say, TreeClusteringRecommender to
> generate user clusters, based on data in a database. This isn't
> Mahout's primary clustering support, but, if it fits what you need, at
> least it is there.
>
> On Fri, Jun 26, 2009 at 12:21 PM, nfantone wrote:
>> Thanks for the fast response, Grant.
>>
>> I am aware of what you pointed out about Taste. I just mentioned it to
>> make a reference to something similar to what I needed to
>> implement/use, namely the "DataModel" interface.
>>
>> I'm going to try the solution you suggested and write an
>> implementation of VectorIterable. Expect me to come back here for
>> feedback.