Date: Thu, 2 Jul 2009 11:33:20 -0300
Subject: Re: Clustering from DB
From: nfantone
To: mahout-user@lucene.apache.org

After some research and testing, I believe I can shed some light on
the subject. The runJob() static method defined in KMeansDriver expects
three file paths, referencing three different files with different
logical record formats; moreover, a "points" directory, along with
other files, is created as part of the output:

1) input

Description: A file containing the data to be clustered, represented by
Vectors.
Path: An absolute path to an HDFS data file.
Example: "input/thedata.dat"
Logical format: <ID, Vector>. The ID could be anything as long as it
implements Writable.
Code example (writing an input file):

  // Hadoop imports used by the examples in this message. The Mahout
  // classes (SparseVector, Cluster, Vector, VectorWriter,
  // SequenceFileVectorWriter) come from the Mahout release in use.
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.Writable;

  // Get FileSystem through Configuration
  Configuration conf = new Configuration();
  FileSystem fs = FileSystem.get(conf);

  // Instantiate writer to input data in a .dat file
  // with a <ID, Vector> logical format
  String fileName = "input/thedata.dat";
  Path path = new Path(fileName);
  SequenceFile.Writer seqVectorWriter = new SequenceFile.Writer(fs, conf,
      path, Text.class, SparseVector.class);
  VectorWriter writer = new SequenceFileVectorWriter(seqVectorWriter);

  // Write Vectors to file. inputVectors could be any VectorIterable
  // implementation.
  writer.write(inputVectors);
  writer.close();

2) clustersIn

Description: A file containing the initial pre-computed (or randomly
selected) clusters to be used by kMeans. The 'k' value is determined by
the number of clusters in THIS file.
Path: An absolute path to a DIRECTORY containing any number of files
with a "part-xxxxx" name format, where each 'x' is a single digit. The
file name should be omitted from the path.
Example: "input/initial", where 'initial' has a "part-00000" file
stored in it.
Logical format: <ID, Cluster>. The ID could be anything as long as it
implements Writable.

Code example (writing a clustersIn file):

  // Get FileSystem through Configuration
  Configuration conf = new Configuration();
  FileSystem fs = FileSystem.get(conf);

  // Instantiate writer to input clusters in a file with a
  // <ID, Cluster> logical format
  String fileName = "input/initial/part-00000";
  Path path = new Path(fileName);
  SequenceFile.Writer seqClusterWriter = new SequenceFile.Writer(fs, conf,
      path, Text.class, Cluster.class);

  // We take the first CANT_INITIAL_CLUSTERS Vectors returned by the
  // iterator as centers for the initial clusters (an arbitrary choice).
  // 'inputVectors' could be any VectorIterable implementation.
  // CANT_INITIAL_CLUSTERS is the desired number of initial clusters ('k').
  // The identifier of a Cluster is used as its ID.
  // AFAICT, you DO NOT need to add the center as an actual point in the
  // cluster after cluster creation. This has been corrected recently.
  int k = 0;
  Iterator it = inputVectors.iterator();
  while (it.hasNext() && k++ < CANT_INITIAL_CLUSTERS) {
    Vector v = (Vector) it.next();
    Cluster c = new Cluster(v);
    seqClusterWriter.append(new Text(c.getIdentifier()), c);
  }
  seqClusterWriter.close();

3) output

Description: The output files generated by the algorithm, in which the
results are stored. Directories named "clusters-i" ('i' being a
non-negative integer) are created. I'm not quite certain, but I believe
the numbering comes from the number of map/reduce tasks involved.
"part-00000" files are placed in those directories; they hold records
logically structured as <ID, Cluster>, each of which represents a
given cluster in the dataset.
Path: An absolute path to a parent directory for the "clusters-i"
directories.
Example: "output".

Code example (reading and printing an output file):

  // Get FileSystem through Configuration
  Configuration conf = new Configuration();
  FileSystem fs = FileSystem.get(conf);

  // Create a reader for a 'part-00000' file
  Path outPath = new Path("output/clusters-0/part-00000");
  SequenceFile.Reader reader = new SequenceFile.Reader(fs, outPath, conf);
  Writable key = (Writable) reader.getKeyClass().newInstance();
  Cluster value = new Cluster();
  Vector center = null;

  // Read the file's records and print each cluster as
  // 'Cluster: key { center }'
  while (reader.next(key, value)) {
    System.out.print("Cluster: " + key + " { ");
    center = value.getCenter();
    for (int i = 0; i < center.size(); i++) {
      System.out.print(center.get(i) + " ");
    }
    System.out.println("}");
  }
  reader.close();

4) points

Description: A directory containing a "part-00000" file with
<VectorID, ClusterID> records (both being Text type fields). It's
basically an index (with VectorID as key) that matches every Vector
described in the input ("thedata.dat" in our example) with the cluster
it now belongs to.
Logical format: <VectorID, ClusterID>. VectorID matches the ID
specified by the first field of each record in the input file.
ClusterID matches the ID in the first field of each record of the
"part-xxxxx" files included in a "clusters-i" directory.

That's that, for now. Surely, this is not error-proof and should be
revised and improved, but it could very well serve as a start for a
documentation page. try/catch statements were omitted from the code for
clarity's sake. Comments, suggestions and corrections are, obviously,
welcome.

On Thu, Jul 2, 2009 at 12:32 AM, Grant Ingersoll wrote:
>
> On Jul 1, 2009, at 9:37 AM, nfantone wrote:
>
>> Ok, so I managed to write a VectorIterable implementation to draw data
>> from my database. Now, I'm in the process of understanding the output
>> file that kMeans (with a Canopy input) produces. Someone, please,
>> correct me if I'm mistaken. At first, my thought was that there were
>> as many "cluster-i" directories as clusters detected from the dataset
>> by the algorithm(s), until I printed out the content of the
>> "part-00000" file in them. It seems as though it stores a
>> cluster ID and then a Cluster, each line. Are those all the
>> actual clusters detected? If so, what's the reason behind the
>> directory nomenclature and its consecutive enumeration?
>
> I was wondering the same thing myself.  I believe it has to do with the
> number of iterations or reduce tasks, but I haven't looked closely at the
> code yet.  Maybe Jeff can jump in here.
>
>
>> Does every
>> "part-00000", in different "cluster-i" directories, hold different
>> clusters? And, what about the "points" directory? I can tell it
>> follows a register format. What's that value
>> supposed to represent? The ID from the cluster it belongs, perhaps?
>
> I believe this is the case.
>
>>
>> There really ought to be documentation about this somewhere. I don't
>> know if I need some kind of permission, but I'm offering myself to
>> write it and upload it to the Mahout wiki or wherever it should be,
>> once I finished my project.
>>
>
> +1
>
>> Thanks in advance.
>>
>> On Fri, Jun 26, 2009 at 1:54 PM, Sean Owen wrote:
>>>
>>> All of Mahout is generally Hadoop/HDFS based. Taste is a bit of
>>> exception since it has a core that is independent of Hadoop and can
>>> use data from files, databases, etc. It also happens to have some
>>> clustering logic. So you can use, say, TreeClusteringRecommender to
>>> generate user clusters, based on data in a database. This isn't
>>> Mahout's primary clustering support, but, if it fits what you need, at
>>> least it is there.
>>>
>>> On Fri, Jun 26, 2009 at 12:21 PM, nfantone wrote:
>>>>
>>>> Thanks for the fast response, Grant.
>>>>
>>>> I am aware of what you pointed out about Taste. I just mentioned it to
>>>> make a reference to something similar to what I needed to
>>>> implement/use, namely the "DataModel" interface.
>>>>
>>>> I'm going to try the solution you suggested and write an
>>>> implementation of VectorIterable. Expect me to come back here for
>>>> feedback.
>>>
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
> Solr/Lucene:
> http://www.lucidimagination.com/search
>
>
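
PS: The "points" index described in 4) maps each VectorID to a ClusterID, so a common next step is to invert it and list the members of every cluster. The following is only a minimal sketch of that step in plain Java, with no Hadoop or Mahout calls. It assumes the records of the "part-00000" file in the points directory have already been read into a Map (for instance with a SequenceFile.Reader, as in the output example). The class name PointsIndex, the method groupByCluster and the sample IDs are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PointsIndex {

    // Inverts a VectorID -> ClusterID index into ClusterID -> list of VectorIDs.
    // Cluster IDs come out sorted; vector IDs keep the order of the input map.
    static Map<String, List<String>> groupByCluster(Map<String, String> points) {
        Map<String, List<String>> members = new TreeMap<String, List<String>>();
        for (Map.Entry<String, String> record : points.entrySet()) {
            String clusterId = record.getValue();
            List<String> vectorIds = members.get(clusterId);
            if (vectorIds == null) {
                vectorIds = new ArrayList<String>();
                members.put(clusterId, vectorIds);
            }
            vectorIds.add(record.getKey());
        }
        return members;
    }

    public static void main(String[] args) {
        // Hypothetical sample records; real ones would come from the points file.
        Map<String, String> points = new LinkedHashMap<String, String>();
        points.put("v1", "C0");
        points.put("v2", "C1");
        points.put("v3", "C0");

        for (Map.Entry<String, List<String>> e : groupByCluster(points).entrySet()) {
            System.out.println("Cluster " + e.getKey() + ": " + e.getValue());
        }
    }
}
```

With the sample records above, this prints "Cluster C0: [v1, v3]" followed by "Cluster C1: [v2]". A TreeMap is used only to get a stable, sorted listing of cluster IDs; any Map would do if order does not matter.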