Date: Thu, 2 Jul 2009 11:33:20 -0300
Subject: Re: Clustering from DB
From: nfantone <nfantone@gmail.com>
To: mahout-user@lucene.apache.org

After some research and testing, I believe I can shed some light on
the subject. The static runJob() method defined in KMeansDriver
expects three paths, each referring to files with a different
logical record format; in addition, a "points" directory is created,
along with other files, as part of the output:

1) input

Description: A file containing the data to be clustered, represented as Vectors.
Path: A path to a data file in HDFS. Example: "input/thedata.dat"
Logical format: <ID, Vector>. The ID can be anything, as long as it
implements Writable.
Code example (writing an input file):

// Get the FileSystem through a Configuration
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);

// Instantiate a writer to store the input data in a .dat file
// with a <Text, SparseVector> logical format
String fileName = "input/thedata.dat";
Path path = new Path(fileName);

SequenceFile.Writer seqVectorWriter = new SequenceFile.Writer(fs,
		conf, path, Text.class, SparseVector.class);
VectorWriter writer = new SequenceFileVectorWriter(seqVectorWriter);

// Write Vectors to the file. inputVectors could be any VectorIterable
// implementation.
writer.write(inputVectors);
writer.close();

2) clustersIn

Description: A file containing the initial pre-computed (or randomly
selected) clusters to be used by kMeans. The 'k' value is determined
by the number of clusters in THIS file.
Path: A path to a DIRECTORY containing any number of files named in
the "part-xxxxx" format, where each 'x' is a single digit. The file
name should be omitted from the path. Example: "input/initial", where
'initial' has a "part-00000" file stored in it.
Logical format: <ID, ClusterBase>. The ID can be anything, as long as
it implements Writable.
Code example (writing a clustersIn file):

// Get the FileSystem through a Configuration
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);

// Instantiate a writer to store the clusters in a file
// with a <Text, Cluster> logical format
String fileName = "input/initial/part-00000";
Path path = new Path(fileName);

SequenceFile.Writer seqClusterWriter = new SequenceFile.Writer(fs,
		conf, path, Text.class, Cluster.class);

// We choose the first 'k' Vectors as centers for the initial clusters.
// 'inputVectors' could be any VectorIterable implementation.
// CANT_INITIAL_CLUSTERS is the desired integer value of 'k'.
// The identifier of a Cluster is used as its ID.
// AFAICT, you DO NOT need to add the center as an actual point in the
// cluster after cluster creation. This has been corrected recently.
int k = 0;
Iterator it = inputVectors.iterator();
while (it.hasNext() && k++ < CANT_INITIAL_CLUSTERS) {
	Vector v = (Vector) it.next();
	Cluster c = new Cluster(v);
	seqClusterWriter.append(new Text(c.getIdentifier()), c);
}
seqClusterWriter.close();
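
As a side note on the "part-xxxxx" naming described above, here is a
minimal plain-Java sketch (no Hadoop or Mahout needed) that builds and
checks such names. PartFiles and its methods are hypothetical helpers
I made up for illustration, and the five-digit width is simply taken
from the "part-00000" example.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Mahout or Hadoop.
final class PartFiles {
    // Assumption: part files are named "part-" followed by exactly five digits.
    private static final Pattern PART_NAME = Pattern.compile("part-\\d{5}");

    private PartFiles() {}

    // Builds the file name for partition n, zero-padded to five digits.
    static String partFileName(int n) {
        if (n < 0 || n > 99999) {
            throw new IllegalArgumentException("partition out of range: " + n);
        }
        return String.format(Locale.ROOT, "part-%05d", n);
    }

    // Joins a directory such as "input/initial" with the part file name.
    static String partFilePath(String directory, int n) {
        String dir = directory.endsWith("/") ? directory : directory + "/";
        return dir + partFileName(n);
    }

    // Tells whether a bare file name follows the "part-xxxxx" format.
    static boolean isPartFile(String name) {
        return PART_NAME.matcher(name).matches();
    }
}
```

With this, partFilePath("input/initial", 0) gives the file name used in
the writer example, while "input/initial" alone is what gets passed as
clustersIn.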

3) output

Description: The output files generated by the algorithm, in which the
results are stored. Directories named "clusters-i" ('i' being a
non-negative integer) are created. I'm not quite certain, but I believe
the numbering comes from the number of map/reduce tasks involved.
A "part-00000" file is placed in each of those directories; it holds
records logically structured as <Text, Cluster>, each of which
represents a particular cluster in the dataset.
Path: A path to the parent directory of the "clusters-i"
directories. Example: "output".
Code example (reading and printing an output file):

// Get the FileSystem through a Configuration
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);

// Create a reader for a 'part-00000' file
Path outPath = new Path("output/clusters-0/part-00000");
SequenceFile.Reader reader = new SequenceFile.Reader(fs, outPath, conf);

Writable key = (Writable) reader.getKeyClass().newInstance();
Cluster value = new Cluster();
Vector center = null;

// Read the file's records and print each cluster as 'Cluster: key { center }'
while (reader.next(key, value)) {
	System.out.print("Cluster: " + key + " { ");
	center = value.getCenter();

	for (int i = 0; i < center.size(); i++) {
		System.out.print(center.get(i) + " ");
	}
	System.out.println("}");
}
reader.close();
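
Since several "clusters-i" directories can show up under the output
path, it may help to pick one programmatically. Below is a rough
plain-Java sketch that, given the directory names, returns the one with
the highest 'i'. ClusterDirs and highestNumbered are hypothetical names
of mine, and whether the highest-numbered directory actually holds the
final clusters is an assumption I have not verified.

```java
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Mahout or Hadoop.
final class ClusterDirs {
    // Matches "clusters-0", "clusters-1", ... and captures the number.
    private static final Pattern CLUSTERS_DIR = Pattern.compile("clusters-(\\d{1,9})");

    private ClusterDirs() {}

    // Returns the "clusters-i" name with the largest i, comparing numerically
    // so that "clusters-10" ranks above "clusters-9". Other names are ignored.
    static Optional<String> highestNumbered(List<String> names) {
        String best = null;
        int bestIndex = -1;
        for (String name : names) {
            Matcher m = CLUSTERS_DIR.matcher(name);
            if (!m.matches()) {
                continue; // e.g. the "points" directory
            }
            int index = Integer.parseInt(m.group(1));
            if (index > bestIndex) {
                bestIndex = index;
                best = name;
            }
        }
        return Optional.ofNullable(best);
    }
}
```

The names themselves would come from listing the output directory; the
sketch leaves that part out on purpose.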

4) points

Description: A directory containing a "part-00000" file with
<VectorID, ClusterID> records (both fields being of type Text). It's
basically an index (with VectorID as key) that matches every Vector
described in the input ("thedata.dat" in our example) with the cluster
it now belongs to.
Logical format: <VectorID, ClusterID>. VectorID matches the ID
specified in the first field of each record in the input file.
ClusterID matches the ID in the first field of each record of the
"part-xxxxx" files included in a "clusters-i" directory.
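
Once the <VectorID, ClusterID> records are loaded into memory, the
index is easy to turn around so that each cluster lists its members.
Here is a minimal plain-Java sketch of that step; it assumes the
records have already been read into a Map of Strings, and PointsIndex
and groupByCluster are hypothetical names, not Mahout classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration only; not part of Mahout or Hadoop.
final class PointsIndex {
    private PointsIndex() {}

    // Inverts a VectorID -> ClusterID index into ClusterID -> list of VectorIDs.
    // Clusters come out sorted by ID; members keep the input map's iteration order.
    static Map<String, List<String>> groupByCluster(Map<String, String> vectorToCluster) {
        Map<String, List<String>> byCluster = new TreeMap<>();
        for (Map.Entry<String, String> entry : vectorToCluster.entrySet()) {
            byCluster
                .computeIfAbsent(entry.getValue(), id -> new ArrayList<>())
                .add(entry.getKey());
        }
        return byCluster;
    }
}
```

Filling the input map would be a matter of reading the "points" file
with a SequenceFile.Reader, much like the output example above, using
Text for both key and value.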

That's that, for now. Surely, this is not error-proof and should be
revised and improved, but it could very well serve as a start for a
documentation page. try/catch statements were omitted from the code for
the sake of clarity. Comments, suggestions and corrections are,
obviously, welcome.

On Thu, Jul 2, 2009 at 12:32 AM, Grant Ingersoll <gsingers@apache.org> wrote:
>
> On Jul 1, 2009, at 9:37 AM, nfantone wrote:
>
>> Ok, so I managed to write a VectorIterable implementation to draw data
>> from my database. Now, I'm in the process of understanding the output
>> file that kMeans (with a Canopy input) produces. Someone, please,
>> correct me if I'm mistaken. At first, my thought was that there were
>> as many "cluster-i" directories as clusters detected from the dataset
>> by the algorithm(s), until I printed out the content of the
>> "part-00000" file in them. It seems as though it stores a <Writable>
>> cluster ID and then a <Writable> Cluster, each line. Are those all the
>> actual clusters detected? If so, what's the reason behind the
>> directory nomenclature and its consecutive enumeration?
>
> I was wondering the same thing myself. I believe it has to do with the
> number of iterations or reduce tasks, but I haven't looked closely at the
> code yet. Maybe Jeff can jump in here.
>
>
>> Does every
>> "part-00000", in different "cluster-i" directories, hold different
>> clusters? And, what about the "points" directory? I can tell it
>> follows a <VectorID, Value> register format. What's that value
>> supposed to represent? The ID from the cluster it belongs, perhaps?
>
> I believe this is the case.
>
>>
>> There really ought to be documentation about this somewhere. I don't
>> know if I need some kind of permission, but I'm offering myself to
>> write it and upload it to the Mahout wiki or wherever it should be,
>> once I finished my project.
>>
>
> +1
>
>> Thanks in advance.
>>
>> On Fri, Jun 26, 2009 at 1:54 PM, Sean Owen<srowen@gmail.com> wrote:
>>>
>>> All of Mahout is generally Hadoop/HDFS based. Taste is a bit of
>>> exception since it has a core that is independent of Hadoop and can
>>> use data from files, databases, etc. It also happens to have some
>>> clustering logic. So you can use, say, TreeClusteringRecommender to
>>> generate user clusters, based on data in a database. This isn't
>>> Mahout's primary clustering support, but, if it fits what you need, at
>>> least it is there.
>>>
>>> On Fri, Jun 26, 2009 at 12:21 PM, nfantone<nfantone@gmail.com> wrote:
>>>>
>>>> Thanks for the fast response, Grant.
>>>>
>>>> I am aware of what you pointed out about Taste. I just mentioned it to
>>>> make a reference to something similar to what I needed to
>>>> implement/use, namely the "DataModel" interface.
>>>>
>>>> I'm going to try the solution you suggested and write an
>>>> implementation of VectorIterable. Expect me to come back here for
>>>> feedback.
>>>
>