Mailing-List: contact user-help@mahout.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@mahout.apache.org
MIME-Version: 1.0
In-Reply-To: <AANLkTi=w8w8r0yiJcRWyB9Sxuz2+G6ZX1fbuNy04O3RO@mail.gmail.com>
References: <AANLkTikVjy8Ny3nGgJ=g0v7Dtr=9rtnpLEPh-BE8xxAA@mail.gmail.com>
 <AANLkTimd-bj1y9s2vUamX14_U3jwDGEyrvV-GefOEJp9@mail.gmail.com>
 <AANLkTi=w8w8r0yiJcRWyB9Sxuz2+G6ZX1fbuNy04O3RO@mail.gmail.com>
From: Robin Anil <robin.anil@gmail.com>
Date: Sun, 20 Feb 2011 15:15:28 +0530
Message-ID: <AANLkTi=ygPuKmtc2v4KPNny0+h-e5h0YOTs9=Onnt7Kv@mail.gmail.com>
Subject: Re: How to get the Id List of items which belong to a cluster.
To: user@mahout.apache.org
Cc: Kidong Lee <mykidong@gmail.com>
Content-Type: multipart/alternative; boundary=001485f1e28aa7476f049cb39ccc

--001485f1e28aa7476f049cb39ccc
Content-Type: text/plain; charset=UTF-8

Hi Kidong, here the key is the id of the nearest cluster and the value is the
vector. I am guessing the identifier is getting dropped somehow, which looks
like a bug. Can you confirm that you created ids for the vectors you used and
wrapped them in a named vector?
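
To illustrate why the name matters: once every clustered point still carries
its item id, collecting the ids per cluster is a plain grouping step. Below is
a minimal sketch in plain Java. It does not use Mahout's classes;
ClusteredPoint is a hypothetical stand-in for one (cluster id, named vector)
record, and the pairing of item ids to cluster ids is made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ClusterMembers {

    // Hypothetical stand-in for one clusteredPoints record: the key (cluster id)
    // plus the name that a named vector would carry (the item id).
    static final class ClusteredPoint {
        final int clusterId;
        final String itemId;

        ClusteredPoint(int clusterId, String itemId) {
            this.clusterId = clusterId;
            this.itemId = itemId;
        }
    }

    // Groups item ids by cluster id; TreeMap keeps clusters in ascending order.
    static Map<Integer, List<String>> idsByCluster(List<ClusteredPoint> points) {
        Map<Integer, List<String>> result = new TreeMap<>();
        for (ClusteredPoint p : points) {
            result.computeIfAbsent(p.clusterId, k -> new ArrayList<>()).add(p.itemId);
        }
        return result;
    }

    public static void main(String[] args) {
        // Illustrative pairing only: which item landed in which cluster is assumed.
        List<ClusteredPoint> points = Arrays.asList(
                new ClusteredPoint(45, "1223"),
                new ClusteredPoint(35, "1344"));
        System.out.println(idsByCluster(points));
    }
}
```

If the name is missing from the vectors, there is nothing to group on, which
matches the output you posted.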


Robin

On Wed, Feb 16, 2011 at 6:31 AM, Kidong Lee <mykidong@gmail.com> wrote:

> Thank you for your reply, Robin.
>
> I actually got the sequence file in the clusteredPoints directory like
> this:
>
> Input Path:
> /user/root/item-contents-sample/cluster/out/clusteredPoints/part-m-00000
> Key class: class org.apache.hadoop.io.IntWritable Value Class: class
> org.apache.mahout.clustering.WeightedVectorWritable
> Key: 45: Value: 1.0: [120:3.211]
> Key: 35: Value: 1.0: [93:5.394, 120:3.211]
> Key: 45: Value: 1.0: [120:3.211]
> Key: 35: Value: 1.0: [93:5.394, 120:3.211]
> ...
>
> Key is the cluster id, and I think the Value is not a mapping to the item
> id, but a mapping from each token's index in the dictionary file to the
> tf-idf weight calculated during vectorization.
>
> Since I could not find a simple API in Mahout to get the item ids in a
> cluster, I worked around it as follows:
>
> First, I wrote a Hadoop M/R job to parse the vector sequence file and
> produce a CSV file (item-id, dic-token-value:tf-idf-weight).
> Second, I wrote another Hadoop M/R job to parse the clustered points
> sequence file and produce a CSV file (cluster-id,
> dic-token-value:tf-idf-weight).
> Finally, using Pig, I joined the vector CSV file and the cluster CSV file
> on dic-token-value:tf-idf-weight and grouped by cluster-id and item-id,
> which gave me the pairs of cluster-id and item-id in the output.
>
> - Kidong.
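
The join described above can be sketched in a few lines of plain Java. This is
only an illustration of the idea, not the actual M/R and Pig jobs: the class
and method names are hypothetical, rows are modeled as {id, features} string
pairs, and the pairing of item ids to feature strings is made up. Note that
items with identical feature strings match the same cluster rows, so the
result is collected into sets to drop duplicate pairs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class FeatureJoin {

    // itemRows:    {itemId, features}    (from the vector CSV)
    // clusterRows: {clusterId, features} (from the clustered points CSV)
    // Joins both inputs on the features string and groups item ids by cluster id.
    static Map<Integer, Set<String>> itemsByCluster(List<String[]> itemRows,
                                                    List<String[]> clusterRows) {
        // Index item ids by their feature string (several items may share one).
        Map<String, List<String>> itemsByFeatures = new HashMap<>();
        for (String[] row : itemRows) {
            itemsByFeatures.computeIfAbsent(row[1], k -> new ArrayList<>()).add(row[0]);
        }

        Map<Integer, Set<String>> result = new TreeMap<>();
        for (String[] row : clusterRows) {
            List<String> items = itemsByFeatures.get(row[1]);
            if (items == null) {
                continue; // no item has exactly this feature string
            }
            result.computeIfAbsent(Integer.valueOf(row[0]), k -> new TreeSet<>())
                  .addAll(items);
        }
        return result;
    }

    public static void main(String[] args) {
        // Illustrative rows only: the item-to-features pairing is assumed.
        List<String[]> itemRows = Arrays.asList(
                new String[] {"1223", "120:3.211"},
                new String[] {"1344", "93:5.394, 120:3.211"});
        List<String[]> clusterRows = Arrays.asList(
                new String[] {"45", "120:3.211"},
                new String[] {"35", "93:5.394, 120:3.211"});
        System.out.println(itemsByCluster(itemRows, clusterRows));
    }
}
```

Joining on the feature string works only as long as the strings are formatted
identically in both files; carrying the item id through clustering avoids the
join entirely.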
>
>
>
>
> 2011/2/16 Robin Anil <robin.anil@gmail.com>
>
> > The clustering code has a parameter that enables or disables generation
> > of the cluster-point assignments. If set, it will create a folder called
> > clusteredPoints in the output directory containing a sequence file with
> > the mappings.
> >
> > Robin
> >
> > On Tue, Feb 15, 2011 at 6:02 AM, Kidong Lee <mykidong@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > My situation is almost like '12.1 Finding similar users on Twitter' in
> > > the Mahout in Action book.
> > >
> > > My document contains a list of item ids and their contents, separated
> > > by a comma delimiter, for example this CSV file (itemId, itemContents):
> > > 1223, sports
> > > 1344, football nike
> > > ...
> > >
> > > First I converted this CSV file to a sequence file, and vectorized the
> > > sequence file with SparseVectorsFromSequenceFiles.
> > > With k-means clustering, I got the clusters. Up to this point,
> > > everything works fine.
> > >
> > > I wanted to get the list of items that belong to a cluster, but I have
> > > no idea how.
> > > I printed the entries using cluster-dumper, but there is no info
> > > about the item id.
> > >
> > > Any idea how to get the list of item ids that belong to a cluster?
> > >
> > > - Kidong.
> > >
> >
>

--001485f1e28aa7476f049cb39ccc--