From: Robin Anil
Date: Sun, 20 Feb 2011 15:15:28 +0530
Subject: Re: How to get the Id List of items which belong to a cluster.
To: user@mahout.apache.org
Cc: Kidong Lee

Hi Kidong,

Here the key is the nearest cluster id and the value is the vector. I am guessing the identifier is getting dropped somehow. This looks like a bug. Can you confirm that you created ids for the vectors you used and wrapped them in a named vector (NamedVector)?

Robin

On Wed, Feb 16, 2011 at 6:31 AM, Kidong Lee wrote:

> Thank you for your reply, Robin.
>
> I actually got the sequence file in the clusteredPoints directory like
> this:
>
> Input Path: /user/root/item-contents-sample/cluster/out/clusteredPoints/part-m-00000
> Key class: class org.apache.hadoop.io.IntWritable
> Value Class: class org.apache.mahout.clustering.WeightedVectorWritable
> Key: 45: Value: 1.0: [120:3.211]
> Key: 35: Value: 1.0: [93:5.394, 120:3.211]
> Key: 45: Value: 1.0: [120:3.211]
> Key: 35: Value: 1.0: [93:5.394, 120:3.211]
> ...
>
> The key is the cluster id, and I think the value is not a mapping to the
> item id, but a mapping of the token value in the dictionary file to the
> tf-idf weight calculated during vectorization.
>
> Since I could not find a simple API in Mahout to get the item ids in a
> cluster, I did some work for that as follows:
>
> First, I wrote a Hadoop M/R job to parse the vector sequence file and
> produce a CSV file (item-id, dic-token-value:tf-idf-weight).
> Second, I also wrote a Hadoop M/R job to parse the clustered points
> sequence file and produce a CSV file (cluster-id,
> dic-token-value:tf-idf-weight).
> In the next step, using Pig, I joined the vector CSV file and the cluster
> CSV file on dic-token-value:tf-idf-weight, grouped by cluster-id and
> item-id, and finally got the pairs of cluster-id and item-id in the
> output.
>
> - Kidong.
>
> 2011/2/16 Robin Anil
>
> > The clustering code has a parameter that enables or disables whether the
> > cluster-point assignments are generated. If set, it creates a folder
> > called clusteredPoints in the output directory containing a sequence
> > file with mappings
> >
> > Robin
> >
> > On Tue, Feb 15, 2011 at 6:02 AM, Kidong Lee wrote:
> >
> > > Hi,
> > >
> > > My situation is almost like '12.1 Finding similar users on Twitter' in
> > > the Mahout in Action book.
> > >
> > > In my document, there are lists of item ids and their contents,
> > > separated by a comma delimiter, for example this CSV file (itemId,
> > > itemContents):
> > > 1223, sports
> > > 1344, football nike
> > > ...
> > >
> > > First I converted this CSV file to a sequence file, and vectorized the
> > > sequence file with SparseVectorsFromSequenceFiles.
> > > With k-means clustering, I got the clusters. Up to this point,
> > > everything was fine.
> > >
> > > I wanted to get the list of items which belong to a cluster, but I have
> > > no idea how.
> > > I have printed the entries using cluster-dumper, but there is no info
> > > about the item id.
> > >
> > > Any idea how to get the list of item ids which belong to a cluster?
> > >
> > > - Kidong.
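
For readers following the thread, the join workaround Kidong describes (match each clustered point back to an item by its vector contents) can be sketched in a few lines. This is only an illustration in plain Python, not the Hadoop M/R and Pig jobs from the thread. The function names are hypothetical, and the pairing of item ids 1223 and 1344 with specific vectors is made up for the example, since the thread does not show which vector belongs to which item. The input lines follow the dumped "Key: ... Value: ..." format quoted above.

```python
from collections import defaultdict


def vector_key(vector_text):
    """Turn a dumped sparse vector like '[93:5.394, 120:3.211]' into a join key."""
    entries = [e.strip() for e in vector_text.strip().strip("[]").split(",") if e.strip()]
    # Sort by dictionary index so the key does not depend on entry order.
    return ",".join(sorted(entries, key=lambda e: int(e.split(":")[0])))


def parse_clustered_line(line):
    """Parse 'Key: 45: Value: 1.0: [120:3.211]' into (cluster_id, join_key)."""
    key_part, value_part = line.split(": Value: ", 1)
    cluster_id = int(key_part.replace("Key:", "").strip())
    _weight, vector_text = value_part.split(": ", 1)
    return cluster_id, vector_key(vector_text)


def items_by_cluster(item_vectors, clustered_lines):
    """Join item vectors with clustered points on the vector contents."""
    items_for_vector = defaultdict(list)
    for item_id, vector_text in item_vectors.items():
        items_for_vector[vector_key(vector_text)].append(item_id)

    clusters = defaultdict(set)
    for line in clustered_lines:
        cluster_id, key = parse_clustered_line(line)
        clusters[cluster_id].update(items_for_vector.get(key, []))
    return {cluster_id: sorted(ids) for cluster_id, ids in clusters.items()}


# Hypothetical item-id to vector pairing, for illustration only.
sample_items = {
    "1223": "[120:3.211]",
    "1344": "[93:5.394, 120:3.211]",
}
sample_lines = [
    "Key: 45: Value: 1.0: [120:3.211]",
    "Key: 35: Value: 1.0: [93:5.394, 120:3.211]",
]
result = items_by_cluster(sample_items, sample_lines)
print(result)
```

Joining on vector contents is fragile: it relies on the weights being printed identically on both sides. Carrying the item id with the vector, as Robin suggests with named vectors, would avoid the join altogether.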