Date: Sun, 2 Oct 2011 20:52:58 -0700
Subject: question about clustering
From: Walter Chang
To: user

Hi,

I have used Mahout to run k-means clustering on my TF-IDF vectors. I ran it from the Mahout command line, and the job appears to complete successfully:

$MAHOUT_HOME/bin/mahout kmeans -i ./tfidf-vectors -c ./initialclusters -o ./kmeans-clusters -cd 1.0 -k 3 -x 1000

Two cluster directories were generated (cluster-1 and cluster-2). When I run clusterdump on each of them, the top terms for the clusters look the same to me. Any idea why?

Also, how can I see which documents have been assigned to each cluster? Right now I can see the number of documents assigned, but not the complete list.

Most importantly, for production purposes, I assume it makes sense to always run k-means on Hadoop to generate the clustering output. But how do I consume that output at serving time? Ideally, the server would take a doc ID or a query as input and return the top documents, ranked by score, within the same cluster. How do I do this in code? Are there any good examples?

Thanks a lot,
Weide