Date: Thu, 12 Sep 2013 15:48:53 +0000 (UTC)
From: "Karthik Prakhya (JIRA)"
To: dev@mahout.apache.org
Reply-To: dev@mahout.apache.org
Subject: [jira] [Updated] (MAHOUT-1330) Unable to do K-means clustering on Reuters dataset

     [ https://issues.apache.org/jira/browse/MAHOUT-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karthik Prakhya updated MAHOUT-1330:
------------------------------------
    Attachment:     (was: mahout-core-0.8.jar)

> Unable to do K-means clustering on Reuters dataset
> --------------------------------------------------
>
>                 Key: MAHOUT-1330
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1330
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.8
>         Environment: Linux
>            Reporter: Karthik Prakhya
>             Fix For: 0.8
>
>         Attachments: df-count.txt, frequency-file.txt, MyAnalyzer.java, NewsKMeansClustering.java, NewsKMeansClustering-output.txt, 
reuters-seqfiles.zipx, test-kmeans-clustering-reuters-java-api.sh, tfidf-vectors.txt
>
>
> The attached code uses the Mahout API to run k-means clustering on the Reuters dataset, generating the initial centroids with the canopy algorithm. The parameters are exactly the same as those in the Scala example at the following link:
> http://sujitpal.blogspot.com/2012/09/learning-mahout-clustering.html
> The code compiles without error, but k-means cannot start because the initial centroids are not generated. This in turn happens because the TF-IDF vectors are not generated.
> Since this code compiles and is based on earlier Scala code that worked, this suggests a bug in the Mahout source code that may need fixing. I thought I should bring it to your attention.
> I have attached the source code, the included JAR files and the shell script (test-kmeans-clustering-reuters-java-api.sh) that compiles and runs the code. The output of the shell script is in NewsKMeansClustering-output.txt. Note that you may need to change the path to the JAR files in the shell script (see the environment variable JARPATH), depending on where you put the JARs. I have also attached the output of the clusterdump utility as .txt files for the intermediate outputs of my code, such as the TF vectors and TF-IDF vectors (see tf-vectors.txt, tfidf-vectors.txt, df-count.txt and frequency-file.txt).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira