mahout-dev mailing list archives

From "Karthik Prakhya (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (MAHOUT-1330) Unable to do K-means clustering on Reuters dataset
Date Tue, 10 Sep 2013 19:05:51 GMT

    [ https://issues.apache.org/jira/browse/MAHOUT-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13763362#comment-13763362 ]

Karthik Prakhya edited comment on MAHOUT-1330 at 9/10/13 7:05 PM:
------------------------------------------------------------------

These are the attachments mentioned in my previous comment. I am having difficulty uploading
tf-vectors.txt and the mahout-core-0.8-job.jar because they are each greater than 10 MB in
size.
                
      was (Author: kprakhya):
    These are the attachments mentioned in my previous comment. I am having difficulty uploading
tf-vectors.txt and the mahout-core-job-0.8.jar because they are each greater than 10 MB in
size.
                  
> Unable to do K-means clustering on Reuters dataset
> --------------------------------------------------
>
>                 Key: MAHOUT-1330
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1330
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.8
>         Environment: Linux
>            Reporter: Karthik Prakhya
>             Fix For: 0.8
>
>         Attachments: df-count.txt, frequency-file.txt, hadoop-core-1.1.2.jar,
>                      lucene-analyzers-common-4.3.0.jar, lucene-core-4.3.0.jar,
>                      mahout-core-0.8.jar, mahout-integration-0.8.jar, mahout-math-0.8.jar,
>                      MyAnalyzer.java, NewsKMeansClustering.java,
>                      NewsKMeansClustering-output.txt,
>                      test-kmeans-clustering-reuters-java-api.sh, tfidf-vectors.txt
>
>
> The attached code uses the Mahout API to run k-means clustering on the Reuters dataset,
> generating the initial centroids with the canopy algorithm. The parameters are exactly
> the same as those in the Scala example at the following link:
> http://sujitpal.blogspot.com/2012/09/learning-mahout-clustering.html
> The code compiles without errors, but the k-means algorithm cannot start because the
> initial centroids are not being generated. This, in turn, is because the TF-IDF vectors
> are not being generated.
> Given that this code compiles and is based on earlier Scala code that worked, there may
> be a bug in the Mahout source code that needs fixing. I thought I should bring it to
> your attention.
> I have attached the source code, the included JAR files in a zip folder, and the shell
> script (test-kmeans-clustering-reuters-java-api.sh) that compiles and runs the code. The
> output of the shell script is in NewsKMeansClustering-output.txt. Please note that you
> may need to change the path to the JAR files in the shell script (see the environment
> variable JARPATH), depending on where you put the JARs. I have also attached, as .txt
> files, the output of the clusterdump utility for the intermediate outputs of my code,
> such as the TF vectors and TF-IDF vectors (see tf-vectors.txt, tfidf-vectors.txt,
> df-count.txt and frequency-file.txt).
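
The dependency described in the report (no TF-IDF vectors, hence no canopy centroids, hence
no starting point for k-means) can be illustrated with a minimal sketch in plain Java. This
is not the attached NewsKMeansClustering.java and it does not use the Mahout API. The class
name CanopySketch, the method names, the Euclidean distance, and the single threshold t2 are
all assumptions made for illustration, and the canopy pass is simplified (it keeps the first
point of each canopy as its center instead of averaging).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: a simplified canopy pass in plain Java.
public class CanopySketch {

    // Euclidean distance between two dense vectors of equal length (assumed measure).
    public static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Take the next remaining point as a canopy center, then drop every
    // remaining point closer than t2 to it. Repeat until no points remain.
    public static List<double[]> canopyCenters(List<double[]> points, double t2) {
        List<double[]> remaining = new ArrayList<>(points);
        List<double[]> centers = new ArrayList<>();
        while (!remaining.isEmpty()) {
            double[] center = remaining.remove(0);
            centers.add(center);
            remaining.removeIf(p -> distance(center, p) < t2);
        }
        return centers;
    }

    public static void main(String[] args) {
        // Stand-in for document vectors; real TF-IDF vectors would be sparse.
        List<double[]> vectors = List.of(
            new double[] {0.0, 0.0},
            new double[] {0.1, 0.0},
            new double[] {5.0, 5.0});
        System.out.println(canopyCenters(vectors, 1.0).size());

        // With no input vectors there are no centers, so k-means has nothing to start from.
        List<double[]> noVectors = List.of();
        System.out.println(canopyCenters(noVectors, 1.0).size());
    }
}
```

Under these assumptions, an empty vector set always yields an empty list of centers, which
matches the reported symptom: checking whether the TF-IDF output is empty before running the
canopy step would narrow down where the pipeline first fails.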

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
