Date: Mon, 14 Apr 2014 09:13:17 +0000 (UTC)
From: "Pavan Kumar N (JIRA)"
To: dev@mahout.apache.org
Subject: [jira] [Comment Edited] (MAHOUT-1450) Cleaning up clustering documentation on mahout website

    [ https://issues.apache.org/jira/browse/MAHOUT-1450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968181#comment-13968181 ]

Pavan Kumar N edited comment on MAHOUT-1450 at 4/14/14 9:13 AM:
----------------------------------------------------------------

[~ssc] Yes, I'd love to work on 1468. Let's take this discussion to 1468; give me an outline of the topics the page should have. I am closing 1450.

was (Author: pknarayan):
[~ssc] Yes, I'd love to work on 1468. Let's take this discussion to 1468; give me an outline of the topics the page should have.

> Cleaning up clustering documentation on mahout website
> -------------------------------------------------------
>
>                 Key: MAHOUT-1450
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1450
>             Project: Mahout
>          Issue Type: Documentation
>          Components: Documentation
>        Environment: This affects all mahout versions
>            Reporter: Pavan Kumar N
>              Labels: documentation, newbie
>             Fix For: 1.0
>
>
> On the canopy clustering page, the "Strategy for parallelization" section seems to have some dead links. They need to be removed or replaced with new links (if any exist). Here is the page:
> http://mahout.apache.org/users/clustering/canopy-clustering.html
> Here are the details of the dead links on the k-means clustering pages:
> On the "k-Means clustering - basics" page, in the first line of the Quickstart section, the hyperlink "Here" is dead:
> http://mahout.apache.org/users/clustering/k-means-clustering%5Equickstart-kmeans.sh.html
> In the "Strategy for parallelization" section, the hyperlink "Cluster computing and MapReduce" in the first sentence, the hyperlink "here" in the second sentence, and the hyperlink "http://www2.chass.ncsu.edu/garson/PA765/cluster.htm" in the last sentence are all dead:
> http://code.google.com/edu/content/submissions/mapreduce-minilecture/listing.html
> http://code.google.com/edu/content/submissions/mapreduce-minilecture/lec4-clustering.ppt
> http://www2.chass.ncsu.edu/garson/PA765/cluster.htm
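> Not part of the original report, but a quick way to re-verify these links from a shell (a minimal sketch assuming curl is installed; it prints the final HTTP status code for each URL listed above):
>
> for url in \
>   'http://mahout.apache.org/users/clustering/k-means-clustering%5Equickstart-kmeans.sh.html' \
>   'http://code.google.com/edu/content/submissions/mapreduce-minilecture/listing.html' \
>   'http://code.google.com/edu/content/submissions/mapreduce-minilecture/lec4-clustering.ppt' \
>   'http://www2.chass.ncsu.edu/garson/PA765/cluster.htm'
> do
>   # -s: silent, -o /dev/null: discard the body, -L: follow redirects,
>   # -w '%{http_code}': print only the final HTTP status code
>   echo "$(curl -s -o /dev/null -L -w '%{http_code}' "$url") $url"
> done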
> On the page http://mahout.apache.org/users/clustering/visualizing-sample-clusters.html, in the second sentence of the Pre-prep section, the hyperlink "setup mahout" is dead:
> http://mahout.apache.org/users/clustering/users/basics/quickstart.html
> The existing documentation is too ambiguous, so I recommend the following changes to let new users follow it as a tutorial.
> The Quickstart should be replaced with the following:
> Get the data:
> wget http://www.daviddlewis.com/resources/testcollections/reuters21578/reuters21578.tar.gz
> Place it in the examples folder under the Mahout home directory (mahout-0.7/examples/reuters):
> mkdir reuters
> cd reuters
> mkdir reuters-out
> mv reuters21578.tar.gz reuters-out
> cd reuters-out
> tar -xzvf reuters21578.tar.gz
> cd ..
> Mahout-specific commands
> #1 Run the org.apache.lucene.benchmark.utils.ExtractReuters class:
> ${MAHOUT_HOME}/bin/mahout org.apache.lucene.benchmark.utils.ExtractReuters reuters-out reuters-text
> #2 Copy the extracted files to your HDFS:
> bin/hadoop fs -copyFromLocal /home/bigdata/mahout-distribution-0.7/examples/reuters-text hdfs://localhost:54310/user/bigdata/
> #3 Generate a sequence file:
> mahout seqdirectory -i hdfs://localhost:54310/user/bigdata/reuters-text -o hdfs://localhost:54310/user/bigdata/reuters-seqfiles -c UTF-8 -chunk 5
> -chunk → the chunk size in MB (here 5 MB)
> -c UTF-8 → the character encoding of the input files
> #4 Check the generated sequence file:
> mahout-0.7$ ./bin/mahout seqdumper -i /your-hdfs-path-to/reuters-seqfiles/chunk-0 | less
> #5 Generate vectors from the sequence file:
> mahout seq2sparse -i hdfs://localhost:54310/user/bigdata/reuters-seqfiles -o hdfs://localhost:54310/user/bigdata/reuters-vectors -ow
> -ow → overwrite the output directory if it already exists
> #6 List the output directory; it should contain these 7 items:
> bin/hadoop fs -ls reuters-vectors
> reuters-vectors/df-count
> reuters-vectors/dictionary.file-0
> reuters-vectors/frequency.file-0
> reuters-vectors/tf-vectors
> reuters-vectors/tfidf-vectors
> reuters-vectors/tokenized-documents
> reuters-vectors/wordcount
> #7 Check the vectors in reuters-vectors/tf-vectors/part-r-00000:
> mahout-0.7$ hadoop fs -ls reuters-vectors/tf-vectors
> #8 Run canopy clustering to get good initial centroids for k-means:
> mahout canopy -i hdfs://localhost:54310/user/bigdata/reuters-vectors/tf-vectors -o hdfs://localhost:54310/user/bigdata/reuters-canopy-centroids -dm org.apache.mahout.common.distance.CosineDistanceMeasure -t1 2000 -t2 1500
> -dm → the distance measure to use while clustering (here, cosine distance)
> -t1, -t2 → the canopy thresholds (T1 must be greater than T2)
> #9 Run the k-means clustering algorithm:
> mahout kmeans -i hdfs://localhost:54310/user/bigdata/reuters-vectors/tfidf-vectors -c hdfs://localhost:54310/user/bigdata/reuters-canopy-centroids -o hdfs://localhost:54310/user/bigdata/reuters-kmeans-clusters -cd 0.1 -ow -x 20 -k 10
> -i → input
> -o → output
> -c → initial centroids for k-means (omitting this parameter makes k-means generate random initial centroids)
> -cd → the convergence delta
> -ow → overwrite
> -x → the maximum number of k-means iterations
> -k → the number of clusters
> #10 Export the k-means output using the cluster dumper tool:
> mahout clusterdump -dt sequencefile -d hdfs://localhost:54310/user/bigdata/reuters-vectors/dictionary.file-* -i hdfs://localhost:54310/user/bigdata/reuters-kmeans-clusters/clusters-8-final -o clusters.txt -b 15
> -dt → the dictionary type
> -b → the maximum number of characters to print for each cluster's string representation
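> A quick sanity check on the dump (a minimal sketch, not part of the original steps; it assumes clusterdump wrote clusters.txt to the local working directory and that, as in Mahout 0.7, converged clusters are printed with a VL- prefix and unconverged ones with CL-):
>
> head -n 40 clusters.txt        # eyeball the first cluster and its top terms
> grep -c 'VL-' clusters.txt     # rough count of converged clusters
> grep -c 'CL-' clusters.txt     # rough count of clusters that did not converge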
> Mahout 0.7 did have some problems with the DisplayKMeans module, which should ideally display the clusters in a 2D graph, but it gave me the same output for different input datasets. I was using a dataset of recent news items crawled from various websites.

--
This message was sent by Atlassian JIRA
(v6.2#6252)