hama-dev mailing list archives

From "Martin Illecker (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HAMA-834) Fix KMeans example
Date Mon, 06 Jan 2014 13:15:55 GMT

     [ https://issues.apache.org/jira/browse/HAMA-834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Martin Illecker updated HAMA-834:
---------------------------------

    Attachment: HAMA-834_v03.patch

I have tested the KMeans example running in Pseudo Distributed Mode using text input:
{code}
% hadoop fs -rmr /tmp/kmeans

% hama jar hama-examples-0.7.0-SNAPSHOT.jar kmeans /tmp/kmeans/in/input.txt /tmp/kmeans/out 10 1
Cannot read text input file: /tmp/kmeans/in/input.txt

% echo -e "vec1\t0\t0\nvec2\t1\t1\nvec3\t2\t2\nvec4\t3\t3\nvec5\t4\t4\n\
vec6\t5\t5\nvec7\t6\t6\nvec8\t7\t7\nvec9\t8\t8\nvec10\t9\t9\nvec11\t10\t10" > input.txt \
&& hadoop fs -put input.txt /tmp/kmeans/in/input.txt \
&& rm input.txt

% hama jar hama-examples-0.7.0-SNAPSHOT.jar kmeans /tmp/kmeans/in/input.txt /tmp/kmeans/out 10 1

% hama seqdumper -file /tmp/kmeans/in/textinput/in.seq
Input Path: /tmp/kmeans/in/textinput/in.seq
Key class: class org.apache.hama.commons.io.VectorWritable Value Class: class org.apache.hadoop.io.NullWritable
Key: vec1: [0.0, 0.0]: Value: (null)
Key: vec2: [1.0, 1.0]: Value: (null)
Key: vec3: [2.0, 2.0]: Value: (null)
Key: vec4: [3.0, 3.0]: Value: (null)
Key: vec5: [4.0, 4.0]: Value: (null)
Key: vec6: [5.0, 5.0]: Value: (null)
Key: vec7: [6.0, 6.0]: Value: (null)
Key: vec8: [7.0, 7.0]: Value: (null)
Key: vec9: [8.0, 8.0]: Value: (null)
Key: vec10: [9.0, 9.0]: Value: (null)
Key: vec11: [10.0, 10.0]: Value: (null)
Count: 11

% hama seqdumper -file /tmp/kmeans/out/center/center_output.seq
Input Path: /tmp/kmeans/out/center/center_output.seq
Key class: class org.apache.hama.commons.io.VectorWritable Value Class: class org.apache.hadoop.io.NullWritable
Key: [5.0, 5.0]: Value: (null)
Count: 1
{code}
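
As a sanity check on the output above: with a single center, k-means should converge to the component-wise mean of all input vectors. The sketch below is not part of the patch; it assumes the trailing `1` on the command line is the number of centers (as the single output center suggests), and the class name `SingleCenterCheck` is made up for illustration.

```java
import java.util.Arrays;

public class SingleCenterCheck {
    /** Component-wise mean of a non-empty set of equal-length vectors. */
    static double[] mean(double[][] vectors) {
        double[] sum = new double[vectors[0].length];
        for (double[] v : vectors) {
            for (int d = 0; d < sum.length; d++) {
                sum[d] += v[d];
            }
        }
        for (int d = 0; d < sum.length; d++) {
            sum[d] /= vectors.length;
        }
        return sum;
    }

    public static void main(String[] args) {
        // vec1..vec11 from input.txt: (0,0), (1,1), ..., (10,10)
        double[][] vectors = new double[11][];
        for (int i = 0; i <= 10; i++) {
            vectors[i] = new double[] { i, i };
        }
        System.out.println(Arrays.toString(mean(vectors))); // prints [5.0, 5.0]
    }
}
```

The result matches the center reported in center_output.seq.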

Using the random input generator:
{code}
% hadoop fs -rmr /tmp/kmeans

% hama jar hama-examples-0.7.0-SNAPSHOT.jar kmeans /tmp/kmeans/in /tmp/kmeans/out 10 1 -g 1000 3

% hama seqdumper -file /tmp/kmeans/out/center/center_output.seq
Input Path: /tmp/kmeans/out/center/center_output.seq
Key class: class org.apache.hama.commons.io.VectorWritable Value Class: class org.apache.hadoop.io.NullWritable
Key: [510.572, 500.618, 505.639]: Value: (null)
Count: 1
{code}

And running in Local Mode:
{code}
% rm -r /tmp/kmeans

% hama jar hama-examples-0.7.0-SNAPSHOT.jar kmeans /tmp/kmeans/in/input.txt /tmp/kmeans/out 10 1
Cannot read text input file: /tmp/kmeans/in/input.txt

% mkdir /tmp/kmeans && mkdir /tmp/kmeans/in && echo -e "vec1\t0\t0\nvec2\t1\t1\nvec3\t2\t2\nvec4\t3\t3\nvec5\t4\t4\n\
vec6\t5\t5\nvec7\t6\t6\nvec8\t7\t7\nvec9\t8\t8\nvec10\t9\t9\nvec11\t10\t10" > /tmp/kmeans/in/input.txt

% hama jar hama-examples-0.7.0-SNAPSHOT.jar kmeans /tmp/kmeans/in/input.txt /tmp/kmeans/out 10 1

% hama seqdumper -file file:///tmp/kmeans/in/textinput/in.seq
Input Path: file:/tmp/kmeans/in/textinput/in.seq
Key class: class org.apache.hama.commons.io.VectorWritable Value Class: class org.apache.hadoop.io.NullWritable
Key: vec1: [0.0, 0.0]: Value: (null)
Key: vec2: [1.0, 1.0]: Value: (null)
Key: vec3: [2.0, 2.0]: Value: (null)
Key: vec4: [3.0, 3.0]: Value: (null)
Key: vec5: [4.0, 4.0]: Value: (null)
Key: vec6: [5.0, 5.0]: Value: (null)
Key: vec7: [6.0, 6.0]: Value: (null)
Key: vec8: [7.0, 7.0]: Value: (null)
Key: vec9: [8.0, 8.0]: Value: (null)
Key: vec10: [9.0, 9.0]: Value: (null)
Key: vec11: [10.0, 10.0]: Value: (null)
Count: 11

% hama seqdumper -file file:///tmp/kmeans/out/center/center_output.seq
Input Path: file:/tmp/kmeans/out/center/center_output.seq
Key class: class org.apache.hama.commons.io.VectorWritable Value Class: class org.apache.hadoop.io.NullWritable
Key: [5.0, 5.0]: Value: (null)
Count: 1

% hama jar hama/hama-examples-0.7.0-SNAPSHOT.jar kmeans /tmp/kmeans/in /tmp/kmeans/out 10 1 -g 1000 3

% hama seqdumper -file file:///tmp/kmeans/out/center/center_output.seq
Input Path: file:/tmp/kmeans/out/center/center_output.seq
Key class: class org.apache.hama.commons.io.VectorWritable Value Class: class org.apache.hadoop.io.NullWritable
Key: [498.322, 499.2, 499.744]: Value: (null)
Count: 1
{code}

Maybe we should add the KMeans example to \[1] and reference \[2] there.
Finally, I will commit HAMA-834_v03.patch.

\[1] http://hama.apache.org/run_examples.html
\[2] https://blogs.apache.org/hama/entry/running_hama_k_means_in

> Fix KMeans example
> ------------------
>
>                 Key: HAMA-834
>                 URL: https://issues.apache.org/jira/browse/HAMA-834
>             Project: Hama
>          Issue Type: Bug
>          Components: examples, machine learning
>    Affects Versions: 0.6.3
>            Reporter: Martin Illecker
>              Labels: example
>             Fix For: 0.7.0
>
>         Attachments: HAMA-834.patch, HAMA-834_v02.patch, HAMA-834_v03.patch
>
>
> Fix problems in KMeans example and revise test case.
> 1) Typo \[1] and input path issue
> 2) Wrong *summationCount* in assignCentersInternal
> *summationCount* should also be incremented if \[2] 
> {code}
> if (clusterCenter == null) {
>   newCenterArray[lowestDistantCenter] = key;
> }
> {code}
> Otherwise *summationCount* may stay zero when only one value is assigned. This zero is then propagated to *incrementSum* \[3] and might cause a divide by zero in \[4].
> Also, if we add three vectors but *summationCount* is only two, the results will be wrong, because later we divide the vector by the number of increments.
> 3) Results depend on the number of BSP tasks, *numBspTask*
> (results vary if *numBspTask* is changed)
> \[1] https://github.com/apache/hama/blob/trunk/ml/src/main/java/org/apache/hama/ml/kmeans/KMeansBSP.java#L518-519
> \[2] https://github.com/apache/hama/blob/trunk/ml/src/main/java/org/apache/hama/ml/kmeans/KMeansBSP.java#L249
> \[3] https://github.com/apache/hama/blob/trunk/ml/src/main/java/org/apache/hama/ml/kmeans/KMeansBSP.java#L161
> \[4] https://github.com/apache/hama/blob/trunk/ml/src/main/java/org/apache/hama/ml/kmeans/KMeansBSP.java#L172
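
The effect of the missing increment described in point 2 can be reproduced in isolation. The following is only a simplified sketch, not the KMeansBSP code: the class `SummationCountSketch`, the method `average` and the flag `countFirst` are hypothetical names, and the accumulation is reduced to a single center. With `countFirst` set to false, the first vector initialises the sum without being counted, which is the behaviour the issue describes.

```java
import java.util.Arrays;

public class SummationCountSketch {
    /**
     * Averages the vectors assigned to one center. If countFirst is false,
     * the first vector is stored without incrementing summationCount.
     */
    static double[] average(double[][] assigned, boolean countFirst) {
        double[] sum = null;
        int summationCount = 0;
        for (double[] v : assigned) {
            if (sum == null) {
                sum = v.clone();          // first vector becomes the running sum
                if (countFirst) {
                    summationCount++;     // the increment the issue asks for
                }
            } else {
                for (int d = 0; d < sum.length; d++) {
                    sum[d] += v[d];
                }
                summationCount++;
            }
        }
        // Double division by zero would silently give Infinity or NaN in Java,
        // so this sketch fails loudly instead.
        if (summationCount == 0) {
            throw new ArithmeticException("summationCount is zero");
        }
        double[] mean = new double[sum.length];
        for (int d = 0; d < sum.length; d++) {
            mean[d] = sum[d] / summationCount;
        }
        return mean;
    }

    public static void main(String[] args) {
        double[][] three = { { 1, 1 }, { 2, 2 }, { 3, 3 } };
        System.out.println(Arrays.toString(average(three, false))); // prints [3.0, 3.0]
        System.out.println(Arrays.toString(average(three, true)));  // prints [2.0, 2.0]
    }
}
```

Three vectors summing to 6 are divided by 2 instead of 3 in the uncounted case, and a center with only one assigned vector ends up with a count of zero.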



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
