Mailing-List: contact dev-help@hama.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@hama.apache.org
Date: Tue, 20 May 2014 09:48:38 +0000 (UTC)
From: "Ikhtiyor Ahmedov (JIRA)" <jira@apache.org>
To: dev@hama.apache.org
Message-ID: <JIRA.12715178.1400507319589.10002.1400579318344@arcas>
In-Reply-To: <JIRA.12715178.1400507319589@arcas>
References: <JIRA.12715178.1400507319589@arcas>
Subject: [jira] [Commented] (HAMA-904) Fix Collaborative Filtering Example
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/HAMA-904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003022#comment-14003022 ] 

Ikhtiyor Ahmedov commented on HAMA-904:
---------------------------------------

Thanks for the feedback. I don't remember everything exactly, but let me try to answer some of the questions.
My assumptions may no longer be correct compared to the time when I implemented it.

1) Why are user and item features broadcasted [2] by each peer and not shared by a global input file on HDFS? (possible performance increase?)
- Broadcasting happens only once, when the BSP job starts, so even if we fix it somehow the performance increase may not be huge.
- We don't have multiple input formats, which is why I had to do the partitioning in a BSP superstep.
- Let's say we have 10 GB of user and item features (maybe not realistic) and 10 peers. If we use a global file, every peer will probably
try to read all features (if I understand your statement "global input file on HDFS" correctly), which means
a) 10 GB * 10 = 100 GB of network traffic
b) filtering logic over 10 GB of data in each peer
c) memory overhead
Also, I don't know an exact way to fetch only the user/item features we are interested in. I probably did it in a superstep because the 10 GB is split by the
partitioner and sent to the peers (10 GB of traffic in total), and then each peer asks only for the features it is interested in, which I guess needs less bandwidth overall.
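
A rough sketch of the estimate above, not code from OnlineTrainBSP. The class name, the method names and the requestedFraction parameter are hypothetical, and HDFS block locality and replication are ignored.

```java
final class TrafficEstimateSketch {

    /** Global file: every peer reads the whole feature set. */
    static long globalFileTrafficGb(long featureSizeGb, int peers) {
        return featureSizeGb * peers;
    }

    /**
     * Partitioned in a superstep: the features are shipped once, then each
     * peer requests only a fraction of them (requestedFraction is an
     * assumed value between 0 and 1, not a measured one).
     */
    static double partitionedTrafficGb(long featureSizeGb, int peers,
                                       double requestedFraction) {
        double partitioning = featureSizeGb;
        double lookups = peers * requestedFraction * featureSizeGb;
        return partitioning + lookups;
    }

    public static void main(String[] args) {
        System.out.println(globalFileTrafficGb(10, 10));
        System.out.println(partitionedTrafficGb(10, 10, 0.1));
    }
}
```

Under this simple model with 10 peers, the partitioned variant moves less data as long as each peer requests less than 90% of all features.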

2) Use o.a.h.c.u.KeyValuePair instead of commons.math3.util.Pair in UserSimilarity [3] and ItemSimilarity [4]
That is probably a matter of taste and not a big deal. I guess at the time I thought KeyValuePair should be used
for things that represent a key/value relation, usually hash- or search-related. Sure, userId/itemId can also be considered a key,
but Pair treats both elements as equals: if you want the userId, get it from pair.first; if you want the score, get it from pair.second.
It's not a big deal and could be changed easily.

3) No need for normalizeWithBroadcastingValues [5] when taskNum = 1
That's true :) Agreed.

4) Why are values sent to itself and received later? [6] (possible performance increase?)
Yes, that is a possible performance increase. Agreed.
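
A minimal sketch of how the self-send could be avoided: a value addressed to the own peer is applied directly instead of going through the message queue. All names here are hypothetical. In a Hama BSP the callbacks would presumably wrap the peer's local state and BSPPeer.send(peerName, msg), with BSPPeer.getPeerName() supplying selfPeer, but that mapping is an assumption and not taken from OnlineTrainBSP.

```java
import java.util.function.BiConsumer;
import java.util.function.Consumer;

final class LocalDeliverySketch {

    /**
     * Applies the value locally when the target is this peer, otherwise
     * hands it to the remote sender. Returns true if a message was sent.
     */
    static <M> boolean deliver(String selfPeer, String targetPeer, M value,
                               Consumer<M> applyLocally,
                               BiConsumer<String, M> sendRemote) {
        if (selfPeer.equals(targetPeer)) {
            applyLocally.accept(value);   // no network round trip
            return false;
        }
        sendRemote.accept(targetPeer, value);
        return true;
    }
}
```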

5) Why is every task saving all items? [7] (duplicate saves?)
Yes, agreed, it probably needs some logic to save only distinct items.
I guess I skipped that logic intentionally, because when you load
the items for further filtering you use the Recommender, which
filters out duplicate items.
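
One way the "save distinct items" logic could look, as a sketch only: each item is assigned to exactly one task by its ID, and a task saves only the items it owns. The modulo ownership rule and all names are assumptions, not the existing save code.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class ItemOwnershipSketch {

    /** Assumed rule: an item belongs to the task whose index equals itemId mod numTasks. */
    static boolean ownsItem(long itemId, int taskIndex, int numTasks) {
        return Math.floorMod(itemId, (long) numTasks) == taskIndex;
    }

    /** Returns the items this task should save, without duplicates, in first-seen order. */
    static List<Long> itemsToSave(Collection<Long> localItems,
                                  int taskIndex, int numTasks) {
        Set<Long> distinct = new LinkedHashSet<>();
        for (long itemId : localItems) {
            if (ownsItem(itemId, taskIndex, numTasks)) {
                distinct.add(itemId);
            }
        }
        return new ArrayList<>(distinct);
    }
}
```

With this rule every item is written by exactly one task, so the Recommender would no longer have to filter duplicates at load time.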

6) Why is the default ALPHA value 0.01 and not 0.001? [8] (if (matrixRank > 250) -> Infinite / NaN)
That is due to my lack of ML knowledge. I just picked a value that worked for me, which is bad for sure.
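
One possible explanation for the Infinite / NaN values, shown on a simplified model rather than the actual code in MeanAbsError.java: if only the user vector u is updated with u += ALPHA * err * v, the error after one step becomes err * (1 - ALPHA * |v|^2). With item features of magnitude about 1, |v|^2 equals the matrix rank, so the error grows whenever ALPHA * rank > 2, that is for rank > 200 with ALPHA = 0.01 but only for rank > 2000 with ALPHA = 0.001. The names and the all-ones item vector below are assumptions.

```java
import java.util.Arrays;

final class StepSizeSketch {

    /**
     * Repeats the update u += alpha * err * v for a single rating with a
     * fixed item vector v and returns |err| as computed in the last step.
     */
    static double finalAbsError(int rank, double alpha, int steps) {
        double[] u = new double[rank];      // user features, start at 0
        double[] v = new double[rank];
        Arrays.fill(v, 1.0);                // assumption: item features of magnitude 1
        double rating = 1.0;
        double err = 0.0;
        for (int s = 0; s < steps; s++) {
            double prediction = 0.0;
            for (int k = 0; k < rank; k++) {
                prediction += u[k] * v[k];
            }
            err = rating - prediction;
            for (int k = 0; k < rank; k++) {
                u[k] += alpha * err * v[k];
            }
        }
        return Math.abs(err);
    }

    public static void main(String[] args) {
        // rank 300: alpha = 0.01 overshoots, alpha = 0.001 contracts the error
        System.out.println(finalAbsError(300, 0.01, 2000));
        System.out.println(finalAbsError(300, 0.001, 2000));
    }
}
```

In this model the first call ends in a non-finite value while the second one converges, which is at least consistent with the threshold near rank 250 reported in the issue.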

I hope you can give further feedback on these issues; I will try to fix them. First I need to build and prepare
my development environment, which I will probably do on the weekend.
Thanks.

> Fix Collaborative Filtering Example
> -----------------------------------
>
>                 Key: HAMA-904
>                 URL: https://issues.apache.org/jira/browse/HAMA-904
>             Project: Hama
>          Issue Type: Bug
>          Components: examples, machine learning
>    Affects Versions: 0.6.4
>            Reporter: Martin Illecker
>            Priority: Minor
>              Labels: collaborative-filtering, examples, machine_learning
>             Fix For: 0.7.0
>
>
> *Fix Collaborative Filtering Example and revise test case.*
> I had a deep look into the collaborative filtering example of Ikhtiyor Ahmedov \[1] and found the following questions / problems:
>  - Why are user and item features broadcasted \[2] by each peer and not shared by a global input file on HDFS? (possible performance increase?)
>  - Use o.a.h.c.u.KeyValuePair instead of commons.math3.util.Pair in UserSimilarity \[3] and ItemSimilarity \[4]
>  - No need for normalizeWithBroadcastingValues \[5] when taskNum = 1
>  - Why are values sent to itself and received later? \[6] (possible performance increase?)
>  - Why is every task saving all items? \[7] (duplicate saves?)
>  - Why is the default ALPHA value 0.01 and not 0.001? \[8] (if (matrixRank > 250) -> Infinite / NaN)
> I hope Ikhtiyor Ahmedov will finally become a committer and helps us to solve these questions.
> Thanks!
> \[1] https://issues.apache.org/jira/browse/HAMA-612
> \[2] https://github.com/apache/hama/blob/trunk/ml/src/main/java/org/apache/hama/ml/recommendation/cf/OnlineTrainBSP.java#L116-128
> \[3] https://github.com/apache/hama/blob/trunk/ml/src/main/java/org/apache/hama/ml/recommendation/UserSimilarity.java#L22
> \[4] https://github.com/apache/hama/blob/trunk/ml/src/main/java/org/apache/hama/ml/recommendation/ItemSimilarity.java#L22
> \[5] https://github.com/apache/hama/blob/trunk/ml/src/main/java/org/apache/hama/ml/recommendation/cf/OnlineTrainBSP.java#L138
> \[6] https://github.com/apache/hama/blob/trunk/ml/src/main/java/org/apache/hama/ml/recommendation/cf/OnlineTrainBSP.java#L323
> \[7] https://github.com/apache/hama/blob/trunk/ml/src/main/java/org/apache/hama/ml/recommendation/cf/OnlineTrainBSP.java#L387-422
> \[8] https://github.com/apache/hama/blob/trunk/ml/src/main/java/org/apache/hama/ml/recommendation/cf/function/MeanAbsError.java#L62


--
This message was sent by Atlassian JIRA
(v6.2#6252)