hama-dev mailing list archives

From "Martin Illecker (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HAMA-904) Fix Collaborative Filtering Example
Date Tue, 20 May 2014 12:42:40 GMT

    [ https://issues.apache.org/jira/browse/HAMA-904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003167#comment-14003167 ]

Martin Illecker commented on HAMA-904:
--------------------------------------

Thanks for your fast response!

{quote}
1) Why are user and item features broadcasted \[2] by each peer and not shared by a global
input file on HDFS? (possible performance increase?)
    - Broadcasting happens only once, when the BSP job starts (maybe not a huge performance
increase, even if we fix it somehow).
    - We don't have multiple input formats; that's why I needed to do the partitioning in a BSP superstep.
    - Let's say we have 10 GB of user and item features (maybe not realistic) and 10 peers.
If we use a global file, every peer will probably try to read all features (if I understand
your statement "global input file on HDFS" correctly), which means:
    a) 10 GB * 10 of network traffic
    b) filtering logic over 10 GB of data in each peer
    c) memory overhead
    And I don't know an exact way to get only the user/item features we are interested in. I probably
did it in a superstep because the 10 GB will be split by the partitioner and sent to the peers
(10 GB of traffic total), and then each peer will ask only for the features it is interested in,
which I guess is less bandwidth in general.
{quote}
I don't know if it is the optimal way, but I agree with you that broadcasting the user and item
features might cause less network traffic than sharing a global file.
But I have to mention that you combine the user ("u") and item ("i") features with
the user/item ratings (preferences "p") in one input file. If the user and item features were
not in the same input file, we could partition the input by the user or item id of the preference.
These are just some thoughts for further improvements.
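To illustrate that partitioning idea, here is a minimal sketch (the class and the *partitionFor* helper are hypothetical, not part of the Hama API; it only shows how a preference could be routed to the peer owning the matching user features, avoiding a full broadcast):

```java
// Minimal sketch of partitioning preferences by user id, assuming each
// peer owns the feature vectors of the user ids that hash to it.
// partitionFor() is a hypothetical helper, not part of the Hama API.
public class PreferencePartitioner {

  // Map a user id to one of numPeers peers, e.g. for message routing
  // inside a BSP superstep.
  public static int partitionFor(long userId, int numPeers) {
    return (int) (Math.abs(userId) % numPeers);
  }

  public static void main(String[] args) {
    // With 10 peers, user 42's preferences and features both land on
    // peer 2, so no peer needs to scan the full 10 GB feature file.
    System.out.println(partitionFor(42L, 10)); // prints 2
  }
}
```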

{quote}
2) Use o.a.h.c.u.KeyValuePair instead of commons.math3.util.Pair in UserSimilarity \[3] and
ItemSimilarity \[4]
That's probably a matter of taste, not a big deal, but I guess at the time I thought KeyValuePair
should be used for things that represent a key/value relation, usually hash or search related
things. Sure, userId/itemId can also be considered a key, but Pair makes the two sides equal
in terms of meaning: if you want the userId, get it from pair.first; if you want the score,
get it from pair.second.
It's not a big deal and could be changed easily.
{quote}
I would only suggest using our own *o.a.h.c.u.KeyValuePair* class instead of the *commons.math3.util*
package.
It is just a suggestion to remove the external dependency, but if you prefer the terminology of a
Pair, then there is no problem.
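To make the terminology difference concrete, here is a self-contained sketch; *MiniKeyValuePair* is a stand-in written out here so the example compiles on its own, and its method names are illustrative, not the exact *o.a.h.c.u.KeyValuePair* API:

```java
// Sketch contrasting the two styles. MiniKeyValuePair is a stand-in for
// o.a.h.c.u.KeyValuePair, written out here so the example is self-contained.
public class PairStyles {

  static class MiniKeyValuePair<K, V> {
    private final K key;
    private final V value;
    MiniKeyValuePair(K key, V value) { this.key = key; this.value = value; }
    K getKey() { return key; }
    V getValue() { return value; }
  }

  public static void main(String[] args) {
    // Key/value style: the user id is clearly the key, the score the value.
    MiniKeyValuePair<Long, Double> kv = new MiniKeyValuePair<>(42L, 0.9);
    System.out.println(kv.getKey() + " -> " + kv.getValue());
    // With commons.math3.util.Pair the two sides are symmetric instead:
    // pair.getFirst() holds the user id and pair.getSecond() the score,
    // without stating which one is the lookup key.
  }
}
```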

{quote}
3) No need for normalizeWithBroadcastingValues \[5] when taskNum = 1
That's true, agreed.
4) Why are values sent to itself and received later? \[6] (possible performance increase?)
Yes, a possible performance increase. Agreed.
{quote}
Based on your implementation, I built an easier and smaller one without user and item feature
support.
You can have a look at \[1] for possible improvements within the *normalizeWithBroadcastingValues*
method.

{quote}
6) Why is the default ALPHA value 0.01 and not 0.001? \[8] (if (matrixRank > 250) ->
Infinite / NaN)
Lack of knowledge in ML on my side. I just picked a value that worked for me, which is bad
for sure.
{quote}
An easy solution would be to make this TETTA / ALPHA constant configurable.
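A minimal sketch of what that could look like (the property name "onlinecf.alpha" and the 0.001 default are made up for illustration, and java.util.Properties stands in for the real Hama/Hadoop Configuration the job would actually read):

```java
import java.util.Properties;

// Sketch of making the learning-rate constant configurable instead of
// hard-coded. The property name "onlinecf.alpha" and the 0.001 default
// are illustrative, not an existing Hama setting.
public class AlphaConfig {

  public static double getAlpha(Properties conf) {
    return Double.parseDouble(conf.getProperty("onlinecf.alpha", "0.001"));
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    System.out.println(getAlpha(conf));       // falls back to the default
    conf.setProperty("onlinecf.alpha", "0.01");
    System.out.println(getAlpha(conf));       // user-supplied value wins
  }
}
```

With a large matrixRank, a user could then lower ALPHA on the command line instead of hitting the Infinite / NaN case with the hard-coded value.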

Thanks for your time!

\[1] https://github.com/millecker/applications/blob/master/hama/hybrid/onlinecf/src/at/illecker/hama/hybrid/examples/onlinecf/OnlineCFTrainHybridBSP.java#L369-456

> Fix Collaborative Filtering Example
> -----------------------------------
>
>                 Key: HAMA-904
>                 URL: https://issues.apache.org/jira/browse/HAMA-904
>             Project: Hama
>          Issue Type: Bug
>          Components: examples, machine learning
>    Affects Versions: 0.6.4
>            Reporter: Martin Illecker
>            Priority: Minor
>              Labels: collaborative-filtering, examples, machine_learning
>             Fix For: 0.7.0
>
>
> *Fix Collaborative Filtering Example and revise test case.*
> I had a deep look into the collaborative filtering example of Ikhtiyor Ahmedov \[1] and
found the following questions / problems:
>  - Why are user and item features broadcasted \[2] by each peer and not shared by a global
input file on HDFS? (possible performance increase?)
>  - Use o.a.h.c.u.KeyValuePair instead of commons.math3.util.Pair in UserSimilarity \[3]
and ItemSimilarity \[4]
>  - No need for normalizeWithBroadcastingValues \[5] when taskNum = 1
>  - Why are values sent to itself and received later? \[6] (possible performance increase?)
>  - Why is every task saving all items? \[7] (duplicate saves?)
>  - Why is the default ALPHA value 0.01 and not 0.001? \[8] (if (matrixRank > 250)
-> Infinite / NaN)
> I hope Ikhtiyor Ahmedov will finally become a committer and helps us to solve these questions.
> Thanks!
> \[1] https://issues.apache.org/jira/browse/HAMA-612
> \[2] https://github.com/apache/hama/blob/trunk/ml/src/main/java/org/apache/hama/ml/recommendation/cf/OnlineTrainBSP.java#L116-128
> \[3] https://github.com/apache/hama/blob/trunk/ml/src/main/java/org/apache/hama/ml/recommendation/UserSimilarity.java#L22
> \[4] https://github.com/apache/hama/blob/trunk/ml/src/main/java/org/apache/hama/ml/recommendation/ItemSimilarity.java#L22
> \[5] https://github.com/apache/hama/blob/trunk/ml/src/main/java/org/apache/hama/ml/recommendation/cf/OnlineTrainBSP.java#L138
> \[6] https://github.com/apache/hama/blob/trunk/ml/src/main/java/org/apache/hama/ml/recommendation/cf/OnlineTrainBSP.java#L323
> \[7] https://github.com/apache/hama/blob/trunk/ml/src/main/java/org/apache/hama/ml/recommendation/cf/OnlineTrainBSP.java#L387-422
> \[8] https://github.com/apache/hama/blob/trunk/ml/src/main/java/org/apache/hama/ml/recommendation/cf/function/MeanAbsError.java#L62



--
This message was sent by Atlassian JIRA
(v6.2#6252)
