mahout-dev mailing list archives

From "Robin Anil (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAHOUT-384) Implementation of AVF algorithm
Date Thu, 22 Apr 2010 07:25:49 GMT

    [ https://issues.apache.org/jira/browse/MAHOUT-384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859693#action_12859693 ]

Robin Anil commented on MAHOUT-384:
-----------------------------------

Hi Tony. Nice work on the patch. But before we commit this, there are a couple of things you
need to cover. I still have to read the algorithm in detail to know what's happening, but I
have some queries and suggestions below which serve as a kind of checklist for making this a
committable patch.

1) I am not a fan of Text-based input, even though it is what most of the algorithms in Mahout
were first implemented with. The idea of splitting and joining text files on commas is not
very clean. Can you convert this to deal with a SequenceFile of VectorWritable OR some other
Writable format? What's your input schema? (There is a rough sketch of what I mean after this list.)
2) There is a code style we enforce in Mahout. You can run mvn checkstyle:checkstyle to
see the violations. We also have an Eclipse formatter which formats code to almost match
the checkstyle (rare manual interventions are still required). Take a look at https://cwiki.apache.org/MAHOUT/howtocontribute.html;
you will find the Eclipse formatter file at the bottom.
3) For parsing args, use the Apache commons-cli2 library. Take a look at o/a/m/clustering/kmeans/KMeansDriver
to see usage (there is a sketch after this list as well).
4) What is Utils being used for?
5) Regarding this part of the patch:

    @Override
    public void setup(Context context) throws IOException, InterruptedException {
        String filePath = context.getConfiguration().get("a");
        sumAttribute = Utils.readFile(filePath + "/part-r-00000");
    }

Please use the distributed cache to read the file in a map/reduce context. See the DictionaryVectorizer
Map/Reduce classes for usage; there is also a rough sketch after this list.
6) job.setNumReduceTasks(1); -- is this necessary? Doesn't it hurt the scalability of this algorithm?
Is the single reducer going to get a lot of data from the mappers? If yes, then you should
think about removing this constraint and letting it use the Hadoop parameters, or parameterize it.
7) Can this job be optimised using a Combiner? If yes, it's really worth spending time to write
one (see the combiner sketch after this list).
8) Tests! :)
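
Here are the rough sketches I mentioned above. They are untested outlines with placeholder class
names, paths and sample data, only meant to show the pattern, so adapt them to your actual schema
and driver.

For 1), writing the input as a SequenceFile of VectorWritable instead of comma-separated text
could look roughly like this (the output path and the sample record are placeholders):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.VectorWritable;

// Sketch: write one comma-separated categorical record as a VectorWritable
// into a SequenceFile<LongWritable, VectorWritable>.
public final class AvfInputWriterSketch {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path output = new Path("avf/input/part-00000"); // placeholder path
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, output, LongWritable.class, VectorWritable.class);
    try {
      String line = "1,3,2,7"; // one record, categorical values already encoded as ints
      String[] tokens = line.split(",");
      double[] values = new double[tokens.length];
      for (int i = 0; i < tokens.length; i++) {
        values[i] = Double.parseDouble(tokens[i]);
      }
      writer.append(new LongWritable(0), new VectorWritable(new DenseVector(values)));
    } finally {
      writer.close();
    }
  }
}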
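
For 3), the commons-cli2 pattern that KMeansDriver uses looks roughly like this; the option
names here are just examples, not what your driver has to expose:

import org.apache.commons.cli2.CommandLine;
import org.apache.commons.cli2.Group;
import org.apache.commons.cli2.Option;
import org.apache.commons.cli2.OptionException;
import org.apache.commons.cli2.builder.ArgumentBuilder;
import org.apache.commons.cli2.builder.DefaultOptionBuilder;
import org.apache.commons.cli2.builder.GroupBuilder;
import org.apache.commons.cli2.commandline.Parser;

// Sketch: build and parse --input/--output options the same way the other Mahout drivers do.
public final class AvfDriverOptionsSketch {
  public static void main(String[] args) throws OptionException {
    DefaultOptionBuilder obuilder = new DefaultOptionBuilder();
    ArgumentBuilder abuilder = new ArgumentBuilder();
    GroupBuilder gbuilder = new GroupBuilder();

    Option inputOpt = obuilder.withLongName("input").withShortName("i").withRequired(true)
        .withArgument(abuilder.withName("input").withMinimum(1).withMaximum(1).create())
        .withDescription("Path to the input data").create();
    Option outputOpt = obuilder.withLongName("output").withShortName("o").withRequired(true)
        .withArgument(abuilder.withName("output").withMinimum(1).withMaximum(1).create())
        .withDescription("Path for the AVF output").create();

    Group group = gbuilder.withName("Options").withOption(inputOpt).withOption(outputOpt).create();
    Parser parser = new Parser();
    parser.setGroup(group);
    CommandLine cmdLine = parser.parse(args);

    String input = cmdLine.getValue(inputOpt).toString();
    String output = cmdLine.getValue(outputOpt).toString();
    System.out.println("input=" + input + ", output=" + output);
  }
}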
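
For 5), a DistributedCache version of that setup() could look something like the sketch below.
The "part-r-00000" file name, the "a" config key and the Utils.readFile call are the ones from
your patch; the key/value types and everything else are assumptions, so check the
DictionaryVectorizer Map/Reduce classes for the real usage:

import java.io.IOException;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch: read the attribute-value frequency file through the DistributedCache
// instead of opening an HDFS path stored under the "a" configuration key.
public class AvfValueMapperSketch extends Mapper<LongWritable, Text, Text, LongWritable> {

  // Driver side: register the frequency file produced by the first job.
  public static void cacheFrequencyFile(Job job, Path frequencyDir) {
    DistributedCache.addCacheFile(new Path(frequencyDir, "part-r-00000").toUri(),
        job.getConfiguration());
  }

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    // The cached file is copied to the local disk of every task, so no HDFS read is needed here.
    Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
    String localPath = cached[0].toString();
    // sumAttribute = Utils.readFile(localPath);  // same call as in the patch, now on a local path
  }
}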
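
For 7), since the first pass is essentially counting occurrences of each attribute value, a
combiner that pre-sums the counts on the map side should cut down what reaches the reducer
considerably. The key/value types below are assumptions, adjust them to whatever the mapper
actually emits:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch: pre-sum the per-mapper counts of each (attribute, value) key so the
// reducer sees one partial sum per key per map task instead of one record per row.
public class AvfFrequencyCombinerSketch extends Reducer<Text, LongWritable, Text, LongWritable> {
  @Override
  protected void reduce(Text key, Iterable<LongWritable> values, Context context)
      throws IOException, InterruptedException {
    long sum = 0;
    for (LongWritable value : values) {
      sum += value.get();
    }
    context.write(key, new LongWritable(sum));
  }
}

You would then just register it on the job with job.setCombinerClass(AvfFrequencyCombinerSketch.class).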

> Implementation of AVF algorithm
> -------------------------------
>
>                 Key: MAHOUT-384
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-384
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Collaborative Filtering
>            Reporter: tony cui
>         Attachments: mahout-384.patch
>
>
> This program implements an outlier detection algorithm called AVF, a kind of
> fast parallel outlier detection for categorical datasets using MapReduce, introduced
> in this paper:
>     http://thepublicgrid.org/papers/koufakou_wcci_08.pdf
> Following is an example of how to run this program under hadoop:
> $hadoop jar programName.jar avfDriver inputData interTempData outputData
> The output data contains the ordered avfValue in the first column, followed by the
> original input data.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

