spark-issues mailing list archives

From "Sean Owen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-2433) In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an implementation bug.
Date Thu, 10 Jul 2014 18:00:12 GMT

    [ https://issues.apache.org/jira/browse/SPARK-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14057746#comment-14057746 ]

Sean Owen commented on SPARK-2433:
----------------------------------

Your "earlier implementation" is identical to "new implementation 1". This does not appear
to be the code in master, and I think it's only useful to propose fixes to the current version
of code.

> In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an implementation bug.
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-2433
>                 URL: https://issues.apache.org/jira/browse/SPARK-2433
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib, PySpark
>    Affects Versions: 0.9.1
>         Environment: Any 
>            Reporter: Rahul K Bhojwani
>              Labels: easyfix, test
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> I don't have much experience with reporting errors; this is my first time. If something is not clear, please feel free to contact me (details given below).
> In the pyspark mllib library. 
> Path : \spark-0.9.1\python\pyspark\mllib\classification.py
> Class: NaiveBayesModel
> Method:  self.predict
> Earlier Implementation:
> def predict(self, x):
>     """Return the most likely class for a data vector x"""
>     return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta),x)))
>         
> New Implementation:
> No:1
> def predict(self, x):
>     """Return the most likely class for a data vector x"""
>     return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta),x)))
> No:2
> def predict(self, x):
>     """Return the most likely class for a data vector x"""
>     return numpy.argmax(self.pi + dot(x,self.theta.T))
> Explanation:
> I believe No:1 is correct. I don't know about No:2.
> Error one:
> The matrix self.theta has dimension [n_classes, n_features],
> while the matrix x has dimension [1, n_features].
> Taking the dot product will not work, as it is [1, n_features] x [n_classes, n_features].
> Whenever n_classes != n_features, it gives the error: "ValueError: matrices are not aligned"
> In the commented example given in classification.py, n_classes = n_features = 2. That's why there is no error there.
> Both Implementation No:1 and Implementation No:2 take care of this.
> Error 2:
> The basic implementation of naive Bayes is: P(class_n | sample) = count_feature_1 * P(feature_1 | class_n) * count_feature_n * P(feature_n | class_n) * P(class_n) / (the constant P(sample)),
> and taking the class with the max value.
> That's what Implementation No:1 is doing.
> In Implementation No:2:
> It's basically the class with the max value of:
> ( exp(count_feature_1) * P(feature_1 | class_n) * exp(count_feature_n) * P(feature_n | class_n) * P(class_n) )
> I don't know if it gives the exact result.
> Thanks
> Rahul Bhojwani
> rahulbhojwani2003@gmail.com
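
The shape and scoring questions in the quoted report can be checked with a small NumPy sketch. This is only an illustration under stated assumptions, not the code in master: it assumes that pi and theta hold log-probabilities (log class priors and log class-conditional feature probabilities), and the toy values and the helper names shapes_align, predict_no1, predict_no2 and predict_product_form are hypothetical. Under that assumption, the expression in No:2 is the logarithm of the multinomial product P(class) * prod_j P(feature_j | class) ** count_j, so it picks the same class as the direct product form. The expression in No:1 takes the log of a count-weighted sum of probabilities instead, so it is not guaranteed to agree.

```python
import numpy as np

# Hypothetical toy model: 3 classes, 4 features.
# Assumption: pi and theta store log-probabilities.
pi = np.log(np.array([0.5, 0.3, 0.2]))           # shape (n_classes,)
theta = np.log(np.array([
    [0.10, 0.20, 0.30, 0.40],
    [0.40, 0.30, 0.20, 0.10],
    [0.25, 0.25, 0.25, 0.25],
]))                                              # shape (n_classes, n_features)
x = np.array([2.0, 0.0, 1.0, 3.0])               # feature counts, shape (n_features,)


def shapes_align(matrix, vector):
    """Return True if numpy.dot(vector, matrix) is defined for these shapes."""
    try:
        np.dot(vector, matrix)
        return True
    except ValueError:
        return False


def predict_no1(pi, theta, x):
    """Expression from No:1: log of a count-weighted sum of probabilities."""
    return int(np.argmax(pi + np.log(np.dot(np.exp(theta), x))))


def predict_no2(pi, theta, x):
    """Expression from No:2: log prior plus count-weighted sum of log-probabilities."""
    return int(np.argmax(pi + np.dot(x, theta.T)))


def predict_product_form(pi, theta, x):
    """Direct product form: P(class) * prod_j P(feature_j | class) ** count_j."""
    scores = np.exp(pi) * np.prod(np.exp(theta) ** x, axis=1)
    return int(np.argmax(scores))


if __name__ == "__main__":
    # x . theta is undefined when n_classes != n_features; x . theta.T is fine.
    print("dot(x, theta) defined:  ", shapes_align(theta, x))
    print("dot(x, theta.T) defined:", shapes_align(theta.T, x))
    print("No:1 prediction:        ", predict_no1(pi, theta, x))
    print("No:2 prediction:        ", predict_no2(pi, theta, x))
    print("product-form prediction:", predict_product_form(pi, theta, x))
```

With many features the direct product form underflows quickly, which is the usual reason for working with sums of logs as in No:2.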



--
This message was sent by Atlassian JIRA
(v6.2#6252)
