[ https://issues.apache.org/jira/browse/SPARK2433?page=com.atlassian.jira.plugin.system.issuetabpanels:commenttabpanel&focusedCommentId=14057746#comment14057746
]
Sean Owen commented on SPARK2433:

Your "earlier implementation" is identical to "new implementation 1". This does not appear
to be the code in master, and I think it's only useful to propose fixes to the current version
of code.
> In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an implementation bug.
> 
>
> Key: SPARK2433
> URL: https://issues.apache.org/jira/browse/SPARK2433
> Project: Spark
> Issue Type: Bug
> Components: MLlib, PySpark
> Affects Versions: 0.9.1
> Environment: Any
> Reporter: Rahul K Bhojwani
> Labels: easyfix, test
> Original Estimate: 1h
> Remaining Estimate: 1h
>
> I don't have much experience with reporting errors; this is my first time. If something
> is not clear, please feel free to contact me (details given below).
> In the pyspark mllib library.
> Path : \spark0.9.1\python\pyspark\mllib\classification.py
> Class: NaiveBayesModel
> Method: self.predict
> Earlier Implementation:
>     def predict(self, x):
>         """Return the most likely class for a data vector x"""
>         return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta),x)))
>
> New Implementation:
> No:1
>     def predict(self, x):
>         """Return the most likely class for a data vector x"""
>         return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta),x)))
> No:2
>     def predict(self, x):
>         """Return the most likely class for a data vector x"""
>         return numpy.argmax(self.pi + dot(x,self.theta.T))
> Explanation:
> No:1 is correct in my opinion. I am not sure about No:2.
> Error 1:
> The matrix self.theta has dimensions [n_classes, n_features],
> while the matrix x has dimensions [1, n_features].
> Taking the dot product will not work, as it is [1, n_features] x [n_classes, n_features].
> Unless n_classes equals n_features, it will give the error: "ValueError: matrices are not aligned"
> In the commented example given in classification.py, n_classes = n_features = 2,
> which is why no error occurs there.
> Both implementation No:1 and implementation No:2 take care of this.
> Error 2:
> The basic implementation of naive Bayes is:
> P(class_n | sample) = count_feature_1 * P(feature_1 | class_n) * count_feature_n * P(feature_n | class_n) * P(class_n) / (the constant P(sample)),
> taking the class with the max value.
> That is what implementation No:1 is doing.
> Implementation No:2 basically takes the class with the max value of:
> ( exp(count_feature_1) * P(feature_1 | class_n) * exp(count_feature_n) * P(feature_n | class_n) * P(class_n) )
> I don't know if it gives the exact result.
> Thanks
> Rahul Bhojwani
> rahulbhojwani2003@gmail.com
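
To make the shape argument in the report concrete, here is a minimal NumPy sketch. It assumes, as the use of exp(self.theta) suggests, that theta holds log conditional probabilities with shape [n_classes, n_features] and that pi holds log class priors. The parameter values and the helper names predict_v1, predict_v2 and shapes_align are hypothetical and exist only for illustration; they are not taken from classification.py.

```python
import numpy as np

# Hypothetical parameters: 2 classes, 3 features (so n_classes != n_features).
# Assumption: theta[c, j] = log P(feature_j | class_c), pi[c] = log P(class_c).
theta = np.log(np.array([[0.7, 0.2, 0.1],
                         [0.1, 0.3, 0.6]]))
pi = np.log(np.array([0.5, 0.5]))
x = np.array([3.0, 0.0, 1.0])  # feature counts for one sample


def predict_v1(pi, theta, x):
    """Decision rule quoted as "No:1" in the report."""
    return int(np.argmax(pi + np.log(np.dot(np.exp(theta), x))))


def predict_v2(pi, theta, x):
    """Decision rule quoted as "No:2" in the report."""
    return int(np.argmax(pi + np.dot(x, theta.T)))


def shapes_align(x, theta):
    """Return True if dot(x, theta) is defined without transposing theta."""
    try:
        np.dot(x, theta)
        return True
    except ValueError:
        return False


print(shapes_align(x, theta))                         # 3 features vs 2 classes
print(shapes_align(np.ones(2), np.zeros((2, 2))))     # square case
print(predict_v1(pi, theta, x), predict_v2(pi, theta, x))
```

With these shapes, dot(x, theta) only works in the square case n_classes = n_features, while both dot(exp(theta), x) and dot(x, theta.T) yield one score per class. Note that dot(x, theta.T) computes sum_j x_j * log P(feature_j | class), which is the usual multinomial log-likelihood kept in log space.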

