spark-issues mailing list archives

From "RJ Nowling (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-4894) Add Bernoulli-variant of Naive Bayes
Date Wed, 14 Jan 2015 20:40:34 GMT

    [ https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277631#comment-14277631 ]

RJ Nowling commented on SPARK-4894:
-----------------------------------

Thanks [~lmcguire]!  I'll wait until next week in case you have time to put a patch together.

In the meantime, here are my thoughts for changes:
1. Add an optional `model` variable to the `NaiveBayes` object and class and `NaiveBayesModel`.
It would be a string with a default value of `Multinomial`.  For Bernoulli, we can use `Bernoulli`.

2. In `NaiveBayesModel.predict`, we should compute and store `brzPi + brzTheta * testData.toBreeze`.
If `testData(i)` is 0, then feature `i` contributes 0 to `brzTheta * testData.toBreeze`. If Bernoulli is enabled,
we add `log(1 - exp(brzTheta)) * (1 - testData.toBreeze)` to account for the probabilities
of the 0-valued features.  (Breeze may not allow adding/subtracting scalars and vectors/matrices.)

In the current model, no term is added for entries of `testData` that are 0.  In the
Bernoulli model, we would add a separate term for each 0-valued feature.
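
To make the two scoring rules concrete, here is a minimal sketch in plain Python (not Breeze or MLlib code). The function name `predict_scores` and the `model_type` argument are hypothetical and only mirror the proposed `model` string; `pi` and `theta` are assumed to hold log priors and log conditional probabilities, as `brzPi` and `brzTheta` do.

```python
import math


def predict_scores(pi, theta, x, model_type="Multinomial"):
    """Return one unnormalized log-posterior score per class.

    pi[c]       -- log prior of class c (assumed layout, like brzPi)
    theta[c][j] -- log P(feature j | class c) (assumed layout, like brzTheta)
    x[j]        -- feature value; 0 or 1 in the Bernoulli case
    """
    if model_type not in ("Multinomial", "Bernoulli"):
        raise ValueError("unknown model type: %s" % model_type)

    scores = []
    for log_prior, log_probs in zip(pi, theta):
        # Shared term: pi + theta * x. Features with x[j] == 0 add nothing here.
        score = log_prior + sum(lp * xj for lp, xj in zip(log_probs, x))
        if model_type == "Bernoulli":
            # Extra term: log(1 - exp(theta)) * (1 - x), which is non-zero
            # only for the 0-valued features. Assumes smoothed probabilities
            # strictly below 1, otherwise log(0) would be evaluated.
            score += sum(
                math.log(1.0 - math.exp(lp)) * (1.0 - xj)
                for lp, xj in zip(log_probs, x)
            )
        scores.append(score)
    return scores


# Illustrative values only: two classes, two binary features.
pi = [math.log(0.5), math.log(0.5)]
theta = [[math.log(0.8), math.log(0.1)], [math.log(0.3), math.log(0.6)]]
print(predict_scores(pi, theta, [1.0, 0.0]))
print(predict_scores(pi, theta, [1.0, 0.0], model_type="Bernoulli"))
```

The only difference between the two modes is the extra sum over 0-valued features, so the multinomial path stays unchanged when the default is used.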

Here is the sklearn source for comparison: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/naive_bayes.py

Note that sklearn adds the negative log probability (`neg_prob`, i.e. `log(1 - exp(theta))`) for all features and then subtracts it for features with 1-values.
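
The add-then-subtract arrangement gives the same score as adding the extra term only for 0-valued features. The sketch below is plain Python written from the description above, not copied from sklearn; the function names are hypothetical.

```python
import math


def neg_log_prob(log_p):
    # log(1 - p) given log(p); assumes p < 1 (smoothed estimates).
    return math.log(1.0 - math.exp(log_p))


def bernoulli_direct(log_prior, log_probs, x):
    # x_j * theta_j for 1-valued features, (1 - x_j) * log(1 - p_j) for 0-valued ones.
    return log_prior + sum(
        xj * lp + (1.0 - xj) * neg_log_prob(lp) for lp, xj in zip(log_probs, x)
    )


def bernoulli_rearranged(log_prior, log_probs, x):
    # Add the negative term for every feature, then subtract it again
    # wherever x_j == 1 by using (theta_j - neg_j) as the per-feature weight.
    neg = [neg_log_prob(lp) for lp in log_probs]
    return (
        log_prior
        + sum(neg)
        + sum(xj * (lp - n) for lp, n, xj in zip(log_probs, neg, x))
    )


# Illustrative values only.
log_probs = [math.log(0.8), math.log(0.1), math.log(0.4)]
x = [1.0, 0.0, 1.0]
print(bernoulli_direct(math.log(0.5), log_probs, x))
print(bernoulli_rearranged(math.log(0.5), log_probs, x))
```

The rearranged form needs only one product with the feature vector plus a per-class constant, which may be convenient if Breeze does not support the scalar/vector subtraction in `1 - testData.toBreeze`.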

[~mengxr], [~josephkb] Any thoughts or comments?

> Add Bernoulli-variant of Naive Bayes
> ------------------------------------
>
>                 Key: SPARK-4894
>                 URL: https://issues.apache.org/jira/browse/SPARK-4894
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>    Affects Versions: 1.2.0
>            Reporter: RJ Nowling
>            Assignee: RJ Nowling
>
> MLlib only supports the multinomial-variant of Naive Bayes.  The Bernoulli version of Naive Bayes is more useful for situations where the features are binary values.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
