spark-issues mailing list archives

From "RJ Nowling (JIRA)" <>
Subject [jira] [Commented] (SPARK-4894) Add Bernoulli-variant of Naive Bayes
Date Thu, 15 Jan 2015 04:20:34 GMT


RJ Nowling commented on SPARK-4894:

[~josephkb], after some thought, I've come around: your idea of a single NB class with a
Factor type parameter may be the more maintainable choice, and it also offers some novel
functionality. But there seems to be a lot to figure out (for example, we should check the
decision tree implementation), and I don't want to hold up what should be a relatively simple
change to support Bernoulli NB.  What do you think?

Comments about refactoring:
(1) How often is NB used with continuous values?  I see that sklearn supports Gaussian NB,
but is it used in practice?  My understanding is that NB is generally used for text classification
with counts or binary values, possibly weighted by TF-IDF.  We should probably email the
user and dev lists to get feedback.  If no one is asking for it, we should shelve it
and focus on other things.

(2) After some more reflection, I can see a few more benefits to your suggestion of feature
types (e.g., categorical, discrete counts, continuous, binary, etc.).  If we created corresponding
FeatureLikelihood types (e.g., Bernoulli, Multinomial, Gaussian, etc.), it would promote composition,
which would be easier to test, debug, and maintain than multiple NB subclasses as in sklearn.
 Additionally, if the user can define a type for each feature, then users can mix and match
likelihood types as well.  Most NB implementations treat all features the same; what if
we had a model that allowed heterogeneous features?  If it works well in NB, it could be extended
to other parts of MLlib.  (There is likely some overlap with decision trees since they support
multiple feature types, so we might want to see if there is anything there we can reuse.)
 At the API level, we could provide a basic API which takes {noformat}RDD[Vector[Double]]{noformat}
like the current API, so that simplicity isn't compromised, and provide a more advanced API
for power users.
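
To make the composition idea in (2) concrete, here is a minimal sketch in plain Python, not
written against the MLlib API. All names (BernoulliLikelihood, GaussianLikelihood,
MixedNaiveBayes) are hypothetical and exist only for illustration; the assumption is that each
feature gets its own likelihood object, fitted per class, and that the model just sums their
log-probabilities.

```python
import math
from collections import defaultdict


class BernoulliLikelihood:
    """Hypothetical likelihood for a binary feature, with Laplace smoothing."""

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing
        self.p = 0.5

    def fit(self, values):
        ones = sum(1 for v in values if v)
        self.p = (ones + self.smoothing) / (len(values) + 2.0 * self.smoothing)
        return self

    def log_prob(self, x):
        return math.log(self.p if x else 1.0 - self.p)


class GaussianLikelihood:
    """Hypothetical likelihood for a continuous feature (variance is floored)."""

    def __init__(self, min_var=1e-9):
        self.min_var = min_var
        self.mean = 0.0
        self.var = 1.0

    def fit(self, values):
        n = len(values)
        self.mean = sum(values) / n
        self.var = max(sum((v - self.mean) ** 2 for v in values) / n, self.min_var)
        return self

    def log_prob(self, x):
        return -0.5 * (math.log(2.0 * math.pi * self.var)
                       + (x - self.mean) ** 2 / self.var)


class MixedNaiveBayes:
    """Naive Bayes with one likelihood type per feature column (illustrative only)."""

    def __init__(self, likelihood_factories):
        # One zero-argument factory per feature, e.g. [BernoulliLikelihood, GaussianLikelihood].
        self.factories = likelihood_factories
        self.log_priors = {}
        self.likelihoods = {}

    def fit(self, rows, labels):
        by_class = defaultdict(list)
        for row, label in zip(rows, labels):
            by_class[label].append(row)
        total = len(rows)
        for label, members in by_class.items():
            self.log_priors[label] = math.log(len(members) / total)
            # Fit an independent likelihood for each feature within this class.
            self.likelihoods[label] = [
                make().fit([m[j] for m in members])
                for j, make in enumerate(self.factories)
            ]
        return self

    def predict(self, row):
        def score(label):
            return self.log_priors[label] + sum(
                lik.log_prob(x) for lik, x in zip(self.likelihoods[label], row)
            )
        return max(self.log_priors, key=score)


# Feature 0 is binary, feature 1 is continuous.
rows = [(1, 5.1), (1, 4.9), (0, 5.0), (0, 1.0), (0, 1.2), (1, 0.9)]
labels = ["a", "a", "a", "b", "b", "b"]
model = MixedNaiveBayes([BernoulliLikelihood, GaussianLikelihood]).fit(rows, labels)
```

Each likelihood type can be unit-tested on its own, and supporting a new feature type would mean
adding one small class rather than another NB subclass. A basic API could simply use the
same likelihood for every column.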

> Add Bernoulli-variant of Naive Bayes
> ------------------------------------
>                 Key: SPARK-4894
>                 URL:
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>    Affects Versions: 1.2.0
>            Reporter: RJ Nowling
>            Assignee: RJ Nowling
> MLlib only supports the multinomial-variant of Naive Bayes.  The Bernoulli version of
> Naive Bayes is more useful for situations where the features are binary values.
