spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sean Owen <>
Subject Re: MLib Documentation Update Needed
Date Mon, 26 Sep 2016 08:40:30 GMT
Yes I think that footnote could be a lot more prominent, or pulled up
right under the table.

I also think it would be fine to present the {0,1} formulation. It's
actually more recognizable, I think, for log-loss in that form. It's
probably less recognizable for hinge loss, but, consistency is more
important. There's just an extra (2y-1) term, at worst.

The loss here is per instance, and implicitly summed over all
instances. I think that is probably not confusing for the reader; if
they're reading this at all to double-check just what formulation is
being used, I think they'd know that. But, it's worth a note.

The loss is summed in the case of log-loss, not multiplied (if that's
what you're saying).

Those are decent improvements, feel free to open a pull request / JIRA.

On Mon, Sep 26, 2016 at 6:22 AM, Tobi Bosede <> wrote:
> The loss function here for logistic regression is confusing. It seems to
> imply that spark uses only -1 and 1 class labels. However it uses 0,1 as the
> very inconspicuous note quoted below (under Classification) says. We need to
> make this point more visible to avoid confusion.
> Better yet, we should replace the loss function listed with that for 0, 1 no
> matter how mathematically inconvenient, since that is what is actually
> implemented in Spark.
> More problematic, the loss function (even in this "convenient" form) is
> actually incorrect. This is because it is missing either a summation (sigma)
> in the log or product (pi) outside the log, as the loss for logistic is the
> log likelihood. So there are multiple problems with the documentation.
> Please advise on steps to fix for all version documentation or if there are
> already some in place.
> "Note that, in the mathematical formulation in this guide, a binary label
> y is denoted as either +1 (positive) or −1 (negative), which is convenient
> for the formulation. However, the negative label is represented by 0 in
> spark.mllib instead of −1, to be consistent with multiclass labeling."

To unsubscribe e-mail:

View raw message