spark-dev mailing list archives

From "Pierson, Oliver C" <>
Subject Opening a JIRA for QuantileDiscretizer bug
Date Tue, 23 Feb 2016 02:45:07 GMT

  I've discovered a bug in the QuantileDiscretizer estimator.  Specifically, for large DataFrames
QuantileDiscretizer will only create one split (i.e. two bins).

The error happens in lines 113 and 114 of QuantileDiscretizer.scala:

    val requiredSamples = math.max(numBins * numBins, 10000)

    val fraction = math.min(requiredSamples / dataset.count(), 1.0)

After the first line, requiredSamples is an Int, so the division on the second line is
integer division.  Therefore, if requiredSamples < dataset.count(), the quotient truncates
to 0 and fraction is always 0.0.

The problem can be fixed simply by replacing the first line with:

    val requiredSamples = math.max(numBins * numBins, 10000.0)
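
To illustrate the difference, here is a minimal standalone sketch that needs no Spark.  The
object name, the two helper functions, and the row count of 1,000,000 are made up for the
example; rowCount simply stands in for dataset.count(), which returns a Long.

```scala
object FractionSketch {
  // Original expression: requiredSamples is an Int, so Int / Long is integer
  // division and the quotient is truncated before math.min sees it.
  def originalFraction(numBins: Int, rowCount: Long): Double = {
    val requiredSamples = math.max(numBins * numBins, 10000) // Int
    math.min(requiredSamples / rowCount, 1.0)
  }

  // Proposed expression: the 10000.0 literal makes requiredSamples a Double,
  // so the division is floating-point.
  def fixedFraction(numBins: Int, rowCount: Long): Double = {
    val requiredSamples = math.max(numBins * numBins, 10000.0) // Double
    math.min(requiredSamples / rowCount, 1.0)
  }

  def main(args: Array[String]): Unit = {
    val rowCount = 1000000L // hypothetical stand-in for dataset.count()
    println(originalFraction(10, rowCount)) // prints 0.0
    println(fixedFraction(10, rowCount))    // prints 0.01
  }
}
```

With fewer rows than requiredSamples (say 5,000), both versions return 1.0, which is why
the problem only shows up on large DataFrames.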

I've implemented this change in my fork and all tests passed (except for docker integration,
but I think that's another issue).  I'm happy to submit a PR if it will ease someone else's
workload.  However, I'm unsure of how to create a JIRA.  I've created an account on the issue
tracker, but when I try to create an issue it asks me to choose a "Service
Desk".  Which one should I be choosing?

Thanks much,

Oliver Pierson
