spark-dev mailing list archives

From "Pierson, Oliver C" <...@gatech.edu>
Subject Opening a JIRA for QuantileDiscretizer bug
Date Tue, 23 Feb 2016 02:45:07 GMT
Hello,

  I've discovered a bug in the QuantileDiscretizer estimator.  Specifically, for large DataFrames
QuantileDiscretizer will only create one split (i.e. two bins).


The error happens in lines 113 and 114 of QuantileDiscretizer.scala:


    val requiredSamples = math.max(numBins * numBins, 10000)

    val fraction = math.min(requiredSamples / dataset.count(), 1.0)


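After the first line, requiredSamples is an Int, and dataset.count() returns a Long, so the
division on the second line is integer division.  Therefore, if requiredSamples < dataset.count()
(i.e. for large DataFrames) then fraction is always 0.0.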


The problem can be fixed simply by replacing the first line with:


    val requiredSamples = math.max(numBins * numBins, 10000.0)
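
To illustrate the difference, here is a minimal standalone sketch (not the actual Spark source).  The object name FractionDemo, the method names, and the count parameter standing in for dataset.count() are made up for illustration only:

```scala
object FractionDemo {
  // Mirrors the two original lines. `count` is a hypothetical stand-in
  // for dataset.count(), which returns a Long.
  def buggyFraction(numBins: Int, count: Long): Double = {
    val requiredSamples = math.max(numBins * numBins, 10000) // Int
    // Int / Long is integer division: the quotient is truncated to a Long
    // before math.min widens it to Double.
    math.min(requiredSamples / count, 1.0)
  }

  // Same logic with the proposed change: the 10000.0 literal makes
  // requiredSamples a Double, so the division is floating-point.
  def fixedFraction(numBins: Int, count: Long): Double = {
    val requiredSamples = math.max(numBins * numBins, 10000.0) // Double
    math.min(requiredSamples / count, 1.0)
  }

  def main(args: Array[String]): Unit = {
    val count = 1000000L // more rows than requiredSamples
    println(buggyFraction(10, count)) // 0.0
    println(fixedFraction(10, count)) // 0.01
  }
}
```

With the original lines, the fraction can only come out as 0.0 (when count exceeds requiredSamples) or 1.0 (otherwise), which would explain why large DataFrames end up with an empty sample.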


I've implemented this change in my fork and all tests passed (except for docker integration,
but I think that's another issue).  I'm happy to submit a PR if it will ease someone else's
workload.  However, I'm unsure of how to create a JIRA.  I've created an account on the issue
tracker (issues.apache.org) but when I try to create an issue it asks me to choose a "Service
Desk".  Which one should I be choosing?


Thanks much,

Oliver Pierson


