spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sean Owen <so...@cloudera.com>
Subject Re: Decision forests don't work with non-trivial categorical features
Date Mon, 13 Oct 2014 08:04:23 GMT
I'm looking at this bit of code in DecisionTreeMetadata ...

val maxCategoriesForUnorderedFeature =
  ((math.log(maxPossibleBins / 2 + 1) / math.log(2.0)) + 1).floor.toInt
strategy.categoricalFeaturesInfo.foreach { case (featureIndex, numCategories) =>
  // Decide if some categorical features should be treated as
unordered features,
  //  which require 2 * ((1 << numCategories - 1) - 1) bins.
  // We do this check with log values to prevent overflows in case
numCategories is large.
  // The next check is equivalent to: 2 * ((1 << numCategories - 1) -
1) <= maxBins
  if (numCategories <= maxCategoriesForUnorderedFeature) {
    unorderedFeatures.add(featureIndex)
    numBins(featureIndex) = numUnorderedBins(numCategories)
  } else {
    numBins(featureIndex) = numCategories
  }
}

So if I have a feature with 40 values and less than about a trillion
bins, it gets treated as a continuous feature, which is meaningless.
It shortly throws an exception though since other parts of the code
expect this to be a categorical feature.

I think there's a bug here somewhere but wasn't sure whether it was
just 'not implemented' yet and so needs a better error message (and
should be implemented), or something else preventing this from working
as expected.

I'll wait a beat to get more info and then if needed open a JIRA. Thanks all.

On Mon, Oct 13, 2014 at 3:34 AM, Evan Sparks <evan.sparks@gmail.com> wrote:
> I was under the impression that we were using the usual sort by average response value
heuristic when storing histogram bins (and searching for optimal splits) in the tree code.
>
> Maybe Manish or Joseph can clarify?
>
>> On Oct 12, 2014, at 2:50 PM, Sean Owen <sowen@cloudera.com> wrote:
>>
>> I'm having trouble getting decision forests to work with categorical
>> features. I have a dataset with a categorical feature with 40 values.
>> It seems to be treated as a continuous/numeric value by the
>> implementation.
>>
>> Digging deeper, I see there is some logic in the code that indicates
>> that categorical features over N values do not work unless the number
>> of bins is at least 2*((2^N - 1) - 1) bins. I understand this as the
>> naive brute force condition, wherein the decision tree will test all
>> possible splits of the categorical value.
>>
>> However, this gets unusable quickly as the number of bins should be
>> tens or hundreds at best, and this requirement rules out categorical
>> values over more than 10 or so features as a result. But, of course,
>> it's not unusual to have categorical features with high cardinality.
>> It's almost common.
>>
>> There are some pretty fine heuristics for selecting 'bins' over
>> categorical features when the number of bins is far fewer than the
>> complete, exhaustive set.
>>
>> Before I open a JIRA or continue, does anyone know what I am talking
>> about, am I mistaken? Is this a real limitation and is it worth
>> pursuing these heuristics? I can't figure out how to proceed with
>> decision forests in MLlib otherwise.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> For additional commands, e-mail: dev-help@spark.apache.org
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Mime
View raw message