Manish,
My use case for (asymmetric) absolute error is, quite simply, quantile
regression. In other words, I want to use Spark to learn conditional
cumulative distribution functions. See R's GBM quantile regression option.
If you either find or create a Jira ticket, I would be happy to give it a
shot. Is there a design doc explaining how the gradient boosting algorithm
is laid out in MLlib? I tried reading the code, but without a "Rosetta
stone" it's impossible to make sense of it.
Alex
On Mon, Nov 17, 2014 at 8:25 PM, Manish Amde <manish9ue@gmail.com> wrote:
> Hi Alessandro,
>
> I think absolute error as a splitting criterion might be feasible with the
> current architecture; the sufficient statistics we currently collect
> might be able to support this. Could you let us know of scenarios
> where absolute error has significantly outperformed squared error for
> regression trees? Also, what's your use case that makes squared error
> undesirable?
>
> For gradient boosting, you are correct. The weak hypothesis weights refer
> to tree predictions in each of the branches. We plan to explain this in
> the 1.2 documentation and maybe add some more clarifications to the
> Javadoc.
>
> I will try to search for JIRAs or create new ones and update this thread.
>
> Manish
>
>
> On Monday, November 17, 2014, Alessandro Baretta <alexbaretta@gmail.com>
> wrote:
>
>> Manish,
>>
>> Thanks for pointing me to the relevant docs. It is unfortunate that
>> absolute error is not supported yet. I can't seem to find a Jira for it.
>>
>> Now, here's what the comments say in the current master branch:
>>
>> /**
>>  * :: Experimental ::
>>  * A class that implements Stochastic Gradient Boosting
>>  * for regression and binary classification problems.
>>  *
>>  * The implementation is based upon:
>>  *   J.H. Friedman. "Stochastic Gradient Boosting." 1999.
>>  *
>>  * Notes:
>>  *  This currently can be run with several loss functions. However,
>>  *  only SquaredError is fully supported. Specifically, the loss
>>  *  function should be used to compute the gradient (to relabel
>>  *  training instances on each iteration) and to weight weak
>>  *  hypotheses. Currently, gradients are computed correctly for the
>>  *  available loss functions, but weak hypothesis weights are not
>>  *  computed correctly for LogLoss or AbsoluteError. Running with
>>  *  those losses will likely behave reasonably, but lacks the same
>>  *  guarantees.
>>  * ...
>>  */
>>
>> By the looks of it, the GradientBoosting API would support an absolute
>> error loss function to perform quantile regression, except for the "weak
>> hypothesis weights". Does this refer to the weights of the leaves of the
>> trees?
>>
>> Alex
>>
>> On Mon, Nov 17, 2014 at 2:24 PM, Manish Amde <manish9ue@gmail.com> wrote:
>>
>>> Hi Alessandro,
>>>
>>> MLlib v1.1 supports variance for regression, and Gini impurity and
>>> entropy for classification.
>>> http://spark.apache.org/docs/latest/mllib-decision-tree.html
>>>
>>> If the information gain calculation can be performed by distributed
>>> aggregation, then it might be possible to plug it into the existing
>>> implementation. We want to perform such calculations (e.g., the median)
>>> for the gradient boosting models (coming up in the 1.2 release) using
>>> absolute error and deviance as loss functions, but I don't think anyone
>>> is planning to work on it yet. :)
>>>
>>> Manish
>>>
>>> On Mon, Nov 17, 2014 at 11:11 AM, Alessandro Baretta <
>>> alexbaretta@gmail.com> wrote:
>>>
>>>> I see that, as of v1.1, MLlib supports regression and classification
>>>> tree models. I assume this means that it uses a squared error loss
>>>> function for the first and a logistic cost function for the second. I
>>>> don't see support for quantile regression via an absolute error cost
>>>> function. Or am I missing something?
>>>>
>>>> If, as it seems, this is missing, how would you recommend implementing
>>>> it?
>>>>
>>>> Alex
>>>>
>>>
>>>
>>
