mahout-user mailing list archives

From Xiaobo Gu <guxiaobo1...@gmail.com>
Subject Re: Do we have to make a separate hold-out data set for AdaptiveLogisticRegression to measure the performance?
Date Thu, 02 Jun 2011 03:03:21 GMT
On our site we will use logistic regression in a batch manner. Customers who
entered in one time frame (such as 2010/1/1 ~ 2010/12/31) will be used to
train the model, and customers who entered in another time frame (such as
2011/1/1 ~ 2011/5/31) will be used to validate it; the model will then be
used to predict users who enter after 2011/6/1. Does this make sense, or
should we feed all data from 2010/1/1 to 2011/5/31 to ALR and let it do the
hold-out internally?
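For concreteness, here is a minimal sketch (in Java, against the Mahout
0.5-era org.apache.mahout.classifier.sgd API) of the time-segregated workflow
described above: train AdaptiveLogisticRegression on the 2010 window, then
measure AUC on the 2011/1-5 window, which the learner never sees. The Example
holder class, the method names, and the feature encoding are assumptions for
illustration, not part of Mahout.

    import org.apache.mahout.classifier.evaluation.Auc;
    import org.apache.mahout.classifier.sgd.AdaptiveLogisticRegression;
    import org.apache.mahout.classifier.sgd.CrossFoldLearner;
    import org.apache.mahout.classifier.sgd.L1;
    import org.apache.mahout.math.Vector;

    import java.util.List;

    public class TimeSegregatedAlr {
      // trainingExamples: customers from 2010/1/1 - 2010/12/31, already encoded as Vectors
      // holdOutExamples:  customers from 2011/1/1 - 2011/5/31, never shown to the learner
      public static double trainAndValidate(List<Example> trainingExamples,
                                            List<Example> holdOutExamples,
                                            int numFeatures) {
        // 2 categories (e.g. converts / does not convert), L1 prior;
        // ALR tunes its own learning-rate and regularization on-line
        AdaptiveLogisticRegression alr =
            new AdaptiveLogisticRegression(2, numFeatures, new L1());

        for (Example ex : trainingExamples) {
          alr.train(ex.label, ex.features);   // one on-line pass over the 2010 window
        }
        alr.close();                          // flush any pending training

        // the best model found by ALR's internal evolutionary search
        // (getBest() is null until enough examples have been seen)
        CrossFoldLearner best = alr.getBest().getPayload().getLearner();

        // score the time-segregated hold-out set and compute AUC
        Auc auc = new Auc();
        for (Example ex : holdOutExamples) {
          auc.add(ex.label, best.classifyScalar(ex.features));
        }
        return auc.auc();
      }

      // hypothetical holder for an encoded instance; encoding the raw customer
      // record into a Vector is out of scope for this sketch
      public static class Example {
        final int label;        // 0 or 1
        final Vector features;  // encoded feature vector
        Example(int label, Vector features) { this.label = label; this.features = features; }
      }
    }

The AUC computed this way reflects exactly the deployment situation Ted
describes below: scoring data that comes from the future relative to the
training window.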



On Wed, Jun 1, 2011 at 10:18 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> You don't *have* to have a separate validation set, but it isn't a bad idea.
>
> In particular, with large-scale classifiers, production data almost always
> comes from the future with respect to the training data.  The ALR can't hold
> out data that way because it does on-line training only.  Thus, I would
> recommend that you still have some kind of evaluation hold-out set
> segregated by time.
>
> Another very serious issue can happen if you have near-duplicates in your
> data set.  That often happens in news-wire text, for example.  In that case,
> you would have significant over-fitting with ALR and you wouldn't have a
> clue without a real time-segregated hold-out set (a sketch contrasting the
> internal and hold-out AUC measurements follows below this quoted message).
>
> On Wed, Jun 1, 2011 at 2:22 AM, Xiaobo Gu <guxiaobo1982@gmail.com> wrote:
>
>> Hi,
>>
>> Because ALR splits the training data internally and automatically, I
>> think we don't have to make a separate validation data set.
>>
>> Regards,
>>
>> Xiaobo Gu
>>
>
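As a companion to the sketch above, and assuming the same alr instance and
the hold-out AUC computed there, the snippet below contrasts ALR's internal,
on-line cross-validated AUC with the time-segregated hold-out AUC; the class
and method names and the reporting format are illustrative only. A large gap
between the two numbers points at the drift or near-duplicate over-fitting
Ted warns about, which the internal estimate alone cannot reveal.

    import org.apache.mahout.classifier.sgd.AdaptiveLogisticRegression;
    import org.apache.mahout.classifier.sgd.CrossFoldLearner;

    public class HoldOutComparison {
      // Compare ALR's internal cross-fold AUC (estimated on-line from the
      // training stream itself) with an AUC measured on a time-segregated
      // hold-out set that the learner never saw.
      public static void report(AdaptiveLogisticRegression alr, double holdOutAuc) {
        // getBest() is null until enough training examples have been seen
        CrossFoldLearner best = alr.getBest().getPayload().getLearner();
        double internalAuc = best.auc();  // never measured against future data
        System.out.printf("internal cross-fold AUC = %.3f, time-segregated hold-out AUC = %.3f%n",
            internalAuc, holdOutAuc);
      }
    }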
