I prefer to make my final held-out set look as much as possible like what the model will see in production. So if you plan to retrain every week, I would train on all available data up to time t and then test on data from t to t+1 week. ALR's internal hold-out set is useful, but things change over time, and having a held-out sample from the future (relative to the model) is much more realistic.

On Wed, Jun 1, 2011 at 8:03 PM, Xiaobo Gu wrote:
> On our site we will use Logistic Regression in a batch manner:
> customers who entered in one time frame (such as 2010/1/1 ~ 2010/12/31)
> will be used to train the model, and customers who entered in another
> time frame (such as 2011/1/1 ~ 2011/5/31) will be used to validate the
> model; the model will then be used to score users who enter after
> 2011/6/1. Does this make sense, or should we feed all data from
> 2010/1/1 to 2011/5/31 to ALR and let it do the hold-out internally?
>
> On Wed, Jun 1, 2011 at 10:18 PM, Ted Dunning wrote:
> > You don't *have* to have a separate validation set, but it isn't a
> > bad idea.
> >
> > In particular, with large-scale classifiers, production data almost
> > always comes from the future with respect to the training data. ALR
> > can't hold out that way because it does online training only. Thus,
> > I would still recommend that you keep some kind of evaluation
> > hold-out set segregated by time.
> >
> > Another very serious issue can arise if you have near-duplicates in
> > your data set. That often happens in news-wire text, for example. In
> > that case you would get significant over-fitting with ALR, and you
> > wouldn't have a clue without a real time-segregated hold-out set.
> >
> > On Wed, Jun 1, 2011 at 2:22 AM, Xiaobo Gu wrote:
> >
> >> Hi,
> >>
> >> Because ALR splits the training data internally and automatically,
> >> I think we don't have to make a separate validation data set.
> >>
> >> Regards,
> >>
> >> Xiaobo Gu
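
For concreteness, here is a minimal sketch of the time-segregated evaluation described above, using Mahout's AdaptiveLogisticRegression (ALR) and the Auc evaluator. The Example holder and its fields (timestamp, target, features) are hypothetical stand-ins for however you load your customer records, and feature encoding is assumed to be done already; treat this as a sketch, not a reference implementation.

import java.util.List;
import org.apache.mahout.classifier.evaluation.Auc;
import org.apache.mahout.classifier.sgd.AdaptiveLogisticRegression;
import org.apache.mahout.classifier.sgd.CrossFoldLearner;
import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.math.Vector;

public class TimeSplitEval {

  // Hypothetical record holder; adapt to however you load customer data.
  static class Example {
    long timestamp;   // epoch millis when the customer entered
    int target;       // 0 or 1
    Vector features;  // already-encoded feature vector
  }

  static final long WEEK_MS = 7L * 24 * 60 * 60 * 1000;

  // Train ALR on everything before cutoff (time t), then score the
  // following week as a time-segregated hold-out set.
  static double evaluate(List<Example> examples, long cutoff, int numFeatures) {
    AdaptiveLogisticRegression alr =
        new AdaptiveLogisticRegression(2, numFeatures, new L1());
    for (Example x : examples) {              // assumed sorted by timestamp
      if (x.timestamp < cutoff) {
        alr.train(x.target, x.features);      // all available data up to t
      }
    }
    alr.close();                              // finish the adaptive search
    CrossFoldLearner best = alr.getBest().getPayload().getLearner();

    Auc auc = new Auc();
    for (Example x : examples) {
      if (x.timestamp >= cutoff && x.timestamp < cutoff + WEEK_MS) {
        auc.add(x.target, best.classifyScalar(x.features));  // future week
      }
    }
    return auc.auc();   // AUC on data from the model's future
  }
}

Running this once per retraining cycle, with the cutoff advanced a week each time, gives a rolling estimate of quality that mimics production, and it would also expose the near-duplicate over-fitting that ALR's internal hold-out can miss.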