mahout-user mailing list archives

From Gabriel Webster <gabriel_webs...@htc.com>
Subject Re: Why can't i train using the entire dataset while RMSE evaluation?
Date Fri, 29 Oct 2010 05:40:19 GMT
Read the wiki page; you might also want to read up on machine learning 
more generally.  The example that Tommy gave is the extreme, straw man 
example in which the training algorithm simply memorizes the training 
data.  Most real training algorithms don't actually return 100% accuracy 
on the training data, because they effectively compress the training 
data into a model, and because this compression is lossy, it can't 
memorize the data exactly.  But what you hope is that the compression 
throws out the information that is specific to the training data (and is 
thus useless for predicting test data points), and keeps the information 
that describes the general behaviour of the data (which will help 
predict the test data).  These ideas are formalized in, for example, 
Minimum Description Length, so you might want to read up on that as 
well.  But the upshot is that most real algorithms perform significantly 
better on seen training data than on unseen test data, so testing on 
training data gives you incorrectly high accuracy (incorrect because in 
the real world, you will be running your algorithm on unseen data).
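To make the memorization extreme concrete, here is a minimal, self-contained sketch (plain Java, not Mahout code; all names are made up for illustration). The "model" memorizes every (user,item)->rating pair it has seen and falls back to the global mean otherwise, so its RMSE on the training data is exactly 0 while its RMSE on held-out data is not:

```java
import java.util.HashMap;
import java.util.Map;

public class MemorizerDemo {
    // A "model" that memorizes every (user:item) -> rating pair it sees,
    // and falls back to the global training mean for unseen pairs.
    static Map<String, Double> memory = new HashMap<>();
    static double globalMean;

    static void train(Map<String, Double> trainingData) {
        memory.putAll(trainingData);
        globalMean = trainingData.values().stream()
                .mapToDouble(Double::doubleValue).average().orElse(0.0);
    }

    static double predict(String userItemKey) {
        return memory.getOrDefault(userItemKey, globalMean);
    }

    // Root-mean-squared error of the model's predictions over a dataset.
    static double rmse(Map<String, Double> data) {
        double sumSq = 0.0;
        for (Map.Entry<String, Double> e : data.entrySet()) {
            double err = predict(e.getKey()) - e.getValue();
            sumSq += err * err;
        }
        return Math.sqrt(sumSq / data.size());
    }

    public static void main(String[] args) {
        Map<String, Double> train = new HashMap<>();
        train.put("u1:i1", 5.0);
        train.put("u1:i2", 3.0);
        train.put("u2:i1", 4.0);

        Map<String, Double> heldOut = new HashMap<>();
        heldOut.put("u2:i2", 2.0);
        heldOut.put("u3:i1", 5.0);

        train(train);
        System.out.println("RMSE on training data: " + rmse(train));   // exactly 0.0
        System.out.println("RMSE on held-out data: " + rmse(heldOut)); // nonzero (~1.58 here)
    }
}
```

A real recommender sits between these two numbers, but the direction of the gap is the same: error measured on the training data understates error on unseen data.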

On 10/29/10 1:32 PM, Sanjib Kumar Das wrote:
> Why do you say that it "does not make sense"?
> Do you mean to say that if I train and test on the entire dataset, I should
> get an RMSE of 0 trivially? That is not true.
> Consider the SVD recommender.
> M != LR (where M is the original matrix and L,R matrices are obtained after
> factorization).
>
> Okay, let me rephrase my question this way:
> Is it possible to specify one dataset for training and another dataset for
> testing while evaluating the recommender?
>
> On Fri, Oct 29, 2010 at 12:20 AM, Tommy Chheng<tommy.chheng@gmail.com>wrote:
>
>>   Training and testing with the entire set does not make sense. Read about
>> overfitting for more details: http://en.wikipedia.org/wiki/Overfitting
>>
>> "As a simple example, consider a database of retail purchases that includes
>> the item bought, the purchaser, and the date and time of purchase. It's easy
>> to construct a model that will fit the training set perfectly by using the
>> date and time of purchase to predict the other attributes; *but this model
>> will not generalize at all to new data, because those past times will never
>> occur again.*"
>>
>> @tommychheng
>>
>> On 10/28/10 10:11 PM, Sanjib Kumar Das wrote:
>>
>> Suppose I want to train with the entire data set and test it with the entire
>> data set, how should I go about it?
>>
>> On Fri, Oct 29, 2010 at 12:09 AM, Gabriel Webster<gabriel_webster@htc.com>wrote:
>>
>>
>>   Logically there is something wrong with setting the training percentage to
>> 1.0, because that means the testing percentage is 0.0!  If you don't test on
>> any items then you can't get an RMSE.
>>
>>
>> On 10/29/10 1:06 PM, Sanjib Kumar Das wrote:
>>
>>
>>   I want to train my recommender with the entire dataset while evaluating its
>> RMSE.
>> It returns NaN when I set trainingPercentage=1
>> I know I can set it to 0.99 and get my work done, but logically there is
>> nothing wrong with setting it to 1.0.
>>
>>
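As for the NaN in the original question: with trainingPercentage = 1.0 the held-out test set is empty, so the RMSE computation ends up dividing zero total squared error by zero test pairs, and 0.0/0 is NaN in floating-point arithmetic. A minimal sketch of that arithmetic (plain Java, not Mahout's actual evaluator code):

```java
public class EmptyTestSetDemo {
    // RMSE over an array of prediction errors; the array may be empty.
    static double rmse(double[] errors) {
        double sumSq = 0.0;
        for (double e : errors) sumSq += e * e;
        // When errors.length == 0 this is 0.0 / 0, which is NaN.
        return Math.sqrt(sumSq / errors.length);
    }

    public static void main(String[] args) {
        System.out.println(rmse(new double[] {1.0, -1.0})); // 1.0
        System.out.println(rmse(new double[] {}));          // NaN: no test pairs to score
    }
}
```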