mahout-user mailing list archives

From jyotiranjan panda <>
Subject Re: Confused about train/test data split in recommender evaluation
Date Tue, 11 Nov 2014 06:48:35 GMT
I have done classification using Mahout.
Suppose a file named Testfile is 20 MB in size and its contents look like this:

*category  id      description*
sports     xxx     cricket, football, etc.
sports     xxxx    cricket, volleyball, etc.
news       yyyy    political, etc.
news       ppppp   news channel

Now we want to do text categorization on the above file (suppose we have 2
categories, sports and news).
Suppose, for illustration, that the first 4 lines of Testfile make up 60% of
the data and the last line makes up the remaining 40%.
Then, if I use the 60% as training data and the 40% as test data, Mahout
will train on the first 4 lines and build a binary model.
During testing, it removes the category from the last line (i.e. the 40% of
the file) and feeds that line to the model.

That way, the predicted category can be compared with the actual category in
the file, and the accuracy of the algorithm can be evaluated.
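
The hold-out procedure described above can be sketched roughly as follows. This is only an illustration in plain Java, not Mahout code: the class HoldOutSketch, its Row type, and the toy word-lookup "model" are all hypothetical stand-ins for a real classifier. The point is only the flow: train on the first part, hide the category of the rest, predict, and compare.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of hold-out evaluation; this is not Mahout's API. */
final class HoldOutSketch {

    /** One labeled row of the file: category, id, description. */
    static final class Row {
        final String category;
        final String id;
        final String description;

        Row(String category, String id, String description) {
            this.category = category;
            this.id = id;
            this.description = description;
        }
    }

    /** Toy stand-in for training: remembers the category each word was last seen with. */
    static Map<String, String> train(List<Row> trainingRows) {
        Map<String, String> wordToCategory = new HashMap<>();
        for (Row row : trainingRows) {
            for (String word : row.description.split("[,\\s]+")) {
                wordToCategory.put(word, row.category);
            }
        }
        return wordToCategory;
    }

    /** Predicts from the description only; the true category is never passed in. */
    static String predict(Map<String, String> model, String description) {
        for (String word : description.split("[,\\s]+")) {
            String category = model.get(word);
            if (category != null) {
                return category;
            }
        }
        return "unknown";
    }

    /** Trains on the first trainingFraction of rows and returns accuracy on the rest. */
    static double accuracy(List<Row> rows, double trainingFraction) {
        int cut = (int) Math.round(rows.size() * trainingFraction);
        List<Row> training = rows.subList(0, cut);
        List<Row> test = rows.subList(cut, rows.size());

        Map<String, String> model = train(training);
        int correct = 0;
        for (Row row : test) {
            // Compare the predicted category with the held-back actual category.
            if (predict(model, row.description).equals(row.category)) {
                correct++;
            }
        }
        return test.isEmpty() ? 0.0 : (double) correct / test.size();
    }
}
```

Note that if the file is sorted by category, as in the sample above, cutting it in order leaves the training part with only one category, so in practice the rows would normally be shuffled before splitting.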

I think the same applies to your case too.

On Tue, Nov 11, 2014 at 11:58 AM, Blade Liu <> wrote:

> Hi,
> I'm new to Mahout and am confused about how train and test data are split
> when evaluating recommenders.
> I'm not sure whether the data is split by selecting a portion of the item
> preferences, or by selecting specific users (together with all their
> preferences). For example, say train data accounts for 60% and test data
> accounts for 40%. Does this mean that 40% of all preferences will be used
> for testing (regardless of the associated users)? In classification, all
> features associated with the users would be selected.
> If the partition criterion is based on preferences, would that affect the
> neighborhood similarity that is computed before the recommendation scores?
> Cheers,
> Blade
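
To make the two partition criteria in the question concrete, here is a rough sketch of both in plain Java. It is not Mahout code and makes no claim about which criterion Mahout's evaluators apply; SplitCriteriaSketch, Pref, and Split are hypothetical names used only for illustration. byPreference sends each preference to training or test independently, so one user's preferences can land on both sides; byUser sends all of a user's preferences to the same side.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Hypothetical illustration of two ways to split preference data; not Mahout's API. */
final class SplitCriteriaSketch {

    /** One preference: a user's rating of an item. */
    static final class Pref {
        final long userId;
        final long itemId;
        final float value;

        Pref(long userId, long itemId, float value) {
            this.userId = userId;
            this.itemId = itemId;
            this.value = value;
        }
    }

    /** Result of a split: training preferences and test preferences. */
    static final class Split {
        final List<Pref> training = new ArrayList<>();
        final List<Pref> test = new ArrayList<>();
    }

    /** Criterion 1: each preference is assigned independently of its user. */
    static Split byPreference(List<Pref> prefs, double trainingFraction, Random random) {
        Split split = new Split();
        for (Pref pref : prefs) {
            if (random.nextDouble() < trainingFraction) {
                split.training.add(pref);
            } else {
                split.test.add(pref);
            }
        }
        return split;
    }

    /** Criterion 2: each user is assigned once, and all of that user's preferences follow. */
    static Split byUser(List<Pref> prefs, double trainingFraction, Random random) {
        Split split = new Split();
        Map<Long, Boolean> userInTraining = new HashMap<>();
        for (Pref pref : prefs) {
            boolean inTraining = userInTraining.computeIfAbsent(
                pref.userId, id -> random.nextDouble() < trainingFraction);
            if (inTraining) {
                split.training.add(pref);
            } else {
                split.test.add(pref);
            }
        }
        return split;
    }
}
```

With byPreference, a test user still has some preferences left in the training data, so a user-based neighborhood can be computed for that user, but from fewer preferences than the full data would give. With byUser, a test user has no training preferences at all.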
