mahout-user mailing list archives

From Frank Scholten <fr...@frankscholten.nl>
Subject Re: Data(Set) creation of for train and test.
Date Mon, 03 Feb 2014 21:09:27 GMT
Have a look at OnlineLogisticRegressionTest.iris().

There, Collections.shuffle() is used in combination with List.subList() to
split the data into a train set and a test set.

So you could first read the dataset into a list and then use this trick.
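
A minimal sketch of that trick, assuming the rows are already in a List. The
class and method names (TrainTestSplit, split) and the seed parameter are made
up for illustration; they are not taken from the Mahout test or from the
utility linked below.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper: shuffle a copy of the rows, then cut it in two with subList().
public class TrainTestSplit {

    /** Simple holder for the two parts of the split. */
    public static final class Split<T> {
        public final List<T> train;
        public final List<T> test;

        public Split(List<T> train, List<T> test) {
            this.train = train;
            this.test = test;
        }
    }

    /**
     * @param rows          the full dataset, one element per example
     * @param trainFraction share of the rows that goes into the train set (0.0 to 1.0)
     * @param seed          seed for the shuffle, so the split is reproducible
     */
    public static <T> Split<T> split(List<T> rows, double trainFraction, long seed) {
        if (trainFraction < 0.0 || trainFraction > 1.0) {
            throw new IllegalArgumentException("trainFraction must be between 0 and 1");
        }
        // Shuffle a copy so the caller's list keeps its original order.
        List<T> shuffled = new ArrayList<>(rows);
        Collections.shuffle(shuffled, new Random(seed));

        int cut = (int) Math.round(shuffled.size() * trainFraction);

        // subList() returns views of the backing list; copy them so the two
        // parts are independent of each other.
        List<T> train = new ArrayList<>(shuffled.subList(0, cut));
        List<T> test = new ArrayList<>(shuffled.subList(cut, shuffled.size()));
        return new Split<>(train, test);
    }
}
```

For example, TrainTestSplit.split(rows, 0.8, 42L) on a list of 10 rows gives 8
train rows and 2 test rows, with every row ending up in exactly one of the two
lists.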

I just pushed an example to GitHub that also uses this approach, but there I
wrapped the logic in a utility.

See: https://github.com/frankscholten/mahout-sgd-bank-marketing

and in particular:

https://github.com/frankscholten/mahout-sgd-bank-marketing/blob/master/src/main/java/bankmarketing/util/TrainAndTestSetUtil.java

Cheers,

Frank


On Mon, Feb 3, 2014 at 10:01 PM, j.barrett Strausser <
j.barrett.strausser@gmail.com> wrote:

> Two part question.
>
> 1. String Descriptor for input data
>
> Can anyone confirm my reasoning on the following -
>
> I believe the code below does the following: it says the first column is
> the value to be predicted (the label), and all other columns are to be used
> in the tree construction, i.e. as variables to split on.
>
> val descriptor = "L N N"
> val trainDataValues = fileAsStringArray("myTrainFile.csv");
> val data = DataLoader.loadData(DataLoader.generateDataset(descriptor,
> false, trainDataValues), trainDataValues);
>
> where "myTrainFile.csv" has a form like
>
> "A", .45,.55
> ...
> ...
> "B", 33.3, 22.3
>
>
>
> 2. String Descriptor for input data
>
> I'm now provided a new file "myTestData.csv"
>
> This data has no labels, but is otherwise the same as above. So if I
> attempt to create a dataset, an error is thrown complaining that there is
> no label.
>
> All I'm interested in is being able to call forest.classify(..., ...) but
> I'm not sure how to correctly construct my training dataset.
>
> I cannot simply split the original dataset as is done in most examples.
>
>
> Any examples showing test data construction independent of the original
> training set would be appreciated.
>
>
> --
>
>
> https://github.com/bearrito
> @deepbearrito
>
