Have a look at OnlineLogisticRegressionTest.iris(). There, List.subList() is used in combination with Collections.shuffle() to split the data into train and test sets. So you could first read the dataset into a list and then use this trick.

I just pushed an example to GitHub that also uses this approach, but I wrapped this logic in a utility. See:

https://github.com/frankscholten/mahout-sgd-bank-marketing

and

https://github.com/frankscholten/mahout-sgd-bank-marketing/blob/master/src/main/java/bankmarketing/util/TrainAndTestSetUtil.java

Cheers,

Frank

On Mon, Feb 3, 2014 at 10:01 PM, j.barrett Strausser <j.barrett.strausser@gmail.com> wrote:

> Two part question.
>
> 1. String Descriptor for input data
>
> Can anyone confirm my reasoning on the following?
>
> I believe the code below does the following: it says the first column is
> the feature to be predicted (i.e. the label) and all other columns are to
> be used in the tree construction, e.g. as variables to split on.
>
> val descriptor = "L N N"
> val trainDataValues = fileAsStringArray("myTrainFile.csv");
> val data = DataLoader.loadData(DataLoader.generateDataset(descriptor,
> false, trainDataValues), trainDataValues);
>
> where "myTrainFile.csv" has a form like
>
> "A", .45,.55
> ...
> ...
> "B", 33.3, 22.3
>
> 2. String Descriptor for input data
>
> I'm now provided a new file, "myTestData.csv".
>
> This data has no labels, but is otherwise the same as above. So if I
> attempt to create a dataset, an error will be thrown complaining that
> there is no label.
>
> All I'm interested in is being able to call forest.classify(..., ...) but
> I'm not sure how to correctly construct my training dataset.
>
> I cannot simply split the original dataset as is done in most examples.
>
> Any examples showing test data construction independent of the original
> training set would be appreciated.
>
> --
>
> https://github.com/bearrito
> @deepbearrito
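
For reference, the shuffle-and-subList split mentioned in the reply can be sketched roughly as follows. This is a minimal, self-contained illustration, not the code from OnlineLogisticRegressionTest or TrainAndTestSetUtil; the class name SplitSketch, its method names, the 80/20 ratio and the fixed seed are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper for illustration only; not part of Mahout
// or of the linked bank-marketing repository.
public class SplitSketch {

    // Returns a shuffled copy of the rows; the caller's list is left untouched.
    public static <T> List<T> shuffledCopy(List<T> rows, long seed) {
        List<T> copy = new ArrayList<>(rows);
        Collections.shuffle(copy, new Random(seed));
        return copy;
    }

    // Index at which the shuffled list is cut into train and test parts.
    private static int cutIndex(int size, double trainFraction) {
        return (int) (size * trainFraction);
    }

    // The leading trainFraction of the shuffled rows (a view, not a copy).
    public static <T> List<T> trainPart(List<T> shuffled, double trainFraction) {
        return shuffled.subList(0, cutIndex(shuffled.size(), trainFraction));
    }

    // The remaining rows after the train part (a view, not a copy).
    public static <T> List<T> testPart(List<T> shuffled, double trainFraction) {
        return shuffled.subList(cutIndex(shuffled.size(), trainFraction), shuffled.size());
    }

    public static void main(String[] args) {
        // Stand-in for rows read from a CSV file.
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            rows.add("row-" + i);
        }

        List<String> shuffled = shuffledCopy(rows, 42L);
        List<String> train = trainPart(shuffled, 0.8);
        List<String> test = testPart(shuffled, 0.8);

        System.out.println("train size = " + train.size() + ", test size = " + test.size());
    }
}
```

Fixing the seed makes the split reproducible between runs. Note that subList() returns views backed by the shuffled list, so wrap them in new ArrayList instances if that list will be modified afterwards.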