(1) If you want to experiment with different data sizes, I recommend using our data generator for linear regression:
https://github.com/apache/incubator-systemml/blob/master/scripts/datagen/genRandData4LinearRegression.dml
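For a rough idea of the dimensions to ask the generator for, here is a small sizing sketch. It assumes a dense matrix of 8-byte doubles and ignores any file-format overhead; both are assumptions on my side, and the function name is hypothetical (it is not part of SystemML).

```python
def rows_for_target_size(target_bytes, num_features, bytes_per_value=8):
    """Estimate how many rows a dense matrix needs to reach target_bytes.

    Assumes every cell is stored as one value of bytes_per_value bytes
    (dense doubles by default) and ignores format overhead.
    """
    return target_bytes // (num_features * bytes_per_value)


# Example: roughly 30 GB of features with 1,000 columns.
print(rows_for_target_size(30 * 1024**3, 1000))
```

Sparse data or a text format such as CSV would change the estimate considerably.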

(2) SystemML does not expose these physical data properties to the user (on purpose, to ensure data independence). However, if you are curious: on checkpoints, SystemML does a coalesce to reduce the number of partitions to <data size> / <hdfs block size>, but only if this does not reduce the effective degree of parallelism. For example, if you have only 8GB of data, a 128MB HDFS block size, 100 cores, and currently 90 partitions, we would not reduce this to 8GB/128MB=64 partitions because 64<100.
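The rule above can be paraphrased as a small sketch. This is only an illustration of the decision as described in this mail, not SystemML's actual code; the function and parameter names are hypothetical, and rounding up with ceil is my assumption.

```python
import math


def partitions_after_checkpoint(data_bytes, block_bytes, current_partitions, num_cores):
    """Return the partition count after the (paraphrased) coalesce rule.

    The candidate is data size / HDFS block size, rounded up. It is used
    only if it actually reduces the partition count and still leaves at
    least one partition per core.
    """
    candidate = math.ceil(data_bytes / block_bytes)
    if candidate < current_partitions and candidate >= num_cores:
        return candidate
    return current_partitions


GB = 1024**3
MB = 1024**2

# The example from the mail: 8GB data, 128MB blocks, 90 partitions, 100 cores.
# The candidate is 64, which is below 100 cores, so the 90 partitions are kept.
print(partitions_after_checkpoint(8 * GB, 128 * MB, 90, 100))
```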

(3) The whole point of running stepwise linear regression is feature selection, so you'll get the selected features and the estimated model parameters for these features, as well as some information on the selection process. You can evaluate this model on a hold-out test set or run some form of cross-validation. However, keep in mind that for accuracy experiments you need to be very careful with random data.
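A hold-out evaluation could look roughly like the following sketch. It assumes you have already read the selected feature indices and their weights from the script's output into Python lists; the helper names and the toy data are hypothetical and only serve as an illustration.

```python
def predict(rows, selected, weights, intercept=0.0):
    """Apply a linear model that uses only the selected feature columns."""
    return [
        intercept + sum(w * row[j] for j, w in zip(selected, weights))
        for row in rows
    ]


def r_squared(y_true, y_pred):
    """Coefficient of determination on held-out data."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy hold-out set: three rows, three features, of which columns 0 and 2 were selected.
X_test = [[1.0, 0.0, 2.0], [2.0, 5.0, 4.0], [3.0, 1.0, 6.0]]
y_test = [2.0, 4.0, 6.0]
y_hat = predict(X_test, selected=[0, 2], weights=[1.0, 0.5])
print(r_squared(y_test, y_hat))
```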

Regards,
Matthias

From: Wenjie Zhuang <kaito@vt.edu>
To: Matthias Boehm/Almaden/IBM@IBMUS
Cc: dev@systemml.incubator.apache.org
Date: 04/03/2016 06:29 AM
Subject: Re: Guides about running SystemML by spark cluster





Thanks a lot. I also have some other questions. Could you please help me figure them out?

1. If I want the input size to be 30G, how can I set it? I guess I should change the parameters X, Y, and B, but I'm not sure which script to use.

2. Do you know how to control the number of partitions when I run StepLinearRegDS.dml on Spark? Is there a configuration file where I can set the number of partitions?

3. What should the correct result be after running StepLinearRegDS.dml? When the program ends, what output do we get?

Thanks & Have a nice day!

On April 3, 2016 at 1:08 AM, "Matthias Boehm" <mboehm@us.ibm.com> wrote: