DML is quite amazing. I was wondering if there is a user-friendly (more
human-readable) version of the output from UnivarStats.dml? I ran
UnivarStats.dml on my data set that contains 7 variables: two continuous,
one categorical. The output is a CSV file on HDFS that looks like this:
1 1 10.0
2 1 123.0
2 7 469.0
3 1 122.0
3 7 419.0
4 1 34.852512104922082
4 7 0.40786451178676335
5 1 613.6600902369631
5 7 1.5322171660886
6 1 25.566777079580508
6 7 5.54382044429201915
7 1 0.219263232610989764
7 7 12.14558700418414E4
8 1 0.5323447433694138
8 7 1.23151883029726626
9 1 0.28352047550156284
9 7 23.25049533659206
10 1 0.5348573740280274
10 7 2023.294658877635
11 1 2.874872545380876E4
11 7 1.874872545380876E4
12 1 6.0017749742760714085
12 7 0.00237749742760714085
13 1 12.0
14 1 30.56066514110724
15 2 4.0
... (truncated; numbers randomly modified)
According to the documentation on
http://apache.github.io/incubatorsystemml/algorithmsdescriptivestatistics.html#univariatestatistics
, the first column of the matrix gives the statistic type (minimum,
mean, etc.), the second column gives the variable ID, and the last column
gives the value of that statistic.
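
To make the layout concrete, here is a minimal Python sketch (not part of SystemML) that reads such lines into a lookup keyed by (statistic ID, variable ID). The function name parse_triples is hypothetical, and the sketch assumes each line holds exactly three fields separated by whitespace or commas, as in the excerpt above.

```python
import re

def parse_triples(text):
    """Parse 'stat_id var_id value' lines into {(stat_id, var_id): value}."""
    cells = {}
    for line in text.splitlines():
        # Assumption: fields are separated by whitespace or commas.
        parts = re.split(r"[,\s]+", line.strip())
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        stat_id, var_id, value = int(parts[0]), int(parts[1]), float(parts[2])
        cells[(stat_id, var_id)] = value
    return cells

# A few lines copied from the excerpt above.
sample = """1 1 10.0
2 1 123.0
2 7 469.0
15 2 4.0"""

cells = parse_triples(sample)
print(cells[(2, 7)])  # value of statistic 2 for variable 7
```

Each lookup still needs the reference table to tell which statistic an ID stands for, which is the inconvenience described below.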
While the documentation is very clear and the results are consistent with
the output of other software like R, I found the format a bit inconvenient
since I have to refer to the reference table (Table 1 at the aforementioned
link) to understand the summary statistics.
I understand that the pure numeric matrix format is easy to use as machine
input for future steps. An additional table that is more human readable
would be nice since the main purpose of univariate statistics is often
exploratory data analysis and a clear summary is essential.
Suggestions to consider in the readable summary if there's not already
one:
1. Order the rows according to variables (column 2) instead of statistics
type (column 1), so that summary statistics of the same variable are
grouped together.
2. Use actual statistic labels ("min", "mean", "skewness", etc.) instead of
IDs (1, 2, etc.).
3. Use actual predictor labels ("age", "gender", etc.) instead of IDs (1, 2,
etc.).
4. Use level labels for categorical predictors ("male", "female", etc.)
instead of IDs (1, 2, etc.).
5. Add counts of cases in each level for each categorical variable in
addition to the mode. This gives the distribution of the variable.
6. If the amount of data in the summary is manageable, perhaps
automatically pull the output of UnivarStats.dml from HDFS to the local
machine and display the readable version in the terminal?
So the output could look like:
age min 10
age max 123
age range 113
age mean 60
...
gender female.count 1000
gender male.count 2000
gender mode male
...
or even a table format like in R:
        age       gender
min     10        female  1000
max     123       male    2000
range   113       mode    male
mean    60        ...
...
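
As a rough illustration of suggestions 1 to 3, the Python sketch below groups (statistic ID, variable ID, value) triples by variable and swaps IDs for labels. It is only a sketch under stated assumptions: readable_summary is a hypothetical helper, the variable labels are made up, and the statistic labels for IDs 1 to 4 are inferred from the example above; the rest should be filled in from Table 1 of the documentation.

```python
from collections import defaultdict

# Assumption: IDs 1-4 inferred from the example above; complete from Table 1.
STAT_LABELS = {1: "min", 2: "max", 3: "range", 4: "mean"}
# Hypothetical variable labels, supplied by the user.
VAR_LABELS = {1: "age", 2: "gender"}

def readable_summary(triples, stat_labels, var_labels):
    """Return labeled lines, grouped by variable and then by statistic ID."""
    by_var = defaultdict(list)
    for stat_id, var_id, value in triples:
        by_var[var_id].append((stat_id, value))
    lines = []
    for var_id in sorted(by_var):
        # Fall back to a generic name when no label is known.
        var = var_labels.get(var_id, f"var_{var_id}")
        for stat_id, value in sorted(by_var[var_id]):
            stat = stat_labels.get(stat_id, f"stat_{stat_id}")
            lines.append(f"{var:<8}{stat:<10}{value:g}")
    return lines

triples = [(1, 1, 10.0), (2, 1, 123.0), (3, 1, 113.0), (15, 2, 4.0)]
summary = readable_summary(triples, STAT_LABELS, VAR_LABELS)
print("\n".join(summary))
```

Level labels and per-level counts (suggestions 4 and 5) are not covered here, since they would need information that the current output matrix does not contain.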
Thanks much,
Ethan Xu
