systemml-dev mailing list archives

From Deron Eriksson <deroneriks...@gmail.com>
Subject Re: User friendly output of univariate statistics
Date Wed, 03 Feb 2016 06:13:15 GMT
Hi Ethan,

I think you make a great point about the readability of the output from
Univar-Stats.dml.

Would the most useful behavior be to print the user-friendly results, in
the format you describe, to the console while still writing the more
mathematical results to a file? Or would you like the user-friendly results
to be written to a file as well?

Also, do you think a single user-friendly format is sufficient, or would
data scientists like (or expect) to be able to choose among multiple
formats, such as the two you described?

The table format is very interesting. Currently DML has a basic print
statement, but I don't believe it can be used to format data into columns,
as in your table format example. It might be very nice to add a C-style
"printf" statement, which would allow results to be written to the console
in a more columnar format.
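
To illustrate what printf-style width specifiers would buy, here is a
minimal sketch in Python rather than DML (DML has no such statement at the
time of writing, so nothing below is DML syntax). The rows reuse the values
from the example output in the quoted message; the function name format_row
and the column widths are assumptions chosen for illustration.

```python
# Rows of (variable, statistic, value), taken from the example in this thread.
rows = [
    ("age", "min", 10),
    ("age", "max", 123),
    ("age", "range", 113),
    ("age", "mean", 60),
]


def format_row(variable, statistic, value):
    # "%-8s" left-justifies text in an 8-character column and "%8s"
    # right-justifies it, mirroring C printf width specifiers.
    return "%-8s%-8s%8s" % (variable, statistic, value)


lines = [format_row(*row) for row in rows]
print("\n".join(lines))
```

Fixed widths keep every row the same length, so the values line up in
columns regardless of how long each label is (up to the chosen width).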

Does anyone else have any thoughts?

Deron


On Tue, Feb 2, 2016 at 8:32 AM, Ethan Xu <ethanxu@us.ibm.com> wrote:

> DML is quite amazing. I was wondering if there is a user-friendly (more
> human-readable) version of the output from Univar-Stats.dml? I ran
> Univar-Stats.dml on my data set that contains 7 variables: two continuous,
> one categorical. The output is a CSV file on HDFS that looks like this:
>
> 1 1 10.0
> 2 1 123.0
> 2 7 469.0
> 3 1 122.0
> 3 7 419.0
> 4 1 34.852512104922082
> 4 7 0.40786451178676335
> 5 1 613.6600902369631
> 5 7 1.5322171660886
> 6 1 25.566777079580508
> 6 7 5.54382044429201915
> 7 1 0.219263232610989764
> 7 7 12.14558700418414E-4
> 8 1 0.5323447433694138
> 8 7 1.23151883029726626
> 9 1 0.28352047550156284
> 9 7 23.25049533659206
> 10 1 -0.5348573740280274
> 10 7 2023.294658877635
> 11 1 2.874872545380876E-4
> 11 7 1.874872545380876E-4
> 12 1 6.0017749742760714085
> 12 7 0.00237749742760714085
> 13 1 12.0
> 14 1 30.56066514110724
> 15 2 4.0
> ---- truncated (numbers randomly modified)
>
> According to the documentation on
>
> http://apache.github.io/incubator-systemml/algorithms-descriptive-statistics.html#univariate-statistics
> , the first column of the matrix represents statistics type (minimum,
> mean, etc.), the second column represents variable ID and the last column
> gives the statistics value.
>
> While the documentation is very clear and the results are consistent with
> the output of other software like R, I found the format a bit inconvenient
> since I have to refer to the reference table (Table 1 in the
> aforementioned link) to understand the summary statistics.
>
> I understand that the pure numeric matrix format is easy to use as machine
> input for future steps. An additional table that is more human-readable
> would be nice, since the main purpose of univariate statistics is often
> exploratory data analysis, and a clear summary is essential.
>
> Suggestions to consider in the readable summary if there's not already
> one:
> 1. Order the rows according to variables (column 2) instead of statistics
> type (column 1), so that summary statistics of the same variable are
> grouped together.
> 2. Use actual statistics labels ("min", "mean", "skewness" etc) instead of
> IDs (1, 2, etc).
> 3. Use actual predictor labels ("age", "gender", etc) instead of IDs (1,2,
> etc).
> 4. Use level labels for categorical predictors ("male", "female", etc)
> instead of IDs (1,2, etc).
> 5. Add counts of cases in each level for categorical variable in addition
> to modes. This gives the distribution information of the variable.
> 6. If the amount of data in the summary is manageable, perhaps
> automatically pull the output of Univar-Stats.dml from HDFS to the local
> machine and display the readable version on the terminal?
>
> So the output could look like:
>
> age min 10
> age max 123
> age range 113
> age mean 60
> ...
> gender female.count 1000
> gender male.count 2000
> gender mode male
> ...
>
> or even a table format like in R:
>
> age            gender
> min    10      female 1000
> max    123     male   2000
> range  113     mode   male
> mean   60      ...
> ...
>
> Thanks much,
>
> Ethan Xu
>
>
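
Suggestions 1 to 3 in the quoted message (group rows by variable, and
replace statistic and variable IDs with labels) can be sketched as a small
post-processing step over the (statistic ID, variable ID, value) triples.
This is a Python sketch, not part of Univar-Stats.dml. The label mappings
are assumptions for illustration: the statistic labels should be checked
against Table 1 of the linked documentation, and "age" is a hypothetical
name for variable 1. IDs without a label fall back to a generic name.

```python
# Assumed ID-to-label mappings (illustrative only; verify against Table 1).
STAT_LABELS = {1: "min", 2: "max", 3: "range", 4: "mean"}
VAR_LABELS = {1: "age"}  # hypothetical variable name


def parse_triples(text):
    """Parse whitespace-separated 'stat_id var_id value' lines."""
    triples = []
    for line in text.strip().splitlines():
        stat_id, var_id, value = line.split()
        triples.append((int(stat_id), int(var_id), float(value)))
    return triples


def readable_summary(triples):
    """Order rows by variable, then statistic, and attach labels."""
    ordered = sorted(triples, key=lambda t: (t[1], t[0]))
    return [
        (VAR_LABELS.get(v, "var_%d" % v), STAT_LABELS.get(s, "stat_%d" % s), x)
        for s, v, x in ordered
    ]


# First few rows of the sample output quoted above.
raw = """
1 1 10.0
2 1 123.0
2 7 469.0
3 1 122.0
3 7 419.0
"""

summary = readable_summary(parse_triples(raw))
for variable, statistic, value in summary:
    print("%-8s%-8s%12g" % (variable, statistic, value))
```

Sorting on (variable ID, statistic ID) keeps all statistics for one
variable together while preserving the documented order of the statistics
within each group.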
