systemml-dev mailing list archives

From "Ethan Xu" <etha...@us.ibm.com>
Subject Re: User friendly output of univariate statistics
Date Thu, 04 Feb 2016 05:09:00 GMT
Thank you, Deron. In my experience, printing a single user-friendly
format to the console is usually enough for a quick inspection. However,
that holds in an interactive environment (like an R interactive session),
where recreating the printout is simple.
 
Since calling a DML script on Hadoop might trigger a MapReduce job, maybe
it's better to save the user-friendly version as a file too? Or perhaps
it would help to have a script that takes the original summary (plus some
metadata) as input and produces the user-friendly output?
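
As a rough sketch of such a script (Python here, assuming the summary file has already been pulled from HDFS to the local machine): the function names and label maps below are made up for illustration, and the statistic IDs are my assumption from the sample output, so they should be checked against Table 1 in the documentation.

```python
from collections import defaultdict

# Hypothetical metadata. The ID-to-label mappings are assumptions for
# illustration; verify statistic IDs against Table 1 in the documentation.
STAT_LABELS = {1: "min", 2: "max", 3: "range", 4: "mean"}
VAR_LABELS = {1: "age"}


def parse_triples(lines):
    """Parse 'stat_id var_id value' lines (space- or comma-separated)."""
    triples = []
    for line in lines:
        parts = line.replace(",", " ").split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        triples.append((int(parts[0]), int(parts[1]), float(parts[2])))
    return triples


def readable_summary(triples, stat_labels, var_labels):
    """Group rows by variable, then order by statistic ID within each one."""
    by_var = defaultdict(list)
    for stat_id, var_id, value in triples:
        by_var[var_id].append((stat_id, value))
    out = []
    for var_id in sorted(by_var):
        # Fall back to a generic name when no label is supplied.
        var_name = var_labels.get(var_id, f"var{var_id}")
        for stat_id, value in sorted(by_var[var_id]):
            stat_name = stat_labels.get(stat_id, f"stat{stat_id}")
            out.append(f"{var_name} {stat_name} {value:g}")
    return out


sample = ["1 1 10.0", "2 1 123.0", "2 7 469.0", "3 1 122.0"]
for row in readable_summary(parse_triples(sample), STAT_LABELS, VAR_LABELS):
    print(row)
```

Keeping the numeric matrix as the only thing the DML script writes, and doing the relabeling in a separate step like this, means the machine-readable output stays unchanged for downstream use.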
 
Best,
 
Ethan



From:   Deron Eriksson <deroneriksson@gmail.com>
To:     dev@systemml.incubator.apache.org
Date:   02/03/2016 01:13 AM
Subject:        Re: User friendly output of univariate statistics



Hi Ethan,

I think you make a great point with regard to the readability of the
output from Univar-Stats.dml.

Do you think outputting the user-friendly results in the format you
describe to the console, while still writing the more mathematical results
to a file, would be the behavior you would find most useful? Or would you
also like to see the user-friendly results sent to a file?

Also, I was wondering, do you think a single user-friendly format is
sufficient, or do you think that data scientists would like (or expect) to
be able to choose among multiple formats, such as those you described?

The table format is very interesting. Currently DML has a basic print
statement, but I don't believe it can be used to format data into columns,
such as in your table format example. It might be very nice to add a
C-style "printf" statement, which would allow results to be written to the
console in a more columnar format.
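
To illustrate the kind of columnar output a printf-style statement makes possible, here is a minimal sketch in Python rather than DML (DML has no such statement today, so this is only an analogy; the function name and column widths are hypothetical):

```python
def format_table(rows, widths=(8, 8, 12)):
    """Format (variable, statistic, value) rows into fixed-width columns.

    Uses C-style format specifiers: left-aligned strings for the labels
    and a right-aligned float with four decimals for the value.
    """
    fmt = f"%-{widths[0]}s%-{widths[1]}s%{widths[2]}.4f"
    return [fmt % row for row in rows]


rows = [("age", "min", 10.0), ("age", "max", 123.0), ("age", "mean", 60.0)]
for line in format_table(rows):
    print(line)
```

Fixed widths keep the value column aligned regardless of label length, which is what the plain print statement cannot do.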

Does anyone else have any thoughts?

Deron


On Tue, Feb 2, 2016 at 8:32 AM, Ethan Xu <ethanxu@us.ibm.com> wrote:

> DML is quite amazing. I was wondering if there is a user-friendly (more
> human-readable) version of the outputs from Univar-Stats.dml? I ran
> Univar-Stats.dml on my data set that contains 7 variables: two continuous,
> one categorical. The output is a CSV file on HDFS that looks like this:
>
> 1 1 10.0
> 2 1 123.0
> 2 7 469.0
> 3 1 122.0
> 3 7 419.0
> 4 1 34.852512104922082
> 4 7 0.40786451178676335
> 5 1 613.6600902369631
> 5 7 1.5322171660886
> 6 1 25.566777079580508
> 6 7 5.54382044429201915
> 7 1 0.219263232610989764
> 7 7 12.14558700418414E-4
> 8 1 0.5323447433694138
> 8 7 1.23151883029726626
> 9 1 0.28352047550156284
> 9 7 23.25049533659206
> 10 1 -0.5348573740280274
> 10 7 2023.294658877635
> 11 1 2.874872545380876E-4
> 11 7 1.874872545380876E-4
> 12 1 6.0017749742760714085
> 12 7 0.00237749742760714085
> 13 1 12.0
> 14 1 30.56066514110724
> 15 2 4.0
> ---- truncated (numbers randomly modified)
>
> According to the documentation on
>
> http://apache.github.io/incubator-systemml/algorithms-descriptive-statistics.html#univariate-statistics
>
> , the first column of the matrix represents statistics type (minimum,
> mean, etc.), the second column represents variable ID and the last column
> gives the statistics value.
>
> While the documentation is very clear and the results are consistent with
> outputs of other software like R, I found the format a bit inconvenient
> since I have to refer to the reference table (Table 1 in the aforementioned
> link) to understand the summary statistics.
>
> I understand that the pure numeric matrix format is easy to use as machine
> input for future steps. An additional table that is more human-readable
> would be nice, since the main purpose of univariate statistics is often
> exploratory data analysis, and a clear summary is essential.
>
> Suggestions to consider in the readable summary if there's not already
> one:
> 1. Order the rows according to variables (column 2) instead of statistics
> type (column 1), so that summary statistics of the same variable are
> grouped together.
> 2. Use actual statistics labels ("min", "mean", "skewness", etc.) instead
> of IDs (1, 2, etc.).
> 3. Use actual predictor labels ("age", "gender", etc.) instead of IDs
> (1, 2, etc.).
> 4. Use level labels for categorical predictors ("male", "female", etc.)
> instead of IDs (1, 2, etc.).
> 5. Add counts of cases in each level for each categorical variable, in
> addition to modes. This gives the distribution information of the variable.
> 6. If the amount of data in the summary is manageable, perhaps
> automatically pull the output of Univar-Stats.dml from HDFS to the local
> machine and display the readable version on the terminal?
>
> So the output could look like:
>
> age min 10
> age max 123
> age range 113
> age mean 60
> ...
> gender female.count 1000
> gender male.count 2000
> gender mode male
> ...
>
> or even a table format like in R:
>
> age              gender
> min    10        female 1000
> max    123       male   2000
> range  113       mode   male
> mean   60        ...
> ...
> Thanks much,
>
> Ethan Xu
>
>




