systemml-dev mailing list archives

From Ethan Xu <ethan.yifa...@gmail.com>
Subject Re: User friendly output of univariate statistics
Date Fri, 05 Feb 2016 03:33:58 GMT
Thanks for the clarification, Shirish. Is the current 'ijv'-format output
matrix of Univar-Stats.dml used by any other built-in script?

If not, I'd like to suggest a small change, in addition to (or instead of)
the user-friendly version, that makes the output easier to read: switch 'i'
and 'j' in the output. That is, order the rows of the matrix by variable
(the original j column) and then by statistic type (the original i column).
This way the information for each variable is grouped together.

There might be situations where grouping by statistic type makes more
sense, but I feel the other way is more commonly used.
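
As a rough illustration of the reordering described above, here is a minimal
Python sketch that works on the 'ijv' text output outside of SystemML. The
function names parse_ijv and group_by_variable are hypothetical helpers, not
part of SystemML, and the sample rows are the first few lines of the output
quoted further down this thread.

```python
def parse_ijv(text):
    """Parse whitespace-separated 'i j v' lines into (stat_id, var_id, value) tuples."""
    triples = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        triples.append((int(parts[0]), int(parts[1]), float(parts[2])))
    return triples


def group_by_variable(triples):
    """Swap i and j, then sort so rows are ordered by variable, then by statistic."""
    return sorted((j, i, v) for i, j, v in triples)


# Sample rows in the current layout: statistic ID, variable ID, value.
sample = """\
1 1 10.0
2 1 123.0
2 7 469.0
3 1 122.0
3 7 419.0
"""

# Print rows as: variable ID, statistic ID, value.
for var_id, stat_id, value in group_by_variable(parse_ijv(sample)):
    print(var_id, stat_id, value)
```

With this ordering, all rows for variable 1 come first, followed by all rows
for variable 7.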

Ethan

On Thu, Feb 4, 2016 at 10:10 PM, Shirish Tatikonda <
shirish.tatikonda@gmail.com> wrote:

> Just to clarify: the current output is actually a matrix, in which rows
> denote stats and columns denote input variables. So, the output you see is
> simply the univariate stats matrix in IJV format.
> In the general case, the primary data type for input/output and computations
> in SystemML is a *matrix* (and of course *scalar* as well), with the one
> exception of the *frame* type (which is used only in the context of
> *transform*).
>
> I agree with you that providing user-friendly output as in R output is very
> useful for data scientists -- it however requires a lot of effort to
> support such a functionality.
>
> Shirish
>
> On Wed, Feb 3, 2016 at 9:09 PM, Ethan Xu <ethanxu@us.ibm.com> wrote:
>
> > Thank you, Deron. In my experience, printing a single type of
> > user-friendly result to the console is usually enough for a quick
> > inspection. However, that is in an interactive environment (like an R
> > interactive session), where recreating the printout is simple.
> >
> > Since calling a DML script on Hadoop might trigger a MapReduce job, maybe
> > it's better to save the user-friendly version as a file too? Or perhaps
> > it's helpful to have a script that takes the original summary (plus some
> > metadata) as input and produces the user-friendly output?
> >
> > Best,
> >
> > Ethan
> >
> >
> >
> > From:   Deron Eriksson <deroneriksson@gmail.com>
> > To:     dev@systemml.incubator.apache.org
> > Date:   02/03/2016 01:13 AM
> > Subject:        Re: User friendly output of univariate statistics
> >
> >
> >
> > Hi Ethan,
> >
> > I think you make a great point with regard to the readability of the
> > output from Univar-Stats.dml.
> >
> > Do you think outputting the user-friendly results in the format you
> > describe to the console while still writing the more mathematical results
> > to a file would be the type of behavior that you would find most useful?
> > Or would you also like to see the user-friendly results sent to a file?
> >
> > Also, I was wondering, do you think a single user-friendly format is
> > sufficient, or do you think that data scientists would like (or expect)
> > to be able to have multiple formats such as you described?
> >
> > The table format is very interesting. Currently DML has a basic print
> > statement, but I don't believe it can be used to format data into
> > columns, such as in your table format example. It might be very nice to
> > add a C-style "printf" statement, which would allow results to be written
> > to the console in a more columnar format.
> >
> > Does anyone else have any thoughts?
> >
> > Deron
> >
> >
> > On Tue, Feb 2, 2016 at 8:32 AM, Ethan Xu <ethanxu@us.ibm.com> wrote:
> >
> > > DML is quite amazing. I was wondering if there is a user-friendly (more
> > > human-readable) version of the output from Univar-Stats.dml? I ran
> > > Univar-Stats.dml on my data set that contains 7 variables: two
> > > continuous, one categorical. The output is a csv file on HDFS that looks
> > > like this:
> > >
> > > 1 1 10.0
> > > 2 1 123.0
> > > 2 7 469.0
> > > 3 1 122.0
> > > 3 7 419.0
> > > 4 1 34.852512104922082
> > > 4 7 0.40786451178676335
> > > 5 1 613.6600902369631
> > > 5 7 1.5322171660886
> > > 6 1 25.566777079580508
> > > 6 7 5.54382044429201915
> > > 7 1 0.219263232610989764
> > > 7 7 12.14558700418414E-4
> > > 8 1 0.5323447433694138
> > > 8 7 1.23151883029726626
> > > 9 1 0.28352047550156284
> > > 9 7 23.25049533659206
> > > 10 1 -0.5348573740280274
> > > 10 7 2023.294658877635
> > > 11 1 2.874872545380876E-4
> > > 11 7 1.874872545380876E-4
> > > 12 1 6.0017749742760714085
> > > 12 7 0.00237749742760714085
> > > 13 1 12.0
> > > 14 1 30.56066514110724
> > > 15 2 4.0
> > > ---- truncated (numbers randomly modified)
> > >
> > > According to the documentation at
> > >
> > > http://apache.github.io/incubator-systemml/algorithms-descriptive-statistics.html#univariate-statistics
> > >
> > > the first column of the matrix represents the statistic type (minimum,
> > > mean, etc.), the second column represents the variable ID, and the last
> > > column gives the statistic value.
> > >
> > > While the documentation is very clear and the results are consistent
> > > with outputs of other software like R, I found the format a bit
> > > inconvenient since I have to refer to the reference table (Table 1 in the
> > > aforementioned link) to understand the summary statistics.
> > >
> > > I understand that the pure numeric matrix format is easy to use as
> > > machine input for future steps. An additional table that is more
> > > human-readable would be nice, since the main purpose of univariate
> > > statistics is often exploratory data analysis and a clear summary is
> > > essential.
> > >
> > > Suggestions to consider in the readable summary, if there's not already
> > > one:
> > > 1. Order the rows according to variables (column 2) instead of
> > > statistics type (column 1), so that summary statistics of the same
> > > variable are grouped together.
> > > 2. Use actual statistics labels ("min", "mean", "skewness", etc.) instead
> > > of IDs (1, 2, etc.).
> > > 3. Use actual predictor labels ("age", "gender", etc.) instead of IDs
> > > (1, 2, etc.).
> > > 4. Use level labels for categorical predictors ("male", "female", etc.)
> > > instead of IDs (1, 2, etc.).
> > > 5. Add counts of cases in each level for categorical variables in
> > > addition to modes. This gives the distribution information of the
> > > variable.
> > > 6. If the amount of data in the summary is manageable, perhaps
> > > automatically pull the output of Univar-Stats.dml from HDFS to the local
> > > machine and display the readable version on the terminal?
> > >
> > > So the output could look like:
> > >
> > > age min 10
> > > age max 123
> > > age range 113
> > > age mean 60
> > > ...
> > > gender female.count 1000
> > > gender male.count 2000
> > > gender mode male
> > > ...
> > >
> > > or even a table format like in R:
> > >
> > > age            gender
> > > min    10      female 1000
> > > max    123     male   2000
> > > range  113     mode   male
> > > mean   60      ...
> > > ...
> > > Thanks much,
> > >
> > > Ethan Xu
> > >
> > >
> >
> >
> >
> >
> >
>



-- 
Yifan "Ethan" Xu, PhD

Data Scientist / Statistician
Explorys, IBM Watson Health

Adjunct Faculty
Department of Epidemiology and Biostatistics
Case Western Reserve University

--------------
Email: ethan.yifanxu@gmail.com
Phone: (607) 760-6817
--------------
