systemml-dev mailing list archives

From Shirish Tatikonda <shirish.tatiko...@gmail.com>
Subject Re: User friendly output of univariate statistics
Date Fri, 05 Feb 2016 03:10:43 GMT
Just to clarify: the current output is actually a matrix, in which rows
denote stats and columns denote input variables. So, the output you see is
simply the univariate stats matrix in IJV format.
In general, the primary data type for input/output and computations
in SystemML is a *matrix* (and, of course, a *scalar*) -- the one
exception is the *frame* type, which is used only in the context of
*transform*.
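
To make the IJV layout concrete, here is a minimal Python sketch (not part of SystemML) that reads "row column value" triples and regroups them by variable. The names `parse_ijv`, `by_variable`, and `STAT_LABELS` are hypothetical, and the label mapping covers only the first four rows as an assumption; please check it against Table 1 of the Univar-Stats documentation.

```python
from collections import defaultdict

# Assumed labels for the first few statistic rows (an illustration only);
# verify against Table 1 of the Univar-Stats documentation.
STAT_LABELS = {1: "min", 2: "max", 3: "range", 4: "mean"}


def parse_ijv(text):
    """Parse 'row col value' lines into a {(row, col): value} dict."""
    cells = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        row, col, value = int(parts[0]), int(parts[1]), float(parts[2])
        cells[(row, col)] = value
    return cells


def by_variable(cells, var_names=None):
    """Group cells by column (variable), keeping statistics in row order."""
    var_names = var_names or {}
    grouped = defaultdict(list)
    # Sort by column first, then by row, so each variable's stats stay together.
    for (row, col), value in sorted(cells.items(), key=lambda kv: (kv[0][1], kv[0][0])):
        var = var_names.get(col, f"var_{col}")      # fall back to a generic name
        stat = STAT_LABELS.get(row, f"stat_{row}")  # fall back to the row ID
        grouped[var].append((stat, value))
    return dict(grouped)


sample = """1 1 10.0
2 1 123.0
2 7 469.0
4 1 34.85
"""
report = by_variable(parse_ijv(sample), var_names={1: "age"})
for var, stats in report.items():
    for stat, value in stats:
        print(var, stat, value)
```

Rows and columns without a known label fall back to `stat_N` and `var_N`, so the sketch never guesses a name it was not given.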

I agree with you that user-friendly output like R's is very useful for
data scientists. However, supporting such functionality requires a lot of
effort.

Shirish

On Wed, Feb 3, 2016 at 9:09 PM, Ethan Xu <ethanxu@us.ibm.com> wrote:

> Thank you Deron. From my personal experience printing a single type of
> user-friendly result on console is usually enough for a quick inspection.
> However that's in an interactive environment (like R interactive session),
> where recreating the printout is simple.
>
> Since calling a dml script on hadoop might trigger a MapReduce job, maybe
> it's better to save the user-friendly version as a file too? Or perhaps
> it's helpful to have a script that takes the original summary (plus some
> metadata) as input, and produces the user-friendly output?
>
> Best,
>
> Ethan
>
>
>
> From:   Deron Eriksson <deroneriksson@gmail.com>
> To:     dev@systemml.incubator.apache.org
> Date:   02/03/2016 01:13 AM
> Subject:        Re: User friendly output of univariate statistics
>
>
>
> Hi Ethan,
>
> I think you make a great point with regard to the readability of the
> output from Univar-Stats.dml.
>
> Do you think outputting the user-friendly results in the format you
> describe to the console while still writing the more mathematical results
> to a file would be the type of behavior that you would find most useful?
> Or would you like to see the user-friendly results also sent to a file?
>
> Also, I was wondering, do you think a single user-friendly format is
> sufficient, or do you think that data scientists would like (or expect) to
> be able to have multiple formats such as you described?
>
> The table format is very interesting. Currently DML has a basic print
> statement, but I don't believe it can be used to format data into columns,
> such as in your table format example. It might be very nice to add a
> c-style "printf" statement, which would allow results to be written to the
> console in a more columnar format.
>
> Does anyone else have any thoughts?
>
> Deron
>
>
> On Tue, Feb 2, 2016 at 8:32 AM, Ethan Xu <ethanxu@us.ibm.com> wrote:
>
> > dml is quite amazing. I was wondering if there is a user friendly (more
> > human readable) version of outputs from Univar-Stats.dml? I ran the
> > Univar-Stats.dml on my data set that contains 7 variables: two continuous,
> > one categorical. The output is a csv file on HDFS that looks like this:
> >
> > 1 1 10.0
> > 2 1 123.0
> > 2 7 469.0
> > 3 1 122.0
> > 3 7 419.0
> > 4 1 34.852512104922082
> > 4 7 0.40786451178676335
> > 5 1 613.6600902369631
> > 5 7 1.5322171660886
> > 6 1 25.566777079580508
> > 6 7 5.54382044429201915
> > 7 1 0.219263232610989764
> > 7 7 12.14558700418414E-4
> > 8 1 0.5323447433694138
> > 8 7 1.23151883029726626
> > 9 1 0.28352047550156284
> > 9 7 23.25049533659206
> > 10 1 -0.5348573740280274
> > 10 7 2023.294658877635
> > 11 1 2.874872545380876E-4
> > 11 7 1.874872545380876E-4
> > 12 1 6.0017749742760714085
> > 12 7 0.00237749742760714085
> > 13 1 12.0
> > 14 1 30.56066514110724
> > 15 2 4.0
> > ---- truncated (numbers randomly modified)
> >
> > According to the documentation at
> >
> > http://apache.github.io/incubator-systemml/algorithms-descriptive-statistics.html#univariate-statistics
> >
> > the first column of the matrix represents statistics type (minimum,
> > mean, etc.), the second column represents variable ID and the last column
> > gives the statistics value.
> >
> > While the documentation is very clear and the results are consistent with
> > outputs of other software like R, I found the format a bit inconvenient
> > since I have to refer to the reference Table (table 1 in aforementioned
> > link) to understand the summary statistics.
> >
> > I understand that the pure numeric matrix format is easy to use as machine
> > input for future steps. An additional table that is more human readable
> > would be nice since the main purpose of uni-variate statistics is often
> > exploratory data analysis and a clear summary is essential.
> >
> > Suggestions to consider in the readable summary if there's not already
> > one:
> > 1. Order the rows according to variables (column 2) instead of statistics
> > type (column 1), so that summary statistics of the same variable are
> > grouped together.
> > 2. Use actual statistics labels ("min", "mean", "skewness" etc) instead of
> > IDs (1, 2, etc).
> > 3. Use actual predictor labels ("age", "gender", etc) instead of IDs (1,2,
> > etc).
> > 4. Use level labels for categorical predictors ("male", "female", etc)
> > instead of IDs (1,2, etc).
> > 5. Add counts of cases in each level for categorical variable in addition
> > to modes. This gives the distribution information of the variable.
> > 6. If the amount of data in the summary is manageable perhaps
> > automatically pull the output of Univar-Stats.dml from HDFS to local
> > machine and display the readable version on terminal?
> >
> > So the output could look like:
> >
> > age min 10
> > age max 123
> > age range 113
> > age mean 60
> > ...
> > gender female.count 1000
> > gender male.count 2000
> > gender mode male
> > ...
> >
> > or even a table format like in R:
> >
> > age            gender
> > min    10      female 1000
> > max    123     male   2000
> > range  113     mode   male
> > mean   60      ...
> > ...
> >
> > Thanks much,
> >
> > Ethan Xu
> >
> >
>
>
>
>
>
