spark-issues mailing list archives

From "cold gin (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-22201) Dataframe describe includes string columns
Date Thu, 05 Oct 2017 13:20:02 GMT

    [ https://issues.apache.org/jira/browse/SPARK-22201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16192834#comment-16192834 ]

cold gin edited comment on SPARK-22201 at 10/5/17 1:19 PM:
-----------------------------------------------------------

Yes, it is only the default behavior that I think should be reversed; I have no problem
at all with supporting the stats for strings. If you look at the default output of the describe()
API, it produces several fields (count, mean, stddev, etc.) by default. For all of those
output attributes to be populated *by default*, you must include only numeric columns. This
simple evidence of what the default output produces is my argument for what should be included
as the default input. Supporting string columns is fine IMO, but it should be controlled with
an includeColTypes parameter and not included by default.
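
To make the proposal concrete, here is a minimal sketch of how a numeric-only default with an opt-in for strings could behave. The function name columns_to_describe and the include_col_types argument are hypothetical; they are not part of Spark and only illustrate the includeColTypes idea suggested above. The sketch assumes column metadata arrives as (name, type_name) pairs, which is the shape PySpark's DataFrame.dtypes returns, and it assumes the listed type names cover the numeric types of interest.

```python
# Hypothetical helper: neither this function nor an include_col_types
# parameter exists in Spark. It only sketches the proposed semantics.

# Assumed numeric type names as they appear in (name, type_name) pairs.
NUMERIC_TYPE_NAMES = {"tinyint", "smallint", "int", "bigint", "float", "double"}


def is_numeric(type_name):
    """Return True for numeric type names, including decimal(p,s)."""
    return type_name in NUMERIC_TYPE_NAMES or type_name.startswith("decimal")


def columns_to_describe(dtypes, include_col_types=("numeric",)):
    """Pick the column names whose type kind is listed in include_col_types.

    dtypes: list of (column_name, type_name) pairs.
    include_col_types: kinds to keep; "numeric" covers every numeric type,
    any other entry (e.g. "string") is matched against the type name itself.
    """
    selected = []
    for name, type_name in dtypes:
        kind = "numeric" if is_numeric(type_name) else type_name
        if kind in include_col_types:
            selected.append(name)
    return selected


# Invented example schema, for illustration only.
dtypes = [
    ("id", "bigint"),
    ("price", "decimal(10,2)"),
    ("code", "string"),
    ("active", "boolean"),
]

# Default: numeric columns only.
default_cols = columns_to_describe(dtypes)

# Opt in to string columns explicitly.
with_strings = columns_to_describe(dtypes, include_col_types=("numeric", "string"))
```

With an actual DataFrame df, the same selection could serve as a workaround today, along the lines of df.select(*columns_to_describe(df.dtypes)).describe().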


was (Author: cold-gin):
Yes, it is only the default behavior that I think should be reversed; I don't have a problem
at all with supporting the stats for strings. If you look at the *default* output of the describe()
api, it produces several fields (count, mean, stddev, etc) - BY DEFAULT. For all of those
output attributes to be populated *by default* you must include only numeric columns. This
simple evidence of what the default output produces is my argument for what should be included
as default input. Supporting string columns imo is fine, but should be controlled with an
includeColTypes parameter, and not included by default.

> Dataframe describe includes string columns
> ------------------------------------------
>
>                 Key: SPARK-22201
>                 URL: https://issues.apache.org/jira/browse/SPARK-22201
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.0
>            Reporter: cold gin
>
> As per the API documentation, the default no-arg DataFrame describe() function should
> only include numerical column types, but it is including String types as well. This creates
> unusable statistical results (for example, max returns "V8903" for one of the string columns
> in my dataset), and it also leads to stack traces when you run show() on the DataFrame
> returned from describe().
> There also appear to be several issues related to this:
> https://issues.apache.org/jira/browse/SPARK-16468
> https://issues.apache.org/jira/browse/SPARK-16429
> But SPARK-16429 does not make sense given what the default API documentation says; only Int,
> Double, etc. (numeric) columns should be included when generating the statistics.
> Perhaps this reveals the need for a new function to produce stats that make sense only
> for string columns, or else an additional parameter to describe() to filter in/out certain
> column types?
> In summary, the *default* describe API behavior (no-arg behavior) should not include
> string columns. Note that boolean columns are correctly excluded by describe().
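
The "V8903" result mentioned in the description is what one would expect if max on a string column compares values lexicographically rather than numerically. The plain-Python sketch below illustrates that effect; apart from "V8903", the values are invented for illustration, and Python's string ordering is assumed to match Spark's for ASCII data like this.

```python
# Invented sample codes; only "V8903" comes from the issue description.
codes = ["A100", "V10000", "V8903", "B7"]

# String max/min compare character by character, so "V8903" beats "V10000"
# because '8' sorts after '1' at the second character.
string_max = max(codes)
string_min = min(codes)

# Comparing the numeric suffixes instead picks a different value.
largest_by_number = max(codes, key=lambda code: int(code[1:]))

print(string_max)         # V8903
print(string_min)         # A100
print(largest_by_number)  # V10000
```

This is why min and max on identifier-like string columns are rarely meaningful as summary statistics.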




