spark-issues mailing list archives

From "Narine Kokhlikyan (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SPARK-12325) Inappropriate error messages in DataFrame StatFunctions
Date Mon, 14 Dec 2015 20:52:46 GMT
Narine Kokhlikyan created SPARK-12325:
-----------------------------------------

             Summary: Inappropriate error messages in DataFrame StatFunctions 
                 Key: SPARK-12325
                 URL: https://issues.apache.org/jira/browse/SPARK-12325
             Project: Spark
          Issue Type: Bug
          Components: SQL
            Reporter: Narine Kokhlikyan
            Priority: Critical


Hi there,

I mentioned this issue earlier in one of my pull requests for the SQL component, but I have
never received feedback on any of them.
https://github.com/apache/spark/pull/9366#issuecomment-155171975

Although this has been very frustrating, I'll try to list certain facts again:

1. I call the DataFrame correlation method, and the error says that the covariance calculation
is not supported. I do not think that this is an appropriate message to show here.

scala> df.stat.corr("rating", "income")
java.lang.IllegalArgumentException: requirement failed: Covariance calculation for columns
with dataType StringType not supported.
    at scala.Predef$.require(Predef.scala:233)
    at org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$collectStatisticalData$3.apply(StatFunctions.scala:81)


2. The biggest issue here is not the message shown, but the design.
A class called CovarianceCounter does the computations for both correlation and covariance.
This might be convenient from a certain perspective; however, it is harder to understand
and extend, especially if you want to add another algorithm, e.g. Spearman correlation.

There are many possible solutions here, starting from:
1. just fixing the message
2. fixing the message and renaming CovarianceCounter and the corresponding methods
3. creating a CorrelationCounter and splitting the computations for correlation and covariance

and many more ...
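
As a rough sketch of the third option, the shared running sums could live in one place while
each statistic gets its own named object. Everything below is hypothetical and is not taken
from Spark's source: StatSketch, Moments, PairStatistic, SampleCovariance, and
PearsonCorrelation are names invented for this example, and the one-pass update is a standard
online co-moment formula that I am assuming here, not necessarily what CovarianceCounter does.

```scala
object StatSketch {
  // Running sums for a pair of columns, updated one row at a time.
  final case class Moments(n: Long, meanX: Double, meanY: Double,
                           cXY: Double, m2X: Double, m2Y: Double) {
    def add(x: Double, y: Double): Moments = {
      val n1 = n + 1
      val dx = x - meanX
      val dy = y - meanY
      val newMeanX = meanX + dx / n1
      val newMeanY = meanY + dy / n1
      Moments(n1, newMeanX, newMeanY,
        cXY + dx * (y - newMeanY),   // co-moment of x and y
        m2X + dx * (x - newMeanX),   // sum of squared deviations of x
        m2Y + dy * (y - newMeanY))   // sum of squared deviations of y
    }
  }
  val empty: Moments = Moments(0L, 0.0, 0.0, 0.0, 0.0, 0.0)

  // Each statistic is a separate, named object, so error messages and
  // future additions (e.g. Spearman) do not have to go through "covariance".
  trait PairStatistic {
    def name: String
    def result(m: Moments): Double
  }
  object SampleCovariance extends PairStatistic {
    val name = "Covariance"
    def result(m: Moments): Double = m.cXY / (m.n - 1)
  }
  object PearsonCorrelation extends PairStatistic {
    val name = "Correlation"
    def result(m: Moments): Double = m.cXY / math.sqrt(m.m2X * m.m2Y)
  }

  def compute(stat: PairStatistic, rows: Seq[(Double, Double)]): Double =
    stat.result(rows.foldLeft(empty) { case (m, (x, y)) => m.add(x, y) })
}
```

A new algorithm would then be another PairStatistic (possibly with its own accumulator),
instead of another method on a class named after covariance.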

Since I'm not getting any response and, according to GitHub, all five of you have been working
on this, I'll try again:
[~brkyvz], [~rxin], [~davies], [~viirya], [~cloud_fan]

Could any of you please explain this behavior or communicate more about it?
In case you are planning to remove it or do something else, we'd truly appreciate it if you
let us know.

In fact, I would like to open a pull request for this, but since my pull requests in the SQL/ML
components are just sitting there without any response, I'll wait for your response first.

cc: [~shivaram], [~mengxr]

Thank you,
Narine




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


