spark-issues mailing list archives

From "Joseph K. Bradley (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-23686) Make better usage of org.apache.spark.ml.util.Instrumentation
Date Tue, 01 May 2018 21:14:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-23686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16460164#comment-16460164 ]

Joseph K. Bradley commented on SPARK-23686:
-------------------------------------------

[~yogeshgarg] made a good point: we should not convert all uses of Logging to Instrumentation.
If logging happens on executors, then we should not use the (non-serializable) Instrumentation
class there.  E.g.: https://github.com/apache/spark/blob/6782359a04356e4cde32940861bf2410ef37f445/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L1587
Also, these instances all seem to be at the DEBUG level, which is not really useful for users.
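
To illustrate the serialization concern, here is a minimal sketch in plain Java with no Spark dependency. It assumes that code shipped to executors goes through JVM serialization, so any non-serializable object it captures makes serialization fail. DriverOnlyInstr, SerFn and SerializationSketch are hypothetical names invented for this example; DriverOnlyInstr is only a stand-in and is not the real Instrumentation API.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.Function;

class SerializationSketch {
    // Returns true if Java serialization accepts the whole object graph.
    static boolean canSerialize(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    // Closure that captures the helper, so the helper must be serialized with it.
    static SerFn<Integer, Integer> capturing(DriverOnlyInstr instr) {
        return x -> {
            instr.logInfo("saw " + x);
            return x + 1;
        };
    }

    // Closure that captures nothing; any logging stays with the caller.
    static SerFn<Integer, Integer> pure() {
        return x -> x + 1;
    }

    public static void main(String[] args) {
        DriverOnlyInstr instr = new DriverOnlyInstr();
        System.out.println(canSerialize(capturing(instr)));
        System.out.println(canSerialize(pure()));
        // Log on the caller's side, after the closure has returned a value.
        instr.logInfo("result = " + pure().apply(41));
    }
}

// Hypothetical stand-in for a driver-side helper: deliberately NOT Serializable.
class DriverOnlyInstr {
    void logInfo(String msg) {
        System.out.println("[instr] " + msg);
    }
}

// A serializable function type, playing the role of a closure sent to executors.
interface SerFn<A, B> extends Function<A, B>, Serializable {}
```

In this sketch the capturing closure is rejected by serialization while the pure one is accepted, which is why executor-side log calls would need to stay on Logging (or be moved to the driver) rather than switch to a non-serializable helper.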

> Make better usage of org.apache.spark.ml.util.Instrumentation
> -------------------------------------------------------------
>
>                 Key: SPARK-23686
>                 URL: https://issues.apache.org/jira/browse/SPARK-23686
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.3.0
>            Reporter: Bago Amirbekian
>            Priority: Major
>
> This Jira is a bit high level and might require subtasks or other jiras for more specific tasks.
> I've noticed that we don't make the best usage of the instrumentation class. Specifically, we sometimes bypass the instrumentation class and use the logger directly instead. For example, [https://github.com/apache/spark/blob/9b9827759af2ca3eea146a6032f9165f640ce152/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala#L143]
> Also, there are some things that might be useful to log in the instrumentation class that we currently don't. For example:
> number of training examples
> mean/var of label (regression)
> I know computing these things can be expensive in some cases, but especially when this data is already available we can log it for free. For example, the Logistic Regression Summarizer computes some useful data, including numRows, that we don't log.
>  




