spark-issues mailing list archives

From "Kai Sasaki (JIRA)" <>
Subject [jira] [Commented] (SPARK-8419) Statistics.colStats could avoid an extra count()
Date Sun, 21 Jun 2015 13:48:00 GMT


Kai Sasaki commented on SPARK-8419:

In {{Statistics#colStats}}, the number of rows appears to be updated in {{computeColumnSummaryStatistics}}
via {{updateNumRows}}. That count is computed as part of the distributed aggregation inside
{{RDD#treeAggregate}}. So I think there is no extra {{count()}} when only creating a
{{RowMatrix}}. Is this assumption correct?
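
To illustrate the point, here is a minimal sketch in plain Python (not Spark code) of how a row count can fall out of the same aggregation pass that computes the column statistics. The names {{ColumnSummary}} and {{col_stats}} are hypothetical stand-ins, and the nested lists only imitate RDD partitions. The sketch assumes the summarizer increments its own counter in the per-row step, which is the behavior described above.

```python
from functools import reduce


class ColumnSummary:
    """Hypothetical stand-in for a per-column online summarizer."""

    def __init__(self, num_cols):
        self.count = 0
        self.sums = [0.0] * num_cols

    def add(self, row):
        # Per-row step (the seqOp role): the row counter is updated
        # in the same pass that accumulates the column sums.
        self.count += 1
        for j, value in enumerate(row):
            self.sums[j] += value
        return self

    def merge(self, other):
        # Combine step (the combOp role): merge two partial summaries.
        self.count += other.count
        self.sums = [a + b for a, b in zip(self.sums, other.sums)]
        return self

    def mean(self):
        return [s / self.count for s in self.sums]


def col_stats(partitions, num_cols):
    # One summary per partition, then a single merge of the partials.
    partials = [
        reduce(lambda acc, row: acc.add(row), part, ColumnSummary(num_cols))
        for part in partitions
    ]
    return reduce(lambda a, b: a.merge(b), partials, ColumnSummary(num_cols))


# Two simulated partitions holding three rows in total.
partitions = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0]]]
summary = col_stats(partitions, 2)
print(summary.count)   # row count obtained without a separate count()
print(summary.mean())
```

If the real aggregation works this way, the number of rows can be read off the merged summary, and no second pass over the data is needed to obtain it.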

> Statistics.colStats could avoid an extra count()
> ------------------------------------------------
>                 Key: SPARK-8419
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Joseph K. Bradley
>            Priority: Trivial
>              Labels: starter
> Statistics.colStats goes through RowMatrix to compute the stats.  But RowMatrix.computeColumnSummaryStatistics
> does an extra count() which could be avoided.  Not going through RowMatrix would skip this
> extra pass over the data.

