spark-reviews mailing list archives

From adrian-ionescu <>
Subject [GitHub] spark pull request #18884: [SPARK-21669] Internal API for collecting metrics...
Date Tue, 08 Aug 2017 14:56:10 GMT
GitHub user adrian-ionescu opened a pull request:

    [SPARK-21669] Internal API for collecting metrics/stats during FileFormatWriter jobs

    ## What changes were proposed in this pull request?
    This patch introduces an internal interface for tracking metrics and/or statistics on
data on the fly, as it is being written to disk during a `FileFormatWriter` job, and partially
reimplements SPARK-20703 in terms of it.
    The interface consists of three traits:
    - `WriteTaskStats`: a marker trait for classes that represent statistics collected during
a `WriteTask`.
      The only constraint it adds is that the class must be `Serializable`, as instances
of it are collected on the driver from all executors at the end of the `WriteJob`.
    - `WriteTaskStatsTracker`: a trait for classes that compute statistics from the
tuples processed by a given `WriteTask` and eventually produce a `WriteTaskStats`.
    - `WriteJobStatsTracker`: a trait for classes that hold the `Serializable` state
needed to instantiate a `WriteTaskStatsTracker` on each executor and that finally
process the resulting collection of `WriteTaskStats` once they are gathered back on the driver.
    One potential future use of this interface is maintaining CBO (cost-based optimizer)
statistics during `INSERT INTO table ... ` operations.
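    As a rough illustration of how the three traits described above could fit together, here
is a minimal, dependency-free sketch. It is not the code from this pull request: the `Row`
alias is a stand-in for Spark's `InternalRow`, and the method names (`newRow`, `getFinalStats`,
`newTaskInstance`, `processStats`) as well as the `RowCount*` classes are assumptions chosen for
the example. Only a per-row callback is shown; the commit log below also mentions `newPartition()`.

```scala
// Hypothetical sketch of the three-trait design; names and signatures are
// assumptions for illustration, not the API introduced by the patch.
object WriteStatsSketch {
  // Stand-in for Spark's InternalRow so the sketch compiles without Spark.
  type Row = Seq[Any]

  // Marker trait: stats must be Serializable to travel from executors to the driver.
  trait WriteTaskStats extends Serializable

  // Lives on an executor and observes the rows processed by one write task.
  trait WriteTaskStatsTracker {
    def newRow(row: Row): Unit
    def getFinalStats(): WriteTaskStats
  }

  // Serializable container: created on the driver, shipped to executors,
  // and later handed the stats collected from every task.
  trait WriteJobStatsTracker extends Serializable {
    def newTaskInstance(): WriteTaskStatsTracker
    def processStats(stats: Seq[WriteTaskStats]): Unit
  }

  // Example statistic: the number of rows written by a task.
  final case class RowCountStats(numRows: Long) extends WriteTaskStats

  final class RowCountTaskTracker extends WriteTaskStatsTracker {
    private var numRows = 0L
    override def newRow(row: Row): Unit = numRows += 1
    override def getFinalStats(): WriteTaskStats = RowCountStats(numRows)
  }

  final class RowCountJobTracker extends WriteJobStatsTracker {
    var totalRows: Long = 0L
    override def newTaskInstance(): WriteTaskStatsTracker = new RowCountTaskTracker
    // Driver-side aggregation of the per-task results.
    override def processStats(stats: Seq[WriteTaskStats]): Unit =
      totalRows = stats.collect { case RowCountStats(n) => n }.sum
  }

  // Simulates a write job locally: each inner Seq plays the role of one task's rows.
  def simulate(tasks: Seq[Seq[Row]]): Long = {
    val job = new RowCountJobTracker
    val perTaskStats = tasks.map { rows =>
      val tracker = job.newTaskInstance()
      rows.foreach(tracker.newRow)
      tracker.getFinalStats()
    }
    job.processStats(perTaskStats)
    job.totalRows
  }

  def main(args: Array[String]): Unit = {
    val total = simulate(Seq(Seq(Seq(1, "a"), Seq(2, "b")), Seq(Seq(3, "c"))))
    println(s"total rows written: $total") // prints "total rows written: 3"
  }
}
```

    In this sketch the job-level tracker is the only object that would need to cross the
driver/executor boundary before the write, while the task-level tracker keeps mutable state
local to one task and hands back an immutable, `Serializable` result.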
    ## How was this patch tested?
    Existing tests for SPARK-20703 exercise the new code: `hive/SQLMetricsSuite`, `sql/JavaDataFrameReaderWriterSuite`,

You can merge this pull request into a Git repository by running:

    $ git pull write-stats-tracker-api

Alternatively you can review and apply these changes as the patch at:

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18884
commit 67e333e7abfd96b8f80bb0a128088d70f995d864
Author: Adrian Ionescu <>
Date:   2017-08-07T12:57:02Z


commit 176726e7139121d0ffc9d0817b256b831a8c4fc8
Author: Adrian Ionescu <>
Date:   2017-08-07T14:22:24Z

    tests pass; missing docs

commit 6f402468f72fcbdacc680dcae0fafb9fd340ad9f
Author: Adrian Ionescu <>
Date:   2017-08-07T19:14:49Z

    newPartition() takes InternalRow instead of String

commit e6ab459501d70180d53a41dff69bdc13157df5a5
Author: Adrian Ionescu <>
Date:   2017-08-08T12:56:54Z

    bug fix + docs

commit 3665f2fb4331012a022e9ae70cbe3d480ab8dcd3
Author: Adrian Ionescu <>
Date:   2017-08-08T14:51:36Z



