spark-reviews mailing list archives

From yangw1234 <>
Subject [GitHub] spark pull request #15096: [SPARK-17537] [SQL] Reading parquet schema from d...
Date Wed, 14 Sep 2016 15:12:03 GMT
GitHub user yangw1234 opened a pull request:

    [SPARK-17537] [SQL] Reading parquet schema from driver directly when there is only one
file to touch

    ## What changes were proposed in this pull request?
    `"parquet/dir")` issues a Spark job to read the Parquet schema. When
`spark.sql.parquet.mergeSchema` is set to false (the default value), there is often only
one file to read, so a Spark job is unnecessary. In this case, we can read the schema
directly on the driver instead. In my environment, this can reduce schema inference
latency from several hundred milliseconds to around ten milliseconds.
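
The decision described above can be sketched roughly as follows. This is an illustrative Python sketch, not the patch's actual Scala code: `infer_schema`, `read_on_driver`, and `read_with_job` are hypothetical names, not Spark APIs, and the two readers are passed in as plain callables so the branching logic stands on its own.

```python
from typing import Callable, Sequence, TypeVar

Schema = TypeVar("Schema")


def infer_schema(
    files_to_touch: Sequence[str],
    read_on_driver: Callable[[str], Schema],
    read_with_job: Callable[[Sequence[str]], Schema],
) -> Schema:
    """Pick the cheapest way to read a schema (hypothetical helper).

    files_to_touch: the files whose footers must be read. With schema
        merging disabled this is often a single file.
    read_on_driver: assumed to read one footer locally, with no job.
    read_with_job: assumed to launch a distributed job over many files.
    """
    if not files_to_touch:
        raise ValueError("no files to read a schema from")
    if len(files_to_touch) == 1:
        # Only one file to touch: skip the job scheduling overhead.
        return read_on_driver(files_to_touch[0])
    # Several files: keep the existing distributed path.
    return read_with_job(files_to_touch)


if __name__ == "__main__":
    # Stub readers standing in for real footer-reading code.
    schema = infer_schema(
        ["part-00000.parquet"],
        read_on_driver=lambda path: f"schema read on driver from {path}",
        read_with_job=lambda paths: f"schema read by a job over {len(paths)} files",
    )
    print(schema)
```

The single-file check is the only condition taken from the description; how the list of files to touch is chosen is left outside the sketch.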
    ## How was this patch tested?
    manually tested

You can merge this pull request into a Git repository by running:

    $ git pull mergeSchema_2.0

Alternatively you can review and apply these changes as the patch at:

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #15096
commit 91081bccf4326e8261dd41035c199fbd8f12c69d
Author: wangyang <>
Date:   2016-09-14T13:12:24Z

    opt merge schema

commit 9b610c22e8fedbf61b98943ca9a8b3ba08320c94
Author: wangyang <>
Date:   2016-09-14T13:20:49Z

    fix typo

