spark-reviews mailing list archives

From concretevitamin <...@git.apache.org>
Subject [GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...
Date Mon, 14 Jul 2014 18:18:32 GMT
GitHub user concretevitamin opened a pull request:

    https://github.com/apache/spark/pull/1408

    [SPARK-2443][SQL] Fix slow read from partitioned tables

    This fix achieves a performance boost comparable to that of [PR #1390](https://github.com/apache/spark/pull/1390)
by moving an array update and the deserializer initialization out of a potentially very long loop.
Suggested by @yhuai. The results below are updated for this fix.
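
To illustrate the idea only (this is not the code in the patch, which is Scala inside Spark SQL), here is a minimal Python sketch of hoisting loop-invariant work. `CountingDeserializer`, `read_slow`, and `read_fast` are hypothetical names invented for this example, and the tab-separated row format is an assumption.

```python
class CountingDeserializer:
    """Hypothetical stand-in for a costly-to-initialize deserializer."""
    instances = 0  # counts how many times initialization runs

    def __init__(self):
        CountingDeserializer.instances += 1

    def deserialize(self, raw):
        # Assumed row format: "key<TAB>value"
        return raw.split("\t", 1)


def read_slow(rows, partition_values):
    """Repeats loop-invariant work for every row."""
    row = [None, None] + [None] * len(partition_values)
    out = []
    for raw in rows:
        deser = CountingDeserializer()           # re-initialized per row
        for i, v in enumerate(partition_values):
            row[2 + i] = v                       # same array update per row
        row[0], row[1] = deser.deserialize(raw)
        out.append(tuple(row))
    return out


def read_fast(rows, partition_values):
    """Does the invariant work once, before the loop."""
    deser = CountingDeserializer()               # initialized once
    row = [None, None] + list(partition_values)  # partition slots filled once
    out = []
    for raw in rows:
        row[0], row[1] = deser.deserialize(raw)  # only per-row fields change
        out.append(tuple(row))
    return out
```

Both functions return the same rows; the difference is that the per-row cost in `read_fast` no longer includes initialization or the partition-column writes, which matters when the loop runs over millions of rows.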
    
    ## Benchmarks
    I generated a local text file with 10M rows of simple key-value pairs and loaded the data
as a table through Hive. Results were obtained on my local machine using hive/console.
Each cell reports the end-to-end time, with the Spark job time in parentheses.
    
    Without the fix:
    
    Type | Non-partitioned | Partitioned (1 part)
    ------------ | ------------ | -------------
    First run | 9.52s end-to-end (1.64s Spark job) | 36.6s (28.3s)
    Stabilized runs | 1.21s (1.18s) | 27.6s (27.5s)
    
    With this fix:
    
    Type | Non-partitioned | Partitioned (1 part)
    ------------ | ------------ | -------------
    First run | 9.57s (1.46s) | 11.0s (1.69s)
    Stabilized runs | 1.13s (1.10s) | 1.23s (1.19s)
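
For reference, the speedup on the partitioned table can be worked out from the two tables above. The short Python sketch below only redoes that arithmetic; the `timings` layout and the `speedups` helper are hypothetical names for this example, and the numbers are copied from the partitioned columns.

```python
# (end-to-end seconds, Spark job seconds) for the partitioned table
timings = {
    "first run":       {"before": (36.6, 28.3), "after": (11.0, 1.69)},
    "stabilized runs": {"before": (27.6, 27.5), "after": (1.23, 1.19)},
}


def speedups(timings):
    """Return {label: (end-to-end speedup, Spark job speedup)}."""
    result = {}
    for label, t in timings.items():
        end_to_end = t["before"][0] / t["after"][0]
        spark_job = t["before"][1] / t["after"][1]
        result[label] = (end_to_end, spark_job)
    return result


for label, (e2e, job) in speedups(timings).items():
    print(f"{label}: {e2e:.1f}x end-to-end, {job:.1f}x Spark job")
```

On these figures, stabilized runs on the partitioned table are roughly 22x faster end-to-end with the fix, while the non-partitioned timings are essentially unchanged.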
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/concretevitamin/spark slow-read-2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1408.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1408
    
----
commit d86e437218f99179934ccd9b4d5d89c02b09459d
Author: Zongheng Yang <zongheng.y@gmail.com>
Date:   2014-07-14T18:03:07Z

    Move update & initialization out of potentially long loop.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
