spark-dev mailing list archives

From Dong Joon Hyun <dh...@hortonworks.com>
Subject Re: Faster Spark on ORC with Apache ORC
Date Fri, 12 May 2017 17:49:17 GMT
Hi,

I have been wondering how much more Apache Spark 2.2.0 has improved.

This is the prior record from the source code:


    Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
    SQL Single Int Column Scan:         Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -------------------------------------------------------------------------------------------
    SQL Parquet Vectorized                    215 /  262         73.0          13.7       1.0X
    SQL Parquet MR                           1946 / 2083          8.1         123.7       0.1X


So, I got a similar (but slower) machine and ran ParquetReadBenchmark on it.
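For reference, the micro-benchmark can be run from a Spark source checkout roughly like this (the exact sbt invocation and class name are assumptions based on the usual layout of Spark's benchmark suites; they may differ between versions):

```shell
# Run ParquetReadBenchmark from a Spark source tree via sbt
# (invocation shape is an assumption, not taken from this thread):
build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.ParquetReadBenchmark"
```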

Apache Spark has indeed improved considerably. Interestingly, though, the MR reader has improved even more.


    Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.12.4
    Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz

    SQL Single Int Column Scan:              Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    SQL Parquet Vectorized                         102 /  123        153.7           6.5       1.0X
    SQL Parquet MR                                 409 /  436         38.5          26.0       0.3X



For ORC, my PR ( https://github.com/apache/spark/pull/17924 ) produces the following results.


    Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.12.4
    Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz

    SQL Single Int Column Scan:              Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    SQL ORC Vectorized                             147 /  153        107.3           9.3       1.0X
    SQL ORC MR                                     338 /  369         46.5          21.5       0.4X
    HIVE ORC MR                                    408 /  424         38.6          25.9       0.4X


Given that this is an initial PR without optimization, ORC vectorization already seems to be catching up.


Bests,
Dongjoon.


From: Dongjoon Hyun <dhyun@hortonworks.com>
Date: Tuesday, May 9, 2017 at 6:15 PM
To: "dev@spark.apache.org" <dev@spark.apache.org>
Subject: Faster Spark on ORC with Apache ORC

Hi, All.

Apache Spark has always been a fast and general engine, and
since SPARK-2883, Spark has supported Apache ORC inside the `sql/hive` module with a Hive dependency.

With Apache ORC 1.4.0 (released yesterday), we can make Spark on ORC faster and get some benefits.

    - Speed: Use Spark's `ColumnarBatch` and ORC's `RowBatch` together, which means full vectorization support.

    - Stability: Apache ORC 1.4.0 already includes many fixes, and we can rely on the ORC community's effort going forward.

    - Usability: Users can use the ORC data source without the Hive module (`-Phive`).

    - Maintainability: Reduce the Hive dependency and eventually remove some old legacy code from the `sql/hive` module.
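Concretely, with ORC available as a plain data source, usage through the standard DataFrame reader/writer would look roughly like this (a sketch; the path and column names are made up, and the point is only that no `enableHiveSupport()` or `-Phive` build is needed):

```scala
import org.apache.spark.sql.SparkSession

// Build a session WITHOUT Hive support -- with an ORC data source in
// `sql/core`, reading and writing ORC no longer requires the Hive module.
val spark = SparkSession.builder()
  .appName("orc-example")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Write a small DataFrame as ORC (path is hypothetical).
val df = Seq((1, "a"), (2, "b")).toDF("id", "value")
df.write.mode("overwrite").orc("/tmp/orc-example")

// Read it back through the same data source API.
val loaded = spark.read.orc("/tmp/orc-example")
loaded.show()
```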

As a first step, I made a PR adding a new ORC data source into `sql/core` module.

https://github.com/apache/spark/pull/17924  (+ 3,691 lines, -0)

Could you give some opinions on this approach?

Bests,
Dongjoon.