hive-dev mailing list archives

From "Kiran Lonikar (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-7333) Create RDD translator, translating Hive Tables into Spark RDDs [Spark Branch]
Date Thu, 13 Nov 2014 06:32:34 GMT

    [ https://issues.apache.org/jira/browse/HIVE-7333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14209358#comment-14209358 ]

Kiran Lonikar commented on HIVE-7333:
-------------------------------------

Thanks. Considering what Reynold said, I looked into the Spark SQL docs. See https://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory

It says that caching in columnar format (like the one Reynold was alluding to) is enabled by
calling cacheTable on the SchemaRDD. I think the same is true of the SQL interface's "CACHE TABLE
tableName" command.

I think you can re-run your performance tests after caching the tables this way.

I think looking at the code of SchemaRDD.parquetFile may also help with reading multiple rows
at the same time, so that performance improves even when reading.

Using vectorization has another benefit: it can run on GPUs.

> Create RDD translator, translating Hive Tables into Spark RDDs [Spark Branch]
> -----------------------------------------------------------------------------
>
>                 Key: HIVE-7333
>                 URL: https://issues.apache.org/jira/browse/HIVE-7333
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Xuefu Zhang
>            Assignee: Rui Li
>              Labels: Spark-M1
>
> Please refer to the design specification.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
