hive-issues mailing list archives

From "Jimmy Xiang (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-8851) Broadcast files for small tables via SparkContext.addFile() and SparkFiles.get() [Spark Branch]
Date Fri, 06 Mar 2015 22:19:38 GMT

     [ https://issues.apache.org/jira/browse/HIVE-8851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jimmy Xiang updated HIVE-8851:
------------------------------
    Attachment: HIVE-8851.2-spark.patch

Attached patch v2 that uses the add-folder feature in Spark 1.3.

> Broadcast files for small tables via SparkContext.addFile() and SparkFiles.get() [Spark Branch]
> -----------------------------------------------------------------------------------------------
>
>                 Key: HIVE-8851
>                 URL: https://issues.apache.org/jira/browse/HIVE-8851
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Xuefu Zhang
>            Assignee: Jimmy Xiang
>             Fix For: spark-branch
>
>         Attachments: HIVE-8851.1-spark.patch, HIVE-8851.2-spark.patch
>
>
> Currently, files generated by SparkHashTableSinkOperator for small tables are written
directly to HDFS with a high replication factor. When a map join happens, the map join operator
loads these files into hash tables. Since multiple partitions can be processed
on the same worker node, reading the same set of files multiple times is not ideal. This can be
improved by calling SparkContext.addFile() on these files and using SparkFiles.get()
to download them to the worker node just once.
> Please note that SparkFiles.get() is a static method. Code invoking this method needs
to be in a static method. This calling method needs to be synchronized because it may be
called from different threads.
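
A minimal sketch of the "fetch once per worker, from a synchronized static method" pattern described above. It does not depend on Spark so that it runs on its own: fetchToLocal() is a hypothetical stand-in for the SparkFiles.get() lookup, and the class name, file name, and "local/" path prefix are illustrative assumptions, not taken from the patch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

public class SmallTableFileCache {
    // Counts how many times the simulated fetch actually ran.
    static final AtomicInteger downloads = new AtomicInteger();

    // File name -> local path, shared by every task thread in this JVM.
    private static final Map<String, String> localPaths = new HashMap<>();

    // Hypothetical stand-in for the SparkFiles.get(fileName) lookup.
    static String fetchToLocal(String fileName) {
        downloads.incrementAndGet();
        return "local/" + fileName; // placeholder path, for illustration only
    }

    // Static and synchronized: concurrent tasks on the same worker
    // trigger at most one fetch per file name.
    static synchronized String localPathFor(String fileName) {
        String path = localPaths.get(fileName);
        if (path == null) {
            path = fetchToLocal(fileName);
            localPaths.put(fileName, path);
        }
        return path;
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulate several partitions processed on one worker node.
        Thread[] tasks = new Thread[4];
        for (int i = 0; i < tasks.length; i++) {
            tasks[i] = new Thread(() -> localPathFor("small_table.hashtable"));
            tasks[i].start();
        }
        for (Thread t : tasks) {
            t.join();
        }
        // prints "1 download(s) for 4 task threads"
        System.out.println(downloads.get() + " download(s) for " + tasks.length + " task threads");
    }
}
```

In a real job, the driver would first register the file with SparkContext.addFile(), and the stand-in would be replaced by the SparkFiles.get() call; the synchronized static wrapper is what keeps concurrent tasks from repeating the work.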



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
