spark-issues mailing list archives

From "Xuefu Zhang (JIRA)" <>
Subject [jira] [Commented] (SPARK-4290) Provide an equivalent functionality of distributed cache as MR does
Date Fri, 07 Nov 2014 02:59:33 GMT


Xuefu Zhang commented on SPARK-4290:

Hi [~rxin], by "out of box", do you mean org.apache.hadoop.filecache.DistributedCache [1]?
This is a MapReduce client class, which is used when you submit an MR job. It basically tells
the MR framework that your job needs these files put in the distributed cache in order to run.
The MR framework then copies these files to the local file system of the tasks, and each task
can access the local files via symlinks.
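
To make the mechanism concrete, here is a rough sketch of the copy-then-symlink step described above, in plain Python. It is not Hadoop or Spark code: the names localize, run_task and demo, the directory layout, and the file lookup.txt are all made up for illustration, a temporary directory stands in for the shared file system, and it assumes a POSIX system where os.symlink works without extra privileges.

```python
import os
import shutil
import tempfile


def localize(cache_files, node_cache_dir, task_work_dir):
    """Copy each file into a node-local cache dir, then symlink it into the task's work dir.

    Hypothetical helper: it only imitates what the framework does before a task starts.
    """
    os.makedirs(node_cache_dir, exist_ok=True)
    os.makedirs(task_work_dir, exist_ok=True)
    links = {}
    for src in cache_files:
        name = os.path.basename(src)
        local_copy = os.path.join(node_cache_dir, name)
        if not os.path.exists(local_copy):
            # One copy per node; later tasks on the same node reuse it.
            shutil.copy(src, local_copy)
        link = os.path.join(task_work_dir, name)
        if not os.path.lexists(link):
            os.symlink(local_copy, link)
        links[name] = link
    return links


def run_task(task_work_dir):
    """The task reads the file by its plain name, unaware of where the real copy lives."""
    with open(os.path.join(task_work_dir, "lookup.txt")) as f:
        return f.read()


def demo():
    root = tempfile.mkdtemp()
    shared = os.path.join(root, "shared")  # stands in for the shared file system
    os.makedirs(shared)
    src = os.path.join(shared, "lookup.txt")
    with open(src, "w") as f:
        f.write("k1=v1\n")
    task_dir = os.path.join(root, "task_0")
    # Localization happens first, so the file is in place before the task runs.
    links = localize([src], os.path.join(root, "node_cache"), task_dir)
    return links, run_task(task_dir)
```

The point of the sketch is the ordering: something other than the task has to do the copying and linking before the task starts, and that is the part the MR framework provides.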

I don't see how this can be used out of the box. First, a Hive on Spark user may not have the
MR client library. Second, there is no MR framework to do the copying.

Do you have an example of how I might achieve this?


> Provide an equivalent functionality of distributed cache as MR does
> -------------------------------------------------------------------
>                 Key: SPARK-4290
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Xuefu Zhang
> MapReduce allows a client to specify files to be put in the distributed cache for a job, and
> the framework guarantees that the files will be available in the local file system of a node
> where a task of the job runs, before the task actually starts. While this might be achieved
> with YARN via hacks, it's not available in other clusters. It would be nice to have
> equivalent functionality in Spark.
> It would also complement Spark's broadcast variable, which may not be suitable in certain

This message was sent by Atlassian JIRA
