Mailing-List: contact yarn-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: yarn-issues@hadoop.apache.org
Date: Tue, 25 Feb 2014 22:17:27 +0000 (UTC)
From: "Karthik Kambatla (JIRA)" <jira@apache.org>
To: yarn-issues@hadoop.apache.org
Message-ID: <JIRA.12652263.1370980832492.104011.1393366647208@arcas>
In-Reply-To: <JIRA.12652263.1370980832492@arcas>
References: <JIRA.12652263.1370980832492@arcas>
Subject: [jira] [Commented] (YARN-1492) truly shared cache for jars
 (jobjar/libjar)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/YARN-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13912144#comment-13912144 ] 

Karthik Kambatla commented on YARN-1492:
----------------------------------------

Thanks for sharing this, [~ctrezzo]. The document is nicely written. A few comments:
* Would SCM be a single point of failure? If yes, would any one of the following approaches make sense?
** Make SCM an AM. From YARN-896, the only sub-task that affects this would be the delegation tokens. 
** Add an SCMMonitorService to the RM. If SCM is enabled, this service would start the SCM on one of the nodes and monitor it. 
* SCM Cleaner Service - the doc mentions the tension between frequency of cleaner and load on the RM. Can you elaborate? I was of the opinion that the RM is not involved in the caching at all. 
* The cleaner protocol doesn't mention when the cleaner lock is cleared. I assume it is cleared on each exit path.
* Nit: ZK-based store - what would this look like? We can maybe discuss this in the JIRA for the corresponding sub-task.
* More nit-picking: The rationale for not using in-memory and reconstructing seems to come from long-running applications. Given long-running applications don't benefit from the shared cache as much as the shorter ones, is this a huge concern? 
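
To make the cleaner-lock assumption above concrete, here is a minimal sketch of what "cleared on each exit path" could look like within a single process. All names (CleanerLockSketch, runCleanerPass, cleanerLock) are hypothetical and are not taken from the design document; the in-memory flag only stands in for whatever lock the store actually keeps.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch; none of these names come from the design document.
class CleanerLockSketch {
    // Stands in for whatever "a cleaner is running" marker the store keeps.
    private final AtomicBoolean cleanerLock = new AtomicBoolean(false);

    /** Runs one cleaner pass; returns false if another pass already holds the lock. */
    boolean runCleanerPass(Runnable cleanTask) {
        if (!cleanerLock.compareAndSet(false, true)) {
            // Lock was not acquired here, so there is nothing to release.
            return false;
        }
        try {
            cleanTask.run();
            return true;
        } finally {
            // Reached on normal return and when cleanTask throws.
            cleanerLock.set(false);
        }
    }

    boolean isLocked() {
        return cleanerLock.get();
    }
}
```

Note that a finally block only covers exits the process survives. If the process holding the lock crashes, the lock would have to be released some other way (for example, by expiry), which is part of why it would help for the protocol to state this explicitly.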

> truly shared cache for jars (jobjar/libjar)
> -------------------------------------------
>
>                 Key: YARN-1492
>                 URL: https://issues.apache.org/jira/browse/YARN-1492
>             Project: Hadoop YARN
>          Issue Type: New Feature
>    Affects Versions: 2.0.4-alpha
>            Reporter: Sangjin Lee
>            Assignee: Sangjin Lee
>         Attachments: shared_cache_design.pdf, shared_cache_design_v2.pdf, shared_cache_design_v3.pdf, shared_cache_design_v4.pdf, shared_cache_design_v5.pdf
>
>
> Currently there is the distributed cache that enables you to cache jars and files so that attempts from the same job can reuse them. However, sharing is limited with the distributed cache because it is normally on a per-job basis. On a large cluster, sometimes copying of jobjars and libjars becomes so prevalent that it consumes a large portion of the network bandwidth, not to speak of defeating the purpose of "bringing compute to where data is". This is wasteful because in most cases code doesn't change much across many jobs.
> I'd like to propose and discuss feasibility of introducing a truly shared cache so that multiple jobs from multiple users can share and cache jars. This JIRA is to open the discussion.

