Mailing-List: contact yarn-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: yarn-issues@hadoop.apache.org
Date: Fri, 14 Feb 2014 21:58:33 +0000 (UTC)
From: "Jason Lowe (JIRA)" <jira@apache.org>
To: yarn-issues@hadoop.apache.org
Message-ID: <JIRA.12652263.1370980832492.46266.1392415113132@arcas>
In-Reply-To: <JIRA.12652263.1370980832492@arcas>
References: <JIRA.12652263.1370980832492@arcas>
Subject: [jira] [Commented] (YARN-1492) truly shared cache for jars
 (jobjar/libjar)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/YARN-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13902003#comment-13902003 ] 

Jason Lowe commented on YARN-1492:
----------------------------------

Thanks for posting the new design, Chris.  Comments:

- The public localizer will only localize files that are publicly available, but the staging directory is not publicly available.  Clients must upload publicly localized files elsewhere for that to work, but files outside the staging directory won't be cleaned up automatically when the job exits.
- There's a race between the NM uploading the file to the shared cache area and the local dist cache cleaner removing the local file.
- How parallel will the NM upload process be -- does it upload resources serially, both within a container and across containers?
- Is the cleaner running as part of the SCM?  If so, I don't think it's necessary to store the cleaner flag in the persisted state, and leaving it out would mean a bit less traffic to the store while cleaning.
- It might be nice to provide a simpler store setup for the SCM for smaller clusters or those not already using ZK for other things (e.g., HA).  Something like a leveldb store or simple local filesystem storage would suffice since those don't require separate setup.
- The cleaner should handle files that are orphaned in the cache if the NM fails to complete the upload.  It could use a timeout based on the file timestamp or some other mechanism to accomplish this.
- What criteria will clients use to decide if files are public?  As-is this doesn't seem to address the original goals of the JIRA since hardly anything is declared public unless already in a well-known place in HDFS today. I'd like the design to also state any proposed changes to the behavior of the job submitter's handling of the dist cache during job submission if there are any.
- Nit: It should be made clearer that the client cannot notify the SCM that an application is not using a resource until the application has completed, or we risk the cleaner removing the resource while it is still in use by the application.  The client protocol steps read as if the client can submit and then immediately notify the SCM if desired.
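
To make the orphaned-file point above concrete, here is a rough sketch of a timestamp-based check.  It is only an illustration and not taken from the design doc: the function names, the flat cache directory layout, the `committed` set (standing in for whatever the SCM store records as fully uploaded) and the 6 hour timeout are all assumptions, and it uses the local filesystem rather than HDFS to stay self-contained.

```python
import os
import time

# Hypothetical threshold: how long an uncommitted file may sit untouched
# before the cleaner treats it as an orphan from a failed upload.
ORPHAN_TIMEOUT_SECS = 6 * 60 * 60


def find_orphans(cache_dir, committed, now=None, timeout=ORPHAN_TIMEOUT_SECS):
    """Return sorted names of stale files in cache_dir that were never committed.

    committed: set of file names the store knows as fully uploaded (assumed).
    """
    now = time.time() if now is None else now
    orphans = []
    for name in sorted(os.listdir(cache_dir)):
        path = os.path.join(cache_dir, name)
        # Skip committed entries and anything that is not a regular file.
        if name in committed or not os.path.isfile(path):
            continue
        # Only files older than the timeout count as orphans, so an upload
        # that is still in progress is left alone.
        if now - os.path.getmtime(path) > timeout:
            orphans.append(name)
    return orphans


def remove_orphans(cache_dir, committed, now=None, timeout=ORPHAN_TIMEOUT_SECS):
    """Delete the orphans found by find_orphans and return their names."""
    removed = find_orphans(cache_dir, committed, now=now, timeout=timeout)
    for name in removed:
        os.remove(os.path.join(cache_dir, name))
    return removed
```

The timeout needs to be comfortably longer than the slowest expected upload, otherwise the cleaner would race with an NM that is still writing the file.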


> truly shared cache for jars (jobjar/libjar)
> -------------------------------------------
>
>                 Key: YARN-1492
>                 URL: https://issues.apache.org/jira/browse/YARN-1492
>             Project: Hadoop YARN
>          Issue Type: New Feature
>    Affects Versions: 2.0.4-alpha
>            Reporter: Sangjin Lee
>            Assignee: Sangjin Lee
>         Attachments: shared_cache_design.pdf, shared_cache_design_v2.pdf, shared_cache_design_v3.pdf, shared_cache_design_v4.pdf, shared_cache_design_v5.pdf
>
>
> Currently there is the distributed cache that enables you to cache jars and files so that attempts from the same job can reuse them. However, sharing is limited with the distributed cache because it is normally on a per-job basis. On a large cluster, sometimes copying of jobjars and libjars becomes so prevalent that it consumes a large portion of the network bandwidth, not to speak of defeating the purpose of "bringing compute to where data is". This is wasteful because in most cases code doesn't change much across many jobs.
> I'd like to propose and discuss feasibility of introducing a truly shared cache so that multiple jobs from multiple users can share and cache jars. This JIRA is to open the discussion.

