From: "Robert Joseph Evans (Commented) (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Reply-To: mapreduce-issues@hadoop.apache.org
Date: Mon, 7 Nov 2011 18:20:52 +0000 (UTC)
Message-ID: <418285673.7369.1320690052260.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <1043974362.44980.1320135992213.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (MAPREDUCE-3323) Add new interface for Distributed Cache, which special for Map or Reduce,but not Both.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13145669#comment-13145669 ]

Robert Joseph Evans commented on MAPREDUCE-3323:
------------------------------------------------

I have read through all of your patches and I have a few comments.

# I don't really like the name current.task.type.internal. It would be better to prefix it with mapreduce.
# I think it is slightly faster to change {code}fileURI.toArray(new URI[0]){code} to {code}fileURI.toArray(new URI[fileURI.size()]){code}, but this is just a nit.
# There are no tests in the patches. I know you have done some manual testing, but adding or updating the unit tests is important for this to be accepted.
# Have you tested add(Archive|File)ToClassPathFor(Map|Reduce)? They set "mapred.job.classpath.(archives|files)", so if you use these methods, some of the entries in "mapred.job.classpath.(archives|files)" will not be valid.
# Why are you setting CACHE_(FILE|ARCHIVE)_FOR_(MAP|REDUCE)? It seems like you could just go off of the existence of CACHE_(ARCHIVES|FILES)_(MAP|REDUCE).
# Could you please add the new user-facing configuration keys to mapred-default.xml so that they are documented?

> Add new interface for Distributed Cache, which special for Map or Reduce,but not Both.
> ---------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3323
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3323
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: distributed-cache, tasktracker
>    Affects Versions: 0.20.203.0
>            Reporter: Azuryy(Chijiong)
>             Fix For: 0.20.203.0
>
>         Attachments: DistributedCache.patch, GenericOptionsParser.patch, JobClient.patch, TaskDistributedCacheManager.patch, TaskTracker.patch
>
>
> We put some files into the Distributed Cache, but sometimes only Map or only Reduce uses these cached files, not both. The TaskTracker nevertheless always downloads the cached files from HDFS, and if some of the cached files are fairly large, this is time-consuming.
> So this patch adds some new APIs to DistributedCache.java, as follows:
> addArchiveToClassPathForMap
> addArchiveToClassPathForReduce
> addFileToClassPathForMap
> addFileToClassPathForReduce
> addCacheFileForMap
> addCacheFileForReduce
> addCacheArchiveForMap
> addCacheArchiveForReduce
> The new APIs do not affect the original interface. Users can use these features in either of the following two ways:
> 1)
> hadoop job **** -files file1 -mapfiles file2 -reducefiles file3 -archives arc1 -maparchives arc2 -reducearchives arc3
> 2)
> DistributedCache.addCacheFile(conf, file1);
> DistributedCache.addCacheFileForMap(conf, file2);
> DistributedCache.addCacheFileForReduce(conf, file3);
> DistributedCache.addCacheArchive(conf, arc1);
> DistributedCache.addCacheArchiveForMap(conf, arc2);
> DistributedCache.addCacheArchiveForReduce(conf, arc3);
> These two ways have the same result. That means:
> You put six files into the distributed cache: file1 ~ file3 and arc1 ~ arc3,
> but file1 and arc1 are cached for both map and reduce;
> file2 and arc2 are cached only for map;
> file3 and arc3 are cached only for reduce.

--
This message is automatically generated by JIRA.
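For illustration only, here is a minimal sketch of what review comment 5 seems to suggest: decide which cache files a task localizes from whether a per-task-type key is present, rather than from a separate boolean flag. It is not taken from the attached patches. The class name, the three key names, and the plain Map used in place of a Hadoop Configuration are all assumptions made so that the example runs on its own. It also uses the pre-sized toArray form mentioned in comment 2.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TaskCacheSelection {
    // Hypothetical key names, chosen only for this illustration.
    public static final String CACHE_FILES = "example.cache.files";
    public static final String CACHE_FILES_MAP = "example.cache.files.map";
    public static final String CACHE_FILES_REDUCE = "example.cache.files.reduce";

    public enum TaskType { MAP, REDUCE }

    /**
     * Returns the files a task of the given type would localize:
     * the shared files plus the files registered only for that task type.
     * A missing per-type key simply contributes nothing, so no extra flag is needed.
     */
    public static URI[] filesToLocalize(Map<String, String> conf, TaskType type) {
        List<URI> fileURI = new ArrayList<>();
        addAll(fileURI, conf.get(CACHE_FILES));
        String perTypeKey = (type == TaskType.MAP) ? CACHE_FILES_MAP : CACHE_FILES_REDUCE;
        addAll(fileURI, conf.get(perTypeKey));
        return fileURI.toArray(new URI[fileURI.size()]);
    }

    // Parses a comma-separated list of URIs; null or empty input adds nothing.
    private static void addAll(List<URI> out, String commaSeparated) {
        if (commaSeparated == null || commaSeparated.isEmpty()) {
            return;
        }
        for (String s : commaSeparated.split(",")) {
            out.add(URI.create(s.trim()));
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(CACHE_FILES, "hdfs://nn/cache/file1");
        conf.put(CACHE_FILES_MAP, "hdfs://nn/cache/file2");
        conf.put(CACHE_FILES_REDUCE, "hdfs://nn/cache/file3");

        // A map task sees file1 and file2; a reduce task sees file1 and file3.
        for (URI u : filesToLocalize(conf, TaskType.MAP)) {
            System.out.println("map: " + u);
        }
        for (URI u : filesToLocalize(conf, TaskType.REDUCE)) {
            System.out.println("reduce: " + u);
        }
    }
}
```

With this shape, a job that never registers map-only or reduce-only files behaves exactly as before, which matches the issue's statement that the new APIs do not affect the original interface.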