hadoop-common-user mailing list archives

From Amogh Vasekar <am...@yahoo-inc.com>
Subject RE: Distributed cache - are files unique per job?
Date Wed, 30 Sep 2009 05:22:37 GMT
I believe the framework checks timestamps on HDFS to decide whether an already-localized copy of the file is valid or invalid, since the archived files are not cleaned up until a certain disk-usage limit is reached, and no APIs for cleanup are available. There was a thread about this on the list some time back.
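The timestamp check described above can be sketched roughly as follows. This is a hypothetical model of the idea, not the actual framework code; the class and method names are made up for illustration:

```java
import java.util.HashMap;
import java.util.Map;

public class CacheValiditySketch {
    // Localized copies, keyed by cache path -> HDFS modification time at fetch.
    private final Map<String, Long> localizedTimestamps = new HashMap<>();

    // Record that a file was localized from HDFS at the given modification time.
    public void recordLocalization(String cachePath, long hdfsTimestamp) {
        localizedTimestamps.put(cachePath, hdfsTimestamp);
    }

    // True if the already-localized copy can be reused; false means the HDFS
    // file changed (or was never fetched), so it must be re-localized.
    public boolean isCopyValid(String cachePath, long currentHdfsTimestamp) {
        Long fetchedAt = localizedTimestamps.get(cachePath);
        return fetchedAt != null && fetchedAt == currentHdfsTimestamp;
    }

    public static void main(String[] args) {
        CacheValiditySketch cache = new CacheValiditySketch();
        cache.recordLocalization("/user/job1/A", 1000L);
        // Same path, same HDFS timestamp: reuse the localized copy.
        System.out.println(cache.isCopyValid("/user/job1/A", 1000L)); // true
        // Same path, but the HDFS file was rewritten: must re-fetch.
        System.out.println(cache.isCopyValid("/user/job1/A", 2000L)); // false
    }
}
```

Under this scheme, two jobs distributing different files under the same name would still be distinguished as long as their HDFS source files carry different modification times.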


-----Original Message-----
From: Allen Wittenauer [mailto:awittenauer@linkedin.com] 
Sent: Tuesday, September 29, 2009 10:41 PM
To: common-user@hadoop.apache.org
Subject: Re: Distributed cache - are files unique per job?

On 9/29/09 2:55 AM, "Erik Forsberg" <forsberg@opera.com> wrote:
> If I distribute files using the Distributed Cache (-archives option),
> are they guaranteed to be unique per job, or is there a risk that if I
> distribute a file named A with job 1, job 2 which also distributes a
> file named A will read job 1's file?

From my understanding, at one point in time there was a 'shortcut' in the
system that did exactly what you fear: if the same cache file name was
specified by multiple jobs, they'd get the same file, as it was assumed they
were the same file.  I *think* this has been fixed, though.

[Needless to say, for automated jobs that push security keys through a cache
file, this is bad.]
