hadoop-common-user mailing list archives

From Robert Evans <ev...@yahoo-inc.com>
Subject Re: Passing data files via the distributed cache
Date Mon, 28 Nov 2011 16:34:44 GMT
There is currently no way to explicitly delete the data from the cache when you are done.  It is
garbage-collected when the cache starts to fill up (in LRU order if you are on a newer release).
DistributedCache.addCacheFile modifies the JobConf behind the scenes for you.  If you want to
dig into the details of what it is doing, you can look at its source code.
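To make the pattern concrete, here is a minimal sketch against the old (0.20-era) mapred API that the thread is using. The HDFS path, class name, and method names other than the DistributedCache/JobConf calls are hypothetical, and this only runs in the context of a configured Hadoop job, not standalone:

```java
// Sketch only: assumes Hadoop 0.20/1.x (org.apache.hadoop.mapred API).
// The path "/user/andy/lookup.dat" and class/method names are hypothetical.
import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class CacheExample {

    // Called on the client before submitting the job: records the HDFS
    // file's URI in the JobConf so the framework ships it to every task node.
    public static void configureJob(JobConf conf) throws Exception {
        DistributedCache.addCacheFile(new URI("/user/andy/lookup.dat"), conf);
    }

    // Called from a Mapper's configure(JobConf) method on the task node:
    // returns the node-local copies the framework materialized before the
    // task started, so the mapper can read them as ordinary local files.
    public static Path firstCachedFile(JobConf conf) throws Exception {
        Path[] localFiles = DistributedCache.getLocalCacheFiles(conf);
        return localFiles[0];
    }
}
```

On Andy's second question: small values can indeed go straight into the JobConf with conf.set("some.key", value) and be read back in the mapper with conf.get("some.key"); the distributed cache is the better fit once the data is more than a few kilobytes, since the JobConf is serialized in full to every task.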

--Bobby Evans

On 11/28/11 4:46 AM, "Andy Doddington" <andy@doddington.net> wrote:

Thanks for that link Prashant - very useful.

Two brief follow-up questions:

1) Having put data in the cache, I would like to be a good citizen by deleting the data from
   the cache once I've finished - how do I do that?
2) Would it be simpler to pass the data as a value in the jobConf object?


        Andy D.

On 25 Nov 2011, at 12:14, Prashant Kommireddi wrote:

> I believe you want to ship data to each node in your cluster before MR
> begins so the mappers can access files local to their machine. Hadoop
> tutorial on YDN has some good info on this.
> http://developer.yahoo.com/hadoop/tutorial/module5.html#auxdata
> -Prashant Kommireddi
> On Fri, Nov 25, 2011 at 1:05 AM, Andy Doddington <andy@doddington.net>wrote:
>> I have a series of mappers that I would like to be passed data using the
>> distributed cache mechanism. At the
>> moment, I am using HDFS to pass the data, but this seems wasteful to me,
>> since they are all reading the same data.
>> Is there a piece of example code that shows how data files can be placed
>> in the cache and accessed by mappers?
>> Thanks,
>>       Andy Doddington
