hadoop-common-user mailing list archives

From Amit Mittal <amitmitt...@gmail.com>
Subject Re: DistributedCache deprecated
Date Thu, 30 Jan 2014 13:18:16 GMT
Hi Prav,

You are correct, thanks for the explanation. As per the link below, I can see
that Job's method internally calls DistributedCache itself after ensuring
the job state; I think that might be the reason. Here is one of the methods
(Job.java, line 1067):

http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/org.apache.hadoop/hadoop-mapreduce-client-core/2.0.0-cdh4.4.0/org/apache/hadoop/mapreduce/Job.java#1067

  public void addCacheFile(URI uri) {
    ensureState(JobState.DEFINE);
    DistributedCache.addCacheFile(uri, conf);
  }
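
On the task side, as Prav notes below, the cached file still has to be read
explicitly in the mapper or reducer. Here is a minimal sketch of that step
using only the JDK, so it can be tried without a cluster. The class
CacheFileLookup, its method names and the tab-separated file format are all
hypothetical. The rule in localName (URI fragment if present, otherwise the
file name) is my understanding of how the localized file is named in the
task's working directory, so please treat it as an assumption.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of Hadoop): reads a small lookup file shipped as a cache file. */
final class CacheFileLookup {

    private CacheFileLookup() {}

    /**
     * Assumed local name of a cache file: the URI fragment if one was given
     * (e.g. "...#lookup"), otherwise the last segment of the path.
     * Expects a hierarchical URI such as hdfs://host/dir/file.
     */
    static String localName(URI cacheUri) {
        String fragment = cacheUri.getFragment();
        if (fragment != null && !fragment.isEmpty()) {
            return fragment;
        }
        String path = cacheUri.getPath();
        return path.substring(path.lastIndexOf('/') + 1);
    }

    /** Parses "key<TAB>value" lines into a map; lines without a key and a tab are skipped. */
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> table = new HashMap<>();
        for (String line : lines) {
            int tab = line.indexOf('\t');
            if (tab <= 0) {
                continue; // blank line, no tab, or empty key
            }
            table.put(line.substring(0, tab), line.substring(tab + 1));
        }
        return table;
    }

    /** Reads the localized file from local disk; nothing is in memory until this is called. */
    static Map<String, String> load(Path localFile) {
        try {
            return parse(Files.readAllLines(localFile, StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException("Cannot read cache file " + localFile, e);
        }
    }
}
```

In a real job, the driver would register the file with job.addCacheFile(uri),
and the mapper's setup() could take the URIs from context.getCacheFiles() and
pass Paths.get(CacheFileLookup.localName(uri)) to load(). I have not included
that wiring here because it needs the Hadoop jars on the classpath.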




On Thu, Jan 30, 2014 at 6:19 PM, praveenesh kumar <praveenesh@gmail.com> wrote:

> Hi Amit,
> Side data distribution is a different concept altogether. It is when you
> set custom (key, value) pairs on the job configuration through the Job
> object so that you can read them in your mappers/reducers. It is good when
> you want to pass some small piece of information to your mappers/reducers,
> such as extra command-line arguments that they require.
> We were not discussing side data distribution at all.
> The question was: now that DistributedCache is deprecated, where can we
> find the methods that DistributedCache used to provide?
> If you see the DistributedCache class in MR v1 -
> https://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/filecache/DistributedCache.html
> and compare it with Job class in MR v2 -
> http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/mapreduce/Job.html
> You will see that the methods of the DistributedCache class have been added
> to the Job class. Since DistributedCache is deprecated, my guess was that we
> can use the Job class to work with the distributed cache, using the same
> methods that DistributedCache used to provide.
> Everything else is the same; it is just that you use the Job class to set
> your files for the distributed cache inside your job configuration. I am
> sorry, I don't have a nice article to point to. As I said, I did this as
> part of my own experiment and was able to use it without any issues, which
> is why I suggested it.
> Since most developers are still using MRv1 on Hadoop 2.0, these changes
> have not been highlighted so far. I am hoping new documentation on how to
> use MRv2 will come soon, but if you understand MRv1, I don't see any reason
> why you can't just look around the API a bit and find the relevant classes
> yourself. Again, as I said, I don't have any authoritative source for what
> I am saying; these are just the results of my own experiments, which you
> are most welcome to repeat and play with. Happy coding!
> Regards
> Prav
> On Thu, Jan 30, 2014 at 12:27 PM, Amit Mittal <amitmittal5@gmail.com> wrote:
>> Hi Prav,
>> Yes, you are correct that DistributedCache does not load files into
>> memory. Also, using the job configuration and using DistributedCache are
>> two different approaches. I am going by "Hadoop: The Definitive Guide",
>> Chapter 8, Side Data Distribution (pages 288-295).
>> Since you say that the methods of DistributedCache have now moved to Job,
>> could you please share an article or document on that for my better
>> understanding? It would be a great help.
>> Thanks
>>  Amit
>> On Thu, Jan 30, 2014 at 5:35 PM, praveenesh kumar <praveenesh@gmail.com> wrote:
>>> Hi Amit,
>>> I am not sure how these are linked with DistributedCache. The job
>>> configuration does not upload any data into memory. As far as I am aware
>>> of how DistributedCache works, nothing gets loaded into memory. The
>>> distributed cache just copies the files to the slave nodes so that they
>>> are accessible to mappers/reducers. Usually the location is
>>> ${hadoop.tmp.dir}/${mapred.local.dir}/tasktracker/archive (it varies from
>>> distribution to distribution). You always have to read the files in your
>>> mapper or reducer whenever you want to use them.
>>> What has happened is that the methods of the DistributedCache class have
>>> now been added to the Job class, and I am assuming they won't change how
>>> the distributed cache methods used to work; otherwise there would have
>>> been some articles on that, and I don't see any reason for changing it
>>> either. So everything still works the same way; it is just that you use
>>> the new Job class to use the distributed cache features.
>>> I am not sure exactly which entries you are pointing to. Am I missing
>>> anything here?
>>> Regards
>>> Prav
>>> On Thu, Jan 30, 2014 at 6:12 AM, Amit Mittal <amitmittal5@gmail.com> wrote:
>>>> Hi Mike & Prav,
>>>> Although I am new to Hadoop, I would like to add my 2 cents in case it
>>>> helps.
>>>> We have two ways to distribute shared data: one is the job
>>>> configuration and the other is DistributedCache.
>>>> The job configuration is read by the JT, the TT and the child JVMs, and
>>>> each time the configuration is read, all of its entries are read into
>>>> memory, even if they are not used. So using the job configuration is not
>>>> advised if the data is more than a few kilobytes, and it is not an
>>>> alternative to DistributedCache unless the job configuration is modified
>>>> to address this limitation.
>>>> So I am also curious to know the alternative to the DistributedCache
>>>> class.
>>>> Thanks
>>>> Amit
>>>> On Thu, Jan 30, 2014 at 2:43 AM, Giordano, Michael <
>>>> Michael.Giordano@vistronix.com> wrote:
>>>>>  I noticed that in Hadoop 2.2.0
>>>>> org.apache.hadoop.mapreduce.filecache.DistributedCache has been deprecated.
>>>>> (http://hadoop.apache.org/docs/current/api/deprecated-list.html#class)
>>>>> Is there a class that provides equivalent functionality? My
>>>>> application relies heavily on DistributedCache.
>>>>> Thanks,
>>>>> Mike G.
