hadoop-common-user mailing list archives

From Amit Mittal <amitmitt...@gmail.com>
Subject Re: DistributedCache deprecated
Date Thu, 30 Jan 2014 13:18:16 GMT
Hi Prav,

You are correct, thanks for the explanation. From the link below, I can see
that Job's method internally calls DistributedCache itself (
http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/org.apache.hadoop/hadoop-mapreduce-client-core/2.0.0-cdh4.4.0/org/apache/hadoop/mapreduce/Job.java#Job.addCacheFile%28java.net.URI%29)
after ensuring the job state; I think that might be the reason. Here is one
of the methods:

  public void addCacheFile(URI uri) {
    ensureState(JobState.DEFINE);
    DistributedCache.addCacheFile(uri, conf);
  }
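
To make the pattern concrete, here is a stand-alone sketch in plain Java with no Hadoop on the classpath. It only models what the method above appears to do: check that the job is still being defined, then hand the URI to a static helper that records it in the configuration. MiniConf, MiniDistributedCache, MiniJob and CacheFileDemo are hypothetical names made up for this illustration, and the property name "mapreduce.job.cache.files" is my assumption about where the list is stored, so please check it against your Hadoop version.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for Hadoop's Configuration: a plain string map.
class MiniConf {
    private final Map<String, String> props = new HashMap<>();

    String get(String key) { return props.get(key); }

    void set(String key, String value) { props.put(key, value); }
}

// Hypothetical stand-in for the static DistributedCache helper.
class MiniDistributedCache {
    // Assumed property name; verify it for the Hadoop version you run.
    static final String CACHE_FILES = "mapreduce.job.cache.files";

    // Appends the URI to a comma-separated list kept in the configuration.
    static void addCacheFile(URI uri, MiniConf conf) {
        String existing = conf.get(CACHE_FILES);
        conf.set(CACHE_FILES,
                 existing == null ? uri.toString() : existing + "," + uri);
    }
}

// Hypothetical stand-in for Job: guards the state, then delegates.
class MiniJob {
    enum JobState { DEFINE, RUNNING }

    private JobState state = JobState.DEFINE;
    private final MiniConf conf = new MiniConf();

    private void ensureState(JobState expected) {
        if (state != expected) {
            throw new IllegalStateException(
                "Job in state " + state + " instead of " + expected);
        }
    }

    // Same shape as the method quoted above: state check, then delegation.
    void addCacheFile(URI uri) {
        ensureState(JobState.DEFINE);
        MiniDistributedCache.addCacheFile(uri, conf);
    }

    // After submission the job can no longer be reconfigured.
    void submit() {
        ensureState(JobState.DEFINE);
        state = JobState.RUNNING;
    }

    MiniConf getConfiguration() { return conf; }
}

class CacheFileDemo {
    public static void main(String[] args) {
        MiniJob job = new MiniJob();
        job.addCacheFile(URI.create("hdfs:///lookup/a.txt"));
        job.addCacheFile(URI.create("hdfs:///lookup/b.txt"));
        System.out.println(
            job.getConfiguration().get(MiniDistributedCache.CACHE_FILES));

        job.submit();
        try {
            job.addCacheFile(URI.create("hdfs:///lookup/c.txt"));
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

In this model the only thing the wrapper adds over the static helper is the state check, which would explain why the Job method can replace the deprecated class without changing how the cache itself behaves.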


Thanks
Amit


On Thu, Jan 30, 2014 at 6:19 PM, praveenesh kumar <praveenesh@gmail.com> wrote:

> Hi Amit,
>
> Side data distribution is a different concept altogether. It is when
> you set custom (key, value) pairs using the Job object, so that
> you can use them in your mappers/reducers. It is good when you want to pass
> some small piece of information to your mappers/reducers, like extra
> command-line arguments that the mappers/reducers require.
> We were not discussing side data distribution at all.
>
> The question was: now that DistributedCache is deprecated, where can we find
> the methods that DistributedCache used to provide?
> If you look at the DistributedCache class in MR v1 -
>
> https://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/filecache/DistributedCache.html
>
> and compare it with Job class in MR v2 -
>
> http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/mapreduce/Job.html
>
> You will see that the methods of the DistributedCache class have been added
> to the Job class. Since DistributedCache is deprecated, my guess was that we
> can use the Job class to work with the distributed cache through the same
> methods that DistributedCache used to provide.
>
> Everything else is the same; it's just that you use the Job class to set your
> files for the distributed cache inside your job configuration. I am sorry, I
> don't have any nice article. As I said, I did this as part of my own
> experiment and was able to use it without any issues, which is why I
> suggested it.
>
> Since most developers are still using MRv1 on Hadoop 2.0, these changes
> have not been highlighted so far. I am hoping new
> documentation on how to use MRv2 will come soon, but if you understand
> MRv1, I don't see any reason why you can't just move around the API a bit
> and find the classes you want to use by yourself. Again, as
> I said, I have nothing authoritative to back up what I am saying; these are
> just the results of my own experiments, which you are most welcome to
> conduct and play with. Happy Coding..!!
>
> Regards
> Prav
>
>
>
>
> On Thu, Jan 30, 2014 at 12:27 PM, Amit Mittal <amitmittal5@gmail.com> wrote:
>
>> Hi Prav,
>>
>> Yes, you are correct that DistributedCache does not load files into
>> memory. Also, using the job configuration and using DistributedCache are two
>> different approaches. I am referring to "Hadoop: The Definitive Guide",
>> Chapter 8 > Side Data Distribution (pages 288-295).
>> Since you say that the methods of DistributedCache have now moved to Job,
>> could you please share an article or document on that for my better
>> understanding? It would be a great help.
>>
>> Thanks
>>  Amit
>>
>>
>> On Thu, Jan 30, 2014 at 5:35 PM, praveenesh kumar <praveenesh@gmail.com> wrote:
>>
>>> Hi Amit,
>>>
>>> I am not sure how they are linked with DistributedCache. The job
>>> configuration does not upload any data into memory. As far as I am aware of
>>> how DistributedCache works, nothing gets loaded into memory. The distributed
>>> cache just copies the files to the slave nodes, so that they are accessible
>>> to mappers/reducers. Usually the location is
>>> ${hadoop.tmp.dir}/${mapred.local.dir}/tasktracker/archive (it varies from
>>> distribution to distribution). You always have to read the files in your
>>> mapper or reducer whenever you want to use them.
>>>
>>> What has happened is that the methods of the DistributedCache class have now
>>> been added to the Job class, and I am assuming they won't change how the
>>> distributed cache methods used to work; otherwise there would have
>>> been some nice articles on that, and I don't see any reason to change
>>> it either. So everything still works the same way. It's just that
>>> you use the new Job class to use the distributed cache features.
>>>
>>> I am not sure which entries exactly you are pointing to. Am I missing
>>> anything here?
>>>
>>>
>>> Regards
>>> Prav
>>>
>>>
>>> On Thu, Jan 30, 2014 at 6:12 AM, Amit Mittal <amitmittal5@gmail.com> wrote:
>>>
>>>> Hi Mike & Prav,
>>>>
>>>> Although I am new to Hadoop, I would like to add my 2 cents if that
>>>> helps.
>>>> We have two ways to distribute shared data: one is the job
>>>> configuration and the other is DistributedCache.
>>>> The job configuration is read by the JT, the TT, and the child JVMs, and
>>>> each time the configuration is read, all of its entries are read into
>>>> memory, even if they are not used. So using the job configuration is not
>>>> advised if the data is more than a few kilobytes. It is therefore not an
>>>> alternative to DistributedCache unless the job configuration is modified
>>>> to address this limitation.
>>>> So I am also curious to know the alternative to the DistributedCache
>>>> class.
>>>>
>>>> Thanks
>>>> Amit
>>>>
>>>>
>>>>
>>>> On Thu, Jan 30, 2014 at 2:43 AM, Giordano, Michael <
>>>> Michael.Giordano@vistronix.com> wrote:
>>>>
>>>>>  I noticed that in Hadoop 2.2.0
>>>>> org.apache.hadoop.mapreduce.filecache.DistributedCache has been deprecated.
>>>>>
>>>>>
>>>>>
>>>>> (http://hadoop.apache.org/docs/current/api/deprecated-list.html#class)
>>>>>
>>>>>
>>>>>
>>>>> Is there a class that provides equivalent functionality? My
>>>>> application relies heavily on DistributedCache.
>>>>>
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Mike G.
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>
