hadoop-common-user mailing list archives

From Ioan Eugen Stan <stan.ieu...@gmail.com>
Subject Re: Working with MapFiles
Date Mon, 02 Apr 2012 11:01:26 GMT
Hi Ondrej,

On 02.04.2012 13:00, Ondřej Klimpera wrote:
> Ok, thanks.
> I missed the setup() method because I am using an older version of
> Hadoop, so I suppose the configure() method does the same in Hadoop
> 0.20.203.

Aha, if it's possible, try upgrading. I don't know how well versions 
older than the Hadoop 0.20 branch are supported.

> Now I'm able to load a map file inside the configure() method into a
> MapFile.Reader instance held in a private class variable, and all works
> fine. I'm just wondering whether the MapFile is replicated on HDFS and
> the data are read locally, or whether reading from this file will
> increase network traffic because its data comes from another node in
> the Hadoop cluster.

You could use a method variable instead of a private class variable if 
you load the file. If the MapFile is written to HDFS then yes, it is 
replicated, and you can configure the replication factor at file 
creation (and maybe change it later). If you use DistributedCache, the 
tasks do not read the files from HDFS; the files are copied to the 
mapred.local.dir [1] folder on every node.
The size of that folder is configurable, so it's possible that the data 
will still be available there for the next MR job, but don't rely on 
this.
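
One way to read the "method variable" suggestion, as a rough sketch only: 
for a small text lookup file, open the reader as a local variable, copy 
the entries into an in-memory map once per task, and close the reader 
right away. The code below is plain Java, not Hadoop code; the class name 
LookupLoader and the tab-separated file layout are assumptions made for 
illustration. In a real job, load() would presumably be called from 
configure() with the local path of the cached file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: loads a small tab-separated lookup file into memory. */
class LookupLoader {

    // Filled once per task, then only read while processing records.
    private final Map<String, String> lookup = new HashMap<>();

    /** Reads "key<TAB>value" lines; the reader is local and closed on exit. */
    void load(Path localFile) throws IOException {
        try (BufferedReader reader =
                 Files.newBufferedReader(localFile, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                int tab = line.indexOf('\t');
                if (tab < 0) {
                    continue; // skip lines without a key/value separator
                }
                lookup.put(line.substring(0, tab), line.substring(tab + 1));
            }
        }
    }

    /** Returns the value stored for the key, or null when the key is absent. */
    String get(String key) {
        return lookup.get(key);
    }

    /** Number of entries loaded so far. */
    int size() {
        return lookup.size();
    }
}
```

This only makes sense when the file fits comfortably in the task's heap; 
for a large MapFile, keeping the MapFile.Reader open in a class variable, 
as you do now, is probably the better choice.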

Please read the docs, I may have gotten things wrong. RTFM will save your life ;).

[1] http://developer.yahoo.com/hadoop/tutorial/module5.html#auxdata
[2] https://forums.aws.amazon.com/message.jspa?messageID=152538

> Hopefully the last question to bother you with is whether reading files
> from DistributedCache (a normal text file) is limited to a particular job.
> Before running a job I add a file to DistCache. When getting the file in
> the Reducer implementation, can it access DistCache files from other jobs?
> In other words, what will this command list:
> //Reducer impl.
> public void configure(JobConf job) {
> URI[] distCacheFileUris = DistributedCache.getCacheFiles(job);
> }
> will the distCacheFileUris variable contain only URIs for this job, or
> for any job running on Hadoop cluster?
> Hope it's understandable.
> Thanks.
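
As far as I know, the list of cache files is kept as a comma-separated 
property in each job's own configuration, so getCacheFiles(job) should 
only return the URIs registered for that job. Please check this against 
the docs for your version. The sketch below is a plain-Java model of that 
behaviour, not Hadoop code: a Properties object stands in for JobConf, 
the class name CacheFileModel is made up, and the property name 
"mapred.cache.files" is what I believe the 0.20 branch uses.

```java
import java.net.URI;
import java.util.Properties;

/** Plain-Java model (not Hadoop code): Properties stands in for a JobConf. */
class CacheFileModel {

    // Assumed property name for the per-job list of cache file URIs.
    static final String CACHE_FILES_KEY = "mapred.cache.files";

    /** Appends the URI to the comma-separated list held in this job's config. */
    static void addCacheFile(URI uri, Properties jobConf) {
        String existing = jobConf.getProperty(CACHE_FILES_KEY);
        jobConf.setProperty(CACHE_FILES_KEY,
                existing == null ? uri.toString() : existing + "," + uri);
    }

    /**
     * Parses the list back from this job's config. This model returns an
     * empty array when nothing was registered.
     */
    static URI[] getCacheFiles(Properties jobConf) {
        String value = jobConf.getProperty(CACHE_FILES_KEY);
        if (value == null || value.isEmpty()) {
            return new URI[0];
        }
        String[] parts = value.split(",");
        URI[] uris = new URI[parts.length];
        for (int i = 0; i < parts.length; i++) {
            uris[i] = URI.create(parts[i]);
        }
        return uris;
    }
}
```

In this model, two jobs with separate configuration objects never see 
each other's entries, because each list lives only in the configuration 
it was added to.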


Ioan Eugen Stan
