hadoop-mapreduce-user mailing list archives

From vipul sharma <sharmavipulw...@gmail.com>
Subject Re: Reading large files inside mappers
Date Wed, 12 Jan 2011 20:58:52 GMT
This is exactly what I wanted. Thanks Koji!

On Wed, Jan 12, 2011 at 12:57 PM, Koji Noguchi <knoguchi@yahoo-inc.com> wrote:

> http://hadoop.apache.org/mapreduce/docs/current/mapred_tutorial.html#DistributedCache
> Packaging the files inside the job jar would work, but they would not be
> shared across multiple jobs.
> With the distributed cache, the files are localized on the tasktracker
> nodes and shared among multiple jobs.
> Koji
> On 1/12/11 12:51 PM, "vipul sharma" <sharmavipulwork@gmail.com> wrote:
> I am writing a MapReduce job that converts web pages into attributes such as
> terms, n-grams, domains, regexes, etc. These attributes are kept in separate
> files that are fairly large, about 500 MB in total. Every mapper uses all of
> these files to convert a web page into its attributes. Basically, if a term
> in a file also appears in the web page, that attribute is passed to the
> reducer. This process is called feature extraction in machine learning. I am
> wondering what the best way is to access these files from the mappers. Should
> I store them on HDFS and open and read them inside every mapper, or should I
> package them inside the job jar? I appreciate your help, and thanks for the
> suggestions.
> Vipul Sharma

Vipul Sharma
sharmavipul AT gmail DOT com
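
For anyone finding this thread later, here is a minimal sketch of the per-mapper matching step described in the quoted question, once the term file has been localized on the node. It is plain Java with no Hadoop dependency, so it only illustrates the idea: load the terms once into a set, then emit every term that also occurs in the page. The names TermMatcher, loadTerms, and extract are hypothetical, and the one-term-per-line file format is an assumption. In a real job the loading would presumably happen once in the mapper's setup, using the local path that the distributed cache provides (for example via DistributedCache.getLocalCacheFiles(conf) in the API of that era).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: not part of Hadoop, just the matching logic.
class TermMatcher {

    // Reads one term per line (assumed format), trimmed and lower-cased.
    // Blank lines are skipped. The reader is closed when loading finishes.
    static Set<String> loadTerms(Reader source) {
        Set<String> terms = new HashSet<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String term = line.trim().toLowerCase(Locale.ROOT);
                if (!term.isEmpty()) {
                    terms.add(term);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return terms;
    }

    // Convenience overload for a localized file on the task node's disk.
    static Set<String> loadTerms(Path localFile) {
        try {
            return loadTerms(Files.newBufferedReader(localFile, StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns the known terms that occur in the page, each once,
    // in order of first appearance. Tokens are runs of letters and digits.
    static List<String> extract(Set<String> terms, String pageText) {
        Set<String> found = new LinkedHashSet<>();
        for (String token : pageText.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (terms.contains(token)) {
                found.add(token);
            }
        }
        return new ArrayList<>(found);
    }

    public static void main(String[] args) {
        Set<String> terms = loadTerms(new StringReader("hadoop\nmapreduce\n\ncache\n"));
        // prints [hadoop, mapreduce]
        System.out.println(extract(terms, "Hadoop MapReduce jobs; hadoop rocks."));
    }
}
```

Keeping the terms in a HashSet makes each lookup constant time on average, so the cost per page grows with the page length rather than with the size of the term files. The set is built once per task, not once per record.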
