hadoop-common-user mailing list archives

From Edward Choi <mp2...@gmail.com>
Subject Re: Is it possible to write file output in Map phase once and write another file output in Reduce phase?
Date Fri, 10 Dec 2010 12:23:43 GMT
Wow, thanks for the info. I'll definitely try that.
One question though...
Are the "tagged name" and "free indicator" some kind of special class variables provided by
the MultipleOutputs class?

Ed

From mp2893's iPhone

On 2010. 12. 10., at 5:30 PM, Harsh J <qwertymaniac@gmail.com> wrote:

> Hi,
> 
> You can use the MultipleOutputs class to achieve this, with tagged names
> and free indicators of whether the output came from a map or a reduce.
> 
> On Fri, Dec 10, 2010 at 12:57 PM, edward choi <mp2893@gmail.com> wrote:
>> Hi,
>> 
>> I'm trying to crawl numerous news sites.
>> My plan is to make a file containing a list of all the news RSS feed URLs
>> and the paths where the crawled news articles should be saved.
>> So it would look like this:
>> 
>> nytimes_nation,    /user/hadoop/nytimes
>> nytimes_sports,    /user/hadoop/nytimes
>> latimes_world,     /user/hadoop/latimes
>> latimes_nation,    /user/hadoop/latimes
>> ...
>> ...
>> ...
>> 
>> Each mapper would get a single line, crawl the assigned URL, process the
>> text, and save the result.
>> So the crawling itself does not need a Reduce phase.
>> 
>> But what I'd also like to do is create a dictionary at the same time.
>> This could definitely take advantage of the Reduce phase. Each mapper can
>> generate output as "KEY:term, VALUE:term_frequency".
>> Then the Reducers can merge them all together and create a dictionary. (Of course
>> I would be using many Reducers, so the dictionary would be partitioned.)
>> 
>> I know that I can do this by creating two separate jobs (one for crawling,
>> the other for building the dictionary), but I'd like to do this in one pass.
>> 
>> So my design is:
>> Map phase ==> crawl news articles, process text, write the result to a file.
>>        ||
>>        ||     pass (term, term_frequency) pairs to the Reducer
>>        ||
>>        V
>> Reduce phase ==> merge the (term, term_frequency) pairs and create a
>> dictionary
>> 
>> Is this at all possible? Or is it inherently impossible due to the structure
>> of Hadoop?
>> If it's possible, could anyone tell me how to do it?
>> 
>> Ed.
>> 
> 
> 
> 
> -- 
> Harsh J
> www.harshj.com
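
The one-pass design discussed in this thread (each map task writes its crawled article to a file as a side effect and also emits (term, term_frequency) pairs, which the reducers merge into a partitioned dictionary) can be sketched outside Hadoop as a small simulation. This is plain Python, not Hadoop or MultipleOutputs code. The function names, the byte-sum partitioner, and the sample inputs are all hypothetical, and no real crawling is done. The -m- and -r- markers in the file names only imitate the way Hadoop labels map-side and reduce-side output files.

```python
import os
import tempfile
from collections import Counter, defaultdict


def map_task(task_id, feed_name, article_text, out_dir):
    """Simulated map task: save the article as a side file, emit term counts."""
    # Side output written directly by the mapper (never seen by the reducers).
    side_path = os.path.join(out_dir, f"articles-m-{task_id:05d}")
    with open(side_path, "w", encoding="utf-8") as f:
        f.write(f"{feed_name}\t{article_text}\n")
    # Regular map output: (term, term_frequency) pairs for the reduce phase.
    return list(Counter(article_text.lower().split()).items())


def shuffle(pairs, num_reducers):
    """Group frequencies by term and assign each term to one reducer."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for term, freq in pairs:
        # Hypothetical deterministic partitioner: byte sum modulo reducer count.
        index = sum(term.encode("utf-8")) % num_reducers
        partitions[index][term].append(freq)
    return partitions


def reduce_task(task_id, grouped, out_dir):
    """Simulated reduce task: sum frequencies and write one dictionary part."""
    totals = {term: sum(freqs) for term, freqs in grouped.items()}
    part_path = os.path.join(out_dir, f"dictionary-r-{task_id:05d}")
    with open(part_path, "w", encoding="utf-8") as f:
        for term in sorted(totals):
            f.write(f"{term}\t{totals[term]}\n")
    return totals


def run_job(inputs, out_dir, num_reducers=2):
    """Run every map task, shuffle the pairs, then run every reduce task."""
    pairs = []
    for task_id, (feed_name, text) in enumerate(inputs):
        pairs.extend(map_task(task_id, feed_name, text, out_dir))
    dictionary = {}
    for task_id, grouped in enumerate(shuffle(pairs, num_reducers)):
        dictionary.update(reduce_task(task_id, grouped, out_dir))
    return dictionary


if __name__ == "__main__":
    # Made-up article texts standing in for crawled feeds.
    sample = [
        ("nytimes_nation", "senate passes budget"),
        ("latimes_world", "budget talks stall"),
    ]
    with tempfile.TemporaryDirectory() as tmp:
        print(run_job(sample, tmp))
        print(sorted(os.listdir(tmp)))
```

The article files and the dictionary parts land in the same output directory and are distinguished only by name, which is roughly the role the named outputs play in the suggestion above.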
