hadoop-common-user mailing list archives

From Edward Choi <mp2...@gmail.com>
Subject Re: Is it possible to write file output in Map phase once and write another file output in Reduce phase?
Date Fri, 10 Dec 2010 18:15:28 GMT
God, I never knew they had a project like this.
I should definitely check it out. I may even be able to use it at my workplace. Thanks for the tip!!

From mp2893's iPhone

On 2010. 12. 10., at 10:36 PM, "Jones, Nick" <nick.jones@amd.com> wrote:

> It might be worth looking into Nutch; it can probably be configured to do the type of crawling you need.
> 
> Nick Jones
> 
> -----Original Message-----
> From: Edward Choi [mailto:mp2893@gmail.com] 
> Sent: Friday, December 10, 2010 6:24 AM
> To: common-user@hadoop.apache.org
> Subject: Re: Is it possible to write file output in Map phase once and write another file output in Reduce phase?
> 
> Wow, thanks for the info. I'll definitely try that.
> One question though...
> Are the "tagged name" and "free indicator" some kind of special class variables provided by the MultipleOutputs class?
> 
> Ed
> 
> From mp2893's iPhone
> 
> On 2010. 12. 10., at 5:30 PM, Harsh J <qwertymaniac@gmail.com> wrote:
> 
>> Hi,
>> 
>> You can use the MultipleOutputs class to achieve this, with tagged
>> names, and you also get a free indicator of whether each output file
>> was written from a map or a reduce.
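
A minimal driver sketch of what Harsh describes, using the new-API
MultipleOutputs (org.apache.hadoop.mapreduce.lib.output). The job setup and
class names here are hypothetical; CrawlMapper and DictionaryReducer are
sketched further down the thread.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class CrawlAndCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "crawl-and-count");
    job.setJarByClass(CrawlAndCountDriver.class);
    job.setMapperClass(CrawlMapper.class);        // hypothetical, sketched below
    job.setReducerClass(DictionaryReducer.class); // hypothetical, sketched below
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // The "tagged name": register a named output for the crawled articles.
    // Files written through it are tagged with the phase that produced them,
    // e.g. articles-m-00000 from a map task or articles-r-00000 from a
    // reduce task -- that m/r suffix is the free indicator Harsh mentions.
    MultipleOutputs.addNamedOutput(job, "articles",
        TextOutputFormat.class, Text.class, Text.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}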
>> 
>> On Fri, Dec 10, 2010 at 12:57 PM, edward choi <mp2893@gmail.com> wrote:
>>> Hi,
>>> 
>>> I'm trying to crawl numerous news sites.
>>> My plan is to make a file containing a list of all the news RSS feed URLs,
>>> and the paths to save the crawled news articles.
>>> So it would be like this:
>>> 
>>> nytimes_nation,    /user/hadoop/nytimes
>>> nytimes_sports,    /user/hadoop/nytimes
>>> latimes_world,      /user/hadoop/latimes
>>> latimes_nation,     /user/hadoop/latimes
>>> ...
>>> ...
>>> ...
>>> 
>>> Each mapper would get a single line, crawl the assigned URL, process the
>>> text, and save the result.
>>> So this part of the job does not need any reduce phase.
>>> 
>>> But what I'd also like to do is create a dictionary at the same time.
>>> This could definitely take advantage of the Reduce phase. Each mapper can
>>> generate output as "KEY: term, VALUE: term_frequency".
>>> Then the Reducers can merge them all together and create a dictionary. (Of
>>> course I would be using many Reducers, so the dictionary would be partitioned.)
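
The map side of that design might look like the following sketch.
fetchArticle and extractTerms are hypothetical stand-ins for the actual
crawling and text-processing code, and "articles" is a named output
registered in the driver, as in the sketch earlier in the thread.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class CrawlMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private MultipleOutputs<Text, IntWritable> out;

  @Override
  protected void setup(Context context) {
    out = new MultipleOutputs<Text, IntWritable>(context);
  }

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // Each input line looks like: "nytimes_nation,    /user/hadoop/nytimes".
    // fields[1] carries the save path; note that MultipleOutputs interprets
    // its base output path relative to the job's output directory, so the
    // feed name is used here as a relative base path.
    String[] fields = line.toString().split(",\\s*");
    String feed = fields[0];

    // Crawl and process the article (stubbed out here), then write it
    // directly from the map phase through the named output.
    String article = fetchArticle(feed);
    out.write("articles", new Text(feed), new Text(article), feed);

    // Also emit (term, term_frequency) pairs toward the reducers for the
    // dictionary.
    for (Map.Entry<String, Integer> e : extractTerms(article).entrySet()) {
      context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
    }
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    out.close(); // flush the side files
  }

  // Hypothetical helpers standing in for the real crawl/text-processing code.
  private String fetchArticle(String feedName) { return ""; }
  private Map<String, Integer> extractTerms(String article) {
    return new HashMap<String, Integer>();
  }
}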
>>> 
>>> I know that I can do this by creating two separate jobs (one for crawling,
>>> the other for making the dictionary), but I'd like to do this in one pass.
>>> 
>>> So my design is:
>>> Map phase ==> crawl news articles, process text, write the result to a file.
>>>       II
>>>       II     pass (term, term_frequency) pairs to the Reducer
>>>       II
>>>       V
>>> Reduce phase ==> Merge the (term, term_frequency) pairs and create a
>>> dictionary
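
The reduce half of the diagram is then an ordinary word-count style merge;
a minimal sketch, matching the hypothetical DictionaryReducer named in the
driver sketch above:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class DictionaryReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text term, Iterable<IntWritable> freqs, Context context)
      throws IOException, InterruptedException {
    // Sum the partial term frequencies coming from all the mappers.
    int total = 0;
    for (IntWritable f : freqs) {
      total += f.get();
    }
    // One (term, total_frequency) line per term in this partition.
    context.write(term, new IntWritable(total));
  }
}

With several reducers, each part-r-NNNNN file holds one shard of the
dictionary, which matches the partitioned dictionary described above.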
>>> 
>>> Is this at all possible? Or is it inherently impossible due to the structure
>>> of Hadoop?
>>> If it's possible, could anyone tell me how to do it?
>>> 
>>> Ed.
>>> 
>> 
>> 
>> 
>> -- 
>> Harsh J
>> www.harshj.com
> 
