hadoop-common-user mailing list archives

From Harsh J <qwertyman...@gmail.com>
Subject Re: Is it possible to write file output in Map phase once and write another file output in Reduce phase?
Date Fri, 10 Dec 2010 08:30:49 GMT

You can use the MultipleOutputs class to achieve this: give each output
a tagged name, and you are free to choose names that indicate whether
the output came from the map or the reduce phase.
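
To make the tagged-name idea concrete, here is a minimal sketch in plain
Java with no Hadoop dependency. It only models the data flow described in
the question. TaggedOutputs is a hypothetical stand-in for
MultipleOutputs, and the output names "map" and "reduce", the
whitespace tokenizer, and the sample articles are all assumptions made
for illustration. For a real job, check the MultipleOutputs javadoc of
your Hadoop version for the actual calls.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TaggedOutputSketch {

    /** Hypothetical stand-in for MultipleOutputs: collects lines under a named output. */
    static class TaggedOutputs {
        final Map<String, List<String>> files = new TreeMap<>();

        void write(String namedOutput, String line) {
            files.computeIfAbsent(namedOutput, k -> new ArrayList<>()).add(line);
        }
    }

    /** Map step: save the processed text under the "map" tag, return per-article term counts. */
    static Map<String, Integer> map(String articleText, TaggedOutputs out) {
        out.write("map", articleText);
        Map<String, Integer> counts = new HashMap<>();
        // Assumption: terms are lower-cased, whitespace-separated tokens.
        for (String term : articleText.toLowerCase().split("\\s+")) {
            if (!term.isEmpty()) {
                counts.merge(term, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Reduce step: sum the frequencies of one term and write the total under the "reduce" tag. */
    static void reduce(String term, List<Integer> frequencies, TaggedOutputs out) {
        int total = 0;
        for (int f : frequencies) {
            total += f;
        }
        out.write("reduce", term + "\t" + total);
    }

    /** Runs both steps, grouping map results by term in between as the shuffle would. */
    static TaggedOutputs run(List<String> articles) {
        TaggedOutputs out = new TaggedOutputs();
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String article : articles) {
            for (Map.Entry<String, Integer> e : map(article, out).entrySet()) {
                grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
            }
        }
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            reduce(e.getKey(), e.getValue(), out);
        }
        return out;
    }

    public static void main(String[] args) {
        TaggedOutputs out = run(Arrays.asList("hadoop map reduce", "map output"));
        out.files.forEach((name, lines) -> System.out.println(name + " -> " + lines));
    }
}
```

The point of the sketch is that the map side both writes its own tagged
output and passes (term, term_frequency) pairs on, so the crawl results
and the dictionary come out of a single pass.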

On Fri, Dec 10, 2010 at 12:57 PM, edward choi <mp2893@gmail.com> wrote:
> Hi,
> I'm trying to crawl numerous news sites.
> My plan is to make a file containing a list of all the news RSS feed URLs,
> and the paths to save the crawled news articles.
> So it would be like this:
> nytimes_nation,    /user/hadoop/nytimes
> nytimes_sports,    /user/hadoop/nytimes
> latimes_world,      /user/hadoop/latimes
> latimes_nation,     /user/hadoop/latimes
> ...
> ...
> ...
> Each mapper would get a single line and crawl the assigned url, process
> text, and save the result.
> So this job does not need any Reducing process.
> But what I'd also like to do is to create a dictionary at the same time.
> This could definitely take advantage of Reduce phase. Each mapper can
> generate output as "KEY:term, VALUE:term_frequency"
> Then Reducer can merge them all together and create a dictionary. (Of course
> I would be using many Reducers so the dictionary would be partitioned)
> I know that I can do this by creating two separate jobs (one for crawling,
> the other for making dictionary), but I'd like to do this in one-pass.
> So my design is:
> Map phase ==> crawl news articles, process text, write the result to a file.
>        II
>        II     pass (term, term_frequency) pair to the Reducer
>        II
>        V
> Reduce phase ==> Merge the (term, term_frequency) pair and create a
> dictionary
> Is this at all possible? Or is it inherently impossible due to the structure
> of Hadoop?
> If it's possible, could anyone tell me how to do it?
> Ed.

Harsh J
