hadoop-common-user mailing list archives

From edward choi <mp2...@gmail.com>
Subject Is it possible to write file output in Map phase once and write another file output in Reduce phase?
Date Fri, 10 Dec 2010 07:27:42 GMT

I'm trying to crawl numerous news sites.
My plan is to make a file listing all the news RSS feed URLs, each with
the path where the crawled news articles should be saved.
So it would look like this:

nytimes_nation,    /user/hadoop/nytimes
nytimes_sports,    /user/hadoop/nytimes
latimes_world,      /user/hadoop/latimes
latimes_nation,     /user/hadoop/latimes

Each mapper would get a single line, crawl the assigned URL, process the
text, and save the result.
So this job does not need any Reduce phase.

But what I'd also like to do is create a dictionary at the same time.
This could definitely take advantage of the Reduce phase. Each mapper can
emit (KEY: term, VALUE: term_frequency) pairs, and the Reducers can then
merge them all together into a dictionary. (Of course I would be using
many Reducers, so the dictionary would be partitioned.)

I know that I can do this with two separate jobs (one for crawling, the
other for building the dictionary), but I'd like to do it in one pass.

So my design is:
Map phase ==> crawl news articles, process text, write the result to a file;
              also pass (term, term_frequency) pairs to the Reducer
Reduce phase ==> merge the (term, term_frequency) pairs and create a dictionary
Is this at all possible? Or is it inherently impossible due to the structure
of Hadoop?
If it's possible, could anyone tell me how to do it?

