hadoop-common-user mailing list archives

From Edward Choi <mp2...@gmail.com>
Subject Re: Is it possible to write file output in Map phase once and write another file output in Reduce phase?
Date Fri, 10 Dec 2010 18:20:42 GMT
Thanks for the tip. It looks like a somewhat different project from Nutch. My understanding is
that Nutch tries to implement a complete web search package, while Bixo focuses on the crawling
part. I should look into both projects more deeply. Thanks again!


From mp2893's iPhone

On Dec 11, 2010, at 1:15 AM, Ted Dunning <tdunning@maprtech.com> wrote:

> That is definitely possible, but may not be very desirable.
> Take a look at the Bixo project for a full-scale crawler. There is a lot of
> subtlety in fetching URLs, due to the varying quality of different sites and
> the interaction with crawl choking due to robots.txt considerations.
> http://bixo.101tec.com/
> On Thu, Dec 9, 2010 at 11:27 PM, edward choi <mp2893@gmail.com> wrote:
>> So my design is:
>> Map phase ==> crawl news articles, process text, write the result to a file.
>>       II
>>       II     pass (term, term_frequency) pair to the Reducer
>>       II
>>       V
>> Reduce phase ==> merge the (term, term_frequency) pairs and create a
>> dictionary.
>> Is this at all possible? Or is it inherently impossible due to the structure
>> of Hadoop?
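
The design quoted above can be sketched locally to show its data flow. This is a minimal plain-Python simulation, not Hadoop code: the function names (map_phase, shuffle, reduce_phase, run_job), the file layout, and the whitespace tokenizer are all assumptions made for illustration, and the article text is assumed to be already fetched, so no crawling is shown. In a real job the side files would normally be written to HDFS, for example through Hadoop's MultipleOutputs helper.

```python
import os
import tempfile
from collections import Counter, defaultdict


def map_phase(doc_id, text, side_dir):
    """Process one article, write it to a side file, and emit (term, tf) pairs."""
    terms = text.lower().split()  # assumed tokenizer: lowercase + whitespace split
    # Side-effect output of the map phase: one processed file per article.
    with open(os.path.join(side_dir, f"{doc_id}.txt"), "w", encoding="utf-8") as f:
        f.write(" ".join(terms))
    return list(Counter(terms).items())


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for term, tf in pairs:
        grouped[term].append(tf)
    return grouped


def reduce_phase(grouped, dict_path):
    """Merge per-article term frequencies and write the dictionary file."""
    dictionary = {term: sum(tfs) for term, tfs in grouped.items()}
    with open(dict_path, "w", encoding="utf-8") as f:
        for term in sorted(dictionary):
            f.write(f"{term}\t{dictionary[term]}\n")
    return dictionary


def run_job(articles, out_dir):
    """Run map, shuffle, and reduce in-process over a {doc_id: text} mapping."""
    side_dir = os.path.join(out_dir, "articles")
    os.makedirs(side_dir, exist_ok=True)
    emitted = []
    for doc_id, text in articles.items():
        emitted.extend(map_phase(doc_id, text, side_dir))
    return reduce_phase(shuffle(emitted), os.path.join(out_dir, "dictionary.tsv"))


if __name__ == "__main__":
    out = tempfile.mkdtemp()
    print(run_job({"a1": "Hadoop map reduce", "a2": "map output map"}, out))
```

The sketch produces two kinds of output: one file per article from the map step, and a single dictionary file from the reduce step. It leaves out what a cluster adds, such as task retries and speculative execution, which can run the same map task more than once and so require side-file writes to be safe to repeat.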
