hadoop-common-user mailing list archives

From Edward Choi <mp2...@gmail.com>
Subject Re: Is it possible to write file output in Map phase once and write another file output in Reduce phase?
Date Sat, 11 Dec 2010 02:42:08 GMT
Thanks for the advice. But my plan is to crawl news RSS feeds every 30 minutes, so I'd be downloading
at most 5 to 10 news articles per map task (since news articles aren't published that often). So I
guess I won't have to worry too much about the crawling delay.
I thought it would be a good idea to build a dictionary during the crawling process, because
I will need a dictionary to calculate tf-idf and I didn't want to have to go through
the whole repository every time a news article is added.
If I crawl and build the dictionary at the same time, all I need to do to keep it up to date is
to merge the new counts (which are generated every 30 minutes) into the existing dictionary,
which I guess will be computationally cheap.
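
Merging the per-crawl counts into the running dictionary is just a key-wise sum. Here is a minimal in-memory sketch of that step; the Map<String, Long> representation and the merge() helper are illustrative assumptions, not anything fixed by the thread:

import java.util.HashMap;
import java.util.Map;

public class DictionaryMerge {

    // Fold the (term -> count) map produced by the latest 30-minute crawl
    // into the running dictionary, summing counts key by key.
    public static void merge(Map<String, Long> dictionary,
                             Map<String, Long> latestCrawl) {
        for (Map.Entry<String, Long> entry : latestCrawl.entrySet()) {
            Long existing = dictionary.get(entry.getKey());
            dictionary.put(entry.getKey(),
                    existing == null ? entry.getValue()
                                     : existing + entry.getValue());
        }
    }

    public static void main(String[] args) {
        Map<String, Long> dictionary = new HashMap<String, Long>();
        Map<String, Long> latest = new HashMap<String, Long>();
        dictionary.put("hadoop", 10L);
        latest.put("hadoop", 3L);
        latest.put("mapreduce", 1L);
        merge(dictionary, latest);
        // dictionary now holds hadoop=13, mapreduce=1
    }
}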


From mp2893's iPhone

On 2010. 12. 11., at 3:42 AM, Ted Dunning <tdunning@maprtech.com> wrote:

> Regarding the idea of doing word counts during the crawl, I think you are
> motivated by the best of principles (read input only once), but in practice,
> you will be doing many small crawls and saving the content.  Word counting
> should probably not be tied too closely to the crawl because the crawl can
> be delayed arbitrarily.  Better to have a good content repository that is
> updated as often as crawls complete and run other processing against the
> repository whenever it seems like a good idea.
> 2010/12/10 Edward Choi <mp2893@gmail.com>
>> Thanks for the tip. I guess it's a somewhat different project from Nutch. My
>> understanding is that while Nutch tries to implement a whole web search
>> package, Bixo focuses on the crawling part. I should look into both projects
>> more deeply. Thanks again!!
>> Ed
>> From mp2893's iPhone
>> On 2010. 12. 11., at 1:15 AM, Ted Dunning <tdunning@maprtech.com> wrote:
>>> That is definitely possible, but may not be very desirable.
>>> Take a look at the Bixo project for a full-scale crawler.  There is a lot
>>> of subtlety in the fetching of URLs due to the varying quality of
>>> different sites and the interaction with crawl choking due to robots.txt
>>> considerations.
>>> http://bixo.101tec.com/
>>> On Thu, Dec 9, 2010 at 11:27 PM, edward choi <mp2893@gmail.com> wrote:
>>>> So my design is:
>>>> Map phase ==> crawl news articles, process text, write the result to a
>>>> file.
>>>>      II
>>>>      II     pass (term, term_frequency) pairs to the Reducer
>>>>      II
>>>>      V
>>>> Reduce phase ==> Merge the (term, term_frequency) pairs and create a
>>>> dictionary
>>>> Is this at all possible? Or is it inherently impossible due to the
>>>> structure
>>>> of Hadoop?
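
A minimal sketch of that design using the Hadoop 0.20 mapreduce API: the mapper writes the processed article text straight to HDFS as a map-phase side output and emits (term, 1) pairs, and the reducer sums them into the dictionary as the regular reduce-phase output. fetchArticle(), the articles/ output path, and the whitespace tokenizer are illustrative assumptions, not part of the original thread:

import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class CrawlAndCount {

    public static class CrawlMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable offset, Text url, Context context)
                throws IOException, InterruptedException {
            // Hypothetical helper: fetch one article and strip it down to text.
            String article = fetchArticle(url.toString());

            // Map-phase file output: write the processed text to HDFS directly.
            FileSystem fs = FileSystem.get(context.getConfiguration());
            Path sideFile = new Path("articles/" + offset.get() + ".txt"); // assumed layout
            FSDataOutputStream out = fs.create(sideFile, true);
            out.write(article.getBytes("UTF-8"));
            out.close();

            // Regular map output: one (term, 1) pair per token.
            for (String term : article.toLowerCase().split("\\s+")) {
                if (term.length() > 0) {
                    context.write(new Text(term), ONE);
                }
            }
        }

        private String fetchArticle(String url) {
            // Placeholder for the actual HTTP fetch and text extraction.
            return "";
        }
    }

    public static class DictionaryReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text term, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            // Reduce-phase file output: the (term, total_frequency) dictionary.
            context.write(term, new IntWritable(sum));
        }
    }
}

One caveat with side files written from map tasks: if speculative execution is on, or a failed task is re-run, the same map can execute more than once, so the side-file names need to be unique per task attempt (or speculative execution disabled for the job) to avoid clobbering.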
