hadoop-common-user mailing list archives

From Ted Dunning <tdunn...@maprtech.com>
Subject Re: Is it possible to write file output in Map phase once and write another file output in Reduce phase?
Date Fri, 10 Dec 2010 18:42:41 GMT
Regarding the idea of doing word counts during the crawl, I think you are
motivated by the best of principles (read input only once), but in practice,
you will be doing many small crawls and saving the content.  Word counting
should probably not be tied too closely to the crawl because the crawl can
be delayed arbitrarily.  Better to have a good content repository that is
updated as often as crawls complete and run other processing against the
repository whenever it seems like a good idea.
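
To make the decoupling concrete, here is a minimal sketch in plain Python, not Hadoop code. It assumes a hypothetical repository layout of one directory per crawl with one text file per page; the names `save_crawl` and `count_words` are illustrative only. Crawls append to the repository whenever they finish, and the word count is a separate pass over whatever is there at the time.

```python
import os
import re
import tempfile
from collections import Counter

# Simple tokenizer: runs of lowercase letters and digits (an assumption).
TOKEN = re.compile(r"[a-z0-9]+")


def save_crawl(repo_dir, crawl_id, pages):
    """Store one finished crawl under repo_dir/crawl_id (hypothetical layout)."""
    crawl_dir = os.path.join(repo_dir, crawl_id)
    os.makedirs(crawl_dir, exist_ok=True)
    for name, text in pages.items():
        path = os.path.join(crawl_dir, name + ".txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)


def count_words(repo_dir):
    """Count terms over everything currently saved in the repository."""
    counts = Counter()
    for root, _dirs, files in os.walk(repo_dir):
        for fname in files:
            with open(os.path.join(root, fname), encoding="utf-8") as f:
                counts.update(TOKEN.findall(f.read().lower()))
    return counts


def demo():
    """Run two small crawls and count after each, independently of the crawl."""
    with tempfile.TemporaryDirectory() as repo:
        save_crawl(repo, "crawl-001", {"a": "Hadoop runs map tasks"})
        first = count_words(repo)
        save_crawl(repo, "crawl-002", {"b": "Map output feeds reduce"})
        second = count_words(repo)
    return first, second
```

The counting pass never waits on a crawl: if a crawl is delayed, `count_words` simply sees the repository as of the last completed one.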

2010/12/10 Edward Choi <mp2893@gmail.com>

> Thanks for the tip. I guess it's a little different project from Nutch. My
> understanding is that while Nutch tries to implement a whole web search
> package, Bixo focuses on the crawling part. I should look into both projects
> more deeply. Thanks again!!
>
> Ed
>
> From mp2893's iPhone
>
> On 2010. 12. 11., at 1:15 AM, Ted Dunning <tdunning@maprtech.com> wrote:
>
> > That is definitely possible, but may not be very desirable.
> >
> > Take a look at the Bixo project for a full-scale crawler.  There is a
> > lot of subtlety in the fetching of URLs due to the varying quality of
> > different sites and the interaction with crawl choking due to
> > robots.txt considerations.
> >
> > http://bixo.101tec.com/
> >
> > On Thu, Dec 9, 2010 at 11:27 PM, edward choi <mp2893@gmail.com> wrote:
> >
> >> So my design is:
> >> Map phase ==> crawl news articles, process text, write the result to a file.
> >>       II
> >>       II     pass (term, term_frequency) pair to the Reducer
> >>       II
> >>       V
> >> Reduce phase ==> Merge the (term, term_frequency) pair and create a dictionary
> >>
> >> Is this at all possible? Or is it inherently impossible due to the
> >> structure of Hadoop?
> >>
>
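
For reference, the design quoted above (the map step writes processed text to a file and passes (term, term_frequency) pairs on, and the reduce step merges them into a dictionary) can be simulated in a few lines. This is a sketch in plain Python that only mimics the map, shuffle, and reduce steps in one process; it does not use the Hadoop API, and the function names, the tokenizer, and the side-file naming are all assumptions made for illustration.

```python
import os
import re
import tempfile
from collections import Counter, defaultdict

# Simple tokenizer: runs of lowercase letters and digits (an assumption).
TOKEN = re.compile(r"[a-z0-9]+")


def map_article(article_id, raw_text, side_dir):
    """Map step: write the processed text to a side file, emit (term, tf) pairs."""
    terms = TOKEN.findall(raw_text.lower())
    side_path = os.path.join(side_dir, article_id + ".txt")
    with open(side_path, "w", encoding="utf-8") as f:
        f.write(" ".join(terms))
    return sorted(Counter(terms).items())


def shuffle(pairs):
    """Group emitted values by term, standing in for the framework's shuffle."""
    grouped = defaultdict(list)
    for term, tf in pairs:
        grouped[term].append(tf)
    return grouped


def reduce_term(term, frequencies):
    """Reduce step: merge the per-article frequencies of one term."""
    return term, sum(frequencies)


def run_job(articles, side_dir):
    """Run map over every article, then reduce every term into a dictionary."""
    emitted = []
    for article_id, raw_text in articles.items():
        emitted.extend(map_article(article_id, raw_text, side_dir))
    grouped = shuffle(emitted)
    return dict(reduce_term(term, tfs) for term, tfs in sorted(grouped.items()))


def demo():
    """Two tiny articles: returns the merged dictionary and the side files written."""
    articles = {"n1": "Rain, rain and wind", "n2": "Wind warning"}
    with tempfile.TemporaryDirectory() as side_dir:
        dictionary = run_job(articles, side_dir)
        side_files = sorted(os.listdir(side_dir))
    return dictionary, side_files
```

In this simulation each map call produces both outputs: one side file per article plus the pairs that feed the reduce step, which is the shape of the design being asked about.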
