hadoop-common-user mailing list archives

From Ted Dunning <tdunn...@maprtech.com>
Subject Re: Is it possible to write file output in Map phase once and write another file output in Reduce phase?
Date Sat, 11 Dec 2010 07:34:43 GMT
If you are only loading articles at that rate, I would suggest that a simple
java or perl or ruby program would be MUCH easier to write and debug than a
full-on map-reduce program.
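
As a rough sketch of that suggestion: a small standalone Java program could
poll the feed URLs, count terms, and fold the new counts into the existing
dictionary on each run. The command-line feed list, the tab-separated
dictionary file, and the crude tokenization below are assumptions made purely
for illustration; a real crawler would parse the RSS and article text properly.

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

// Rough sketch only: fetch a few feeds, count terms, merge the counts into an
// existing dictionary file. The "term<TAB>count" format is an assumption.
public class FeedTermCounter {

    public static void main(String[] args) throws IOException {
        File dictFile = new File("dictionary.tsv");   // assumed local dictionary file
        Map<String, Long> dict = loadDictionary(dictFile);

        for (String feedUrl : args) {                 // feed URLs passed on the command line
            String content = fetch(feedUrl);
            // Crude tokenization; a real crawler would parse the RSS/article text first.
            for (String term : content.toLowerCase().split("[^a-z0-9]+")) {
                if (term.length() == 0) continue;
                Long old = dict.get(term);
                dict.put(term, old == null ? 1L : old + 1L);   // merge new counts in
            }
        }
        saveDictionary(dictFile, dict);
    }

    static String fetch(String url) throws IOException {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(url).openStream(), "UTF-8"));
        try {
            StringBuilder sb = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append('\n');
            }
            return sb.toString();
        } finally {
            in.close();
        }
    }

    static Map<String, Long> loadDictionary(File f) throws IOException {
        Map<String, Long> dict = new HashMap<String, Long>();
        if (!f.exists()) {
            return dict;
        }
        BufferedReader in = new BufferedReader(new FileReader(f));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split("\t");
                if (parts.length == 2) {
                    dict.put(parts[0], Long.parseLong(parts[1]));
                }
            }
        } finally {
            in.close();
        }
        return dict;
    }

    static void saveDictionary(File f, Map<String, Long> dict) throws IOException {
        PrintWriter out = new PrintWriter(new FileWriter(f));
        try {
            for (Map.Entry<String, Long> e : dict.entrySet()) {
                out.println(e.getKey() + "\t" + e.getValue());
            }
        } finally {
            out.close();
        }
    }
}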

2010/12/10 Edward Choi <mp2893@gmail.com>

> Thanks for the advice. But my plan is to crawl news RSS feeds every 30
> minutes, so I'd be downloading at most 5 to 10 news articles per map task
> (since news isn't published that often). So I guess I won't have to worry
> too much about the crawling delay.
> I thought it would be a good idea to build a dictionary during the crawling
> process, because I will need a dictionary to calculate tf-idf and I didn't
> want to have to go through the whole repository every time a news article
> is added.
> If I crawl and build the dictionary at the same time, all I need to do is
> merge the new dictionaries (which are generated every 30 minutes) with the
> existing dictionary, which I guess will be computationally cheap.
>
> Ed
>
> From mp2893's iPhone
>
> On 2010. 12. 11., at 3:42 AM, Ted Dunning <tdunning@maprtech.com> wrote:
>
> > Regarding the idea of doing word counts during the crawl, I think you are
> > motivated by the best of principles (read input only once), but in
> > practice, you will be doing many small crawls and saving the content.
> > Word counting should probably not be tied too closely to the crawl because
> > the crawl can be delayed arbitrarily.  Better to have a good content
> > repository that is updated as often as crawls complete and run other
> > processing against the repository whenever it seems like a good idea.
> >
> > 2010/12/10 Edward Choi <mp2893@gmail.com>
> >
> >> Thanks for the tip. I guess it's a little different project from Nutch.
> >> My understanding is that while Nutch tries to implement a whole web
> >> search package, Bixo focuses on the crawling part. I should look into
> >> both projects more deeply. Thanks again!!
> >>
> >> Ed
> >>
> >> From mp2893's iPhone
> >>
> >> On 2010. 12. 11., at 1:15 AM, Ted Dunning <tdunning@maprtech.com> wrote:
> >>
> >>> That is definitely possible, but may not be very desirable.
> >>>
> >>> Take a look at the Bixo project for a full-scale crawler.  There is a
> >>> lot of subtlety in the fetching of URL's due to the varying quality of
> >>> different sites and the interaction with crawl choking due to robots.txt
> >>> considerations.
> >>>
> >>> http://bixo.101tec.com/
> >>>
> >>> On Thu, Dec 9, 2010 at 11:27 PM, edward choi <mp2893@gmail.com> wrote:
> >>>
> >>>> So my design is:
> >>>> Map phase ==> crawl news articles, process text, write the result to a file.
> >>>>      II
> >>>>      II     pass (term, term_frequency) pairs to the Reducer
> >>>>      II
> >>>>      V
> >>>> Reduce phase ==> Merge the (term, term_frequency) pairs and create a dictionary
> >>>>
> >>>> Is this at all possible? Or is it inherently impossible due to the
> >>>> structure of Hadoop?
> >>>>
> >>
>
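
For reference, the two outputs in the quoted design (a file written from the
map phase plus the merged dictionary written from the reduce phase) can be
sketched with Hadoop's MultipleOutputs. The sketch below is only an
illustration under some assumptions: it uses the new-API class
org.apache.hadoop.mapreduce.lib.output.MultipleOutputs (older releases have a
mapred.lib equivalent), it assumes the input already contains one processed
article per line (the actual crawling/fetching is omitted), and the job,
class, and named-output names are made up.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Sketch of the quoted design: the map phase writes article text to a side
// output AND emits (term, 1) pairs; the reduce phase merges them into a
// dictionary. Input is assumed to be one processed article per line.
public class NewsCrawlJob {

    public static class ArticleMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private MultipleOutputs<Text, LongWritable> mos;
        private final LongWritable one = new LongWritable(1);

        protected void setup(Context context) {
            mos = new MultipleOutputs<Text, LongWritable>(context);
        }

        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Output #1: the (processed) article itself, written from the map phase.
            mos.write("articles", key, value, "articles/part");
            // Output #2: term counts, passed on to the reduce phase as usual.
            for (String term : value.toString().toLowerCase().split("\\s+")) {
                if (term.length() > 0) {
                    context.write(new Text(term), one);
                }
            }
        }

        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            mos.close();   // flush the side output files
        }
    }

    public static class TermCountReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        protected void reduce(Text term, Iterable<LongWritable> counts, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable c : counts) {
                sum += c.get();
            }
            context.write(term, new LongWritable(sum));   // one dictionary entry per term
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "news-term-dictionary");
        job.setJarByClass(NewsCrawlJob.class);
        job.setMapperClass(ArticleMapper.class);
        job.setReducerClass(TermCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Declare the map-side file output alongside the normal reduce output.
        MultipleOutputs.addNamedOutput(job, "articles",
                TextOutputFormat.class, LongWritable.class, Text.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}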
