hadoop-common-user mailing list archives

From "Alex Loddengaard" <alex...@google.com>
Subject Re: Basic code organization questions + scheduling
Date Mon, 08 Sep 2008 02:27:39 GMT
Hi Tarjei,

You should take a look at Nutch.  It's a search engine built on Lucene,
and it can be set up to run on top of Hadoop.  Take a look:


Hope this helps!


On Mon, Sep 8, 2008 at 2:54 AM, Tarjei Huse <tarjei@nu.no> wrote:

> Hi, I'm planning to use Hadoop for a set of typical crawler/indexer
> tasks. The basic flow is:
>
> input:    array of URLs
>               |
> 1.        get pages
>               |
> 2.        extract new URLs from pages -> start new job
>           extract text -> index / filter (as new jobs)
> What I'm considering is how to structure this application in the
> map/reduce context. I'm thinking that steps 1 and 2 should be separate
> map/reduce jobs, each piping its output on to the next step.
> This is where I'm a bit at a loss: I don't see how best to organize the
> code into logical units, or how to spawn new jobs when an old one
> finishes.
> Is the usual way to control the flow of a set of jobs to have an
> external application running that listens for jobs ending via the
> endNotificationUri and then spawns new jobs, or should the job itself
> contain code to create new ones? Would it be a good idea to use
> Cascading here?
> I'm also considering how I should do job scheduling (I have a lot of
> recurring tasks). Has anyone found a good framework for controlling
> recurring jobs, or should I plan to build my own using Quartz?
> Any tips/best practices with regard to the issues described above are most
> welcome. Feel free to ask further questions if you find my descriptions of
> the issues lacking.
> Kind regards,
> Tarjei
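
On the job-chaining question above: one common pattern is a driver program that submits each MapReduce job with a blocking call (e.g. JobClient.runJob in the 0.18 API) and feeds one job's output into the next job's input, with no endNotificationUri callback needed. Below is a minimal structural sketch of that driver loop; runFetchJob and runExtractJob are hypothetical stand-ins for the real job submissions, and no actual Hadoop API is used:

```java
import java.util.*;

// Sketch of the driver-side control flow for chained crawl rounds.
// runFetchJob / runExtractJob are hypothetical placeholders for
// blocking MapReduce job submissions, not real Hadoop calls.
public class CrawlDriver {

    // Simulated "fetch" round: in a real setup this would be a
    // MapReduce job whose input is the frontier and output is pages.
    static Map<String, String> runFetchJob(Set<String> frontier) {
        Map<String, String> pages = new HashMap<>();
        for (String url : frontier) {
            pages.put(url, "<html>...content of " + url + "...</html>");
        }
        return pages;
    }

    // Simulated "extract" round: parses pages and returns outlinks.
    static Set<String> runExtractJob(Map<String, String> pages) {
        Set<String> outlinks = new HashSet<>();
        for (String url : pages.keySet()) {
            outlinks.add(url + "/child"); // placeholder link extraction
        }
        return outlinks;
    }

    // The driver blocks on each job, then feeds its output into the
    // next round's input -- the external-listener question goes away.
    public static Set<String> crawl(Set<String> seeds, int maxRounds) {
        Set<String> seen = new HashSet<>(seeds);
        Set<String> frontier = new HashSet<>(seeds);
        for (int round = 0; round < maxRounds && !frontier.isEmpty(); round++) {
            Map<String, String> pages = runFetchJob(frontier);  // step 1
            Set<String> outlinks = runExtractJob(pages);        // step 2
            outlinks.removeAll(seen);                           // dedupe
            seen.addAll(outlinks);
            frontier = outlinks;                                // next round's input
        }
        return seen;
    }

    public static void main(String[] args) {
        Set<String> seeds = new HashSet<>(Collections.singleton("http://example.org"));
        System.out.println("urls seen: " + crawl(seeds, 2).size());
    }
}
```

Cascading wraps essentially this pattern (plus dependency tracking between steps) behind a higher-level API, so it is a reasonable fit if the pipeline grows beyond a couple of jobs.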
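
On the scheduling question: Quartz gives full cron-style triggers, but for simple fixed-rate recurrence the JDK's own ScheduledExecutorService may already be enough to kick off a crawl driver periodically. A sketch, assuming the scheduled Runnable would in practice submit a Hadoop job (here it only increments a counter so the behavior is observable):

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class RecurringJobs {
    // Counts how many times the recurring task has fired.
    static final AtomicInteger runs = new AtomicInteger();

    public static void main(String[] args) throws Exception {
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
        // In a real deployment this Runnable would launch the crawl
        // driver; the 50 ms period is only for demonstration.
        scheduler.scheduleAtFixedRate(runs::incrementAndGet,
                                      0, 50, TimeUnit.MILLISECONDS);
        Thread.sleep(220);            // let a few periods elapse
        scheduler.shutdown();
        System.out.println("task fired " + runs.get() + " times");
    }
}
```

Quartz is still the better choice when calendar-based schedules ("every night at 02:00"), persistence across restarts, or misfire handling are needed.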
