ignite-user mailing list archives

From szabkel <szab....@gmail.com>
Subject Continuously running jobs
Date Sun, 06 Mar 2016 10:55:13 GMT
Info: I used spaces to indent my examples; I hope they display correctly.
 
I am *new to Ignite* and I would be really happy if you could give me some
help. I am working on a *distributed web crawler* in Java 8 and would like to
use Ignite to distribute jobs across the available nodes. A single job would
run a request against a URL and parse some data. *I would like to behave well
as a crawler*, so I would like to *time the jobs really precisely*: never
requesting the same server (after domain resolution) more often than a
dynamically changing limit allows, while *continuously deciding when to go
back to a URL* to keep the data up to date. The software *would target
specific domains and URLs*, and different content needs to be parsed
differently (basically it is deep web crawling, scraping), so I imagine
something like this:
Crawler
    Scheduler //which times and broadcasts the jobs
    JobGroups
        SiteXJobGroup //tells how to work on Site X
        SiteYJobGroup //tells how to work on Site Y
        Site...JobGroup
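
The per-domain politeness limit could be tracked in plain Java, independent of
Ignite; just to show what I mean (the class name and the fixed delay are only
an illustration, in reality the delay would change dynamically):

```java
import java.util.HashMap;
import java.util.Map;

/** Tracks the earliest time each domain may be requested again. */
class PolitenessLimiter {
    private final Map<String, Long> nextAllowed = new HashMap<>();
    private final long delayMillis; // illustrative fixed delay

    PolitenessLimiter(long delayMillis) { this.delayMillis = delayMillis; }

    /** Returns true and reserves the next slot if the domain may be hit now. */
    synchronized boolean tryAcquire(String domain, long nowMillis) {
        long next = nextAllowed.getOrDefault(domain, 0L);
        if (nowMillis < next)
            return false; // too soon: be polite and back off
        nextAllowed.put(domain, nowMillis + delayMillis);
        return true;
    }
}
```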

The scheduler would load from a database how often to run the specific jobs
(cron strings?). It runs the jobs in parallel with each other (because one
job works with a single domain/group of servers and I don't want to burden
them with my traffic, a single job advances slowly, but I can run several in
parallel up to some limit). I should be able to extend the application later
with new jobs (every piece of information about the process is stored in the
database so it is persistent; even if everything shuts down, the crawl can
continue from where it stopped).

My question is: can I do the scheduling solely with Ignite? Rerunning jobs
that have already completed after some time has passed, and running different
jobs at the same time.
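
If I read the docs correctly, the optional ignite-schedule module offers
IgniteScheduler.scheduleLocal(Runnable, String) with cron-style patterns, so
maybe the cron part could look roughly like this (runJobGroup and the pattern
are placeholders of mine):

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.scheduler.SchedulerFuture;

public class CrawlScheduler {
    public static void main(String[] args) {
        // Requires the optional ignite-schedule module on the classpath.
        try (Ignite ignite = Ignition.start()) {
            // The cron string would come from the database;
            // "*/10 * * * *" ("every 10 minutes") is just an example.
            SchedulerFuture<?> fut = ignite.scheduler().scheduleLocal(
                () -> runJobGroup(ignite, "siteX"), "*/10 * * * *");

            fut.get(); // waits for the next scheduled run; cancel() stops it
        }
    }

    static void runJobGroup(Ignite ignite, String groupId) {
        // pop URLs from the group's frontier and dispatch jobs here
    }
}
```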

A single job (a specific page; this would get broadcast to a node):
job(url argument)
    load the data
    parse the data
    return the data
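
In Java I picture this job as an IgniteCallable; a rough sketch (ParsedData,
fetch() and parse() are placeholders for my own code):

```java
import org.apache.ignite.lang.IgniteCallable;

/** Placeholder for whatever a parsed page yields. */
class ParsedData { }

/** One crawl job: fetch a single URL and parse it. */
class CrawlJob implements IgniteCallable<ParsedData> {
    private final String url;

    CrawlJob(String url) { this.url = url; }

    @Override public ParsedData call() throws Exception {
        String html = fetch(url); // load the data
        return parse(html);       // parse the data and return it
    }

    private String fetch(String url) throws Exception { /* HTTP GET */ return null; }
    private ParsedData parse(String html) { /* site-specific parsing */ return null; }
}

// Dispatch: call() runs the job on one node picked by the load balancer.
// ParsedData result = ignite.compute().call(new CrawlJob(url));
```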

A group of jobs (a single domain):
while (there is a URL in the URLFrontier for the jobs)
    url <- pop url from URLFrontier
    result <- broadcast job(url) //broadcast here is easier, but maybe the Scheduler should do it
    filter the result after it comes back (do I need it? already seen? contained an error? etc.)
    decide whether a new URL should be added to the URLFrontier based on the result (example: next page)
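
In Java I imagine the loop roughly like this (frontier, fetchAndParse, isNew
and the result methods are placeholders of mine; note this is a fragment, not
a complete class):

```java
while (frontier.hasNext()) {
    String url = frontier.pop();

    // call() sends the closure to one node chosen by Ignite's load balancer;
    // broadcast() would run the same URL on *every* node and duplicate work.
    ParsedData result = ignite.compute().call(() -> fetchAndParse(url));

    if (isNew(result) && !result.hasError())  // filter: already seen? error?
        for (String next : result.nextUrls()) // e.g. a "next page" link
            frontier.push(next);
}
```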

Would the Scheduler run all the time? I don't really know how to set this up.
It should refresh/load which jobs need to run (the job groups have an ID or
something), start a job group and then another beside it up to a limit; after
one job group is done, start another one, and when everything is done, start
over again.

I think I understand the ComputeContinuousMapperExample in ignite/examples,
but I need to run multiple continuous mappers. How do I achieve this? (
https://github.com/apache/ignite/blob/master/examples/src/main/java/org/apache/ignite/examples/computegrid/ComputeContinuousMapperExample.java
)
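
My guess is that each execute() creates its own task session, and the mapper
injected via @TaskContinuousMapperResource belongs to that session, so running
several continuous mappers might just mean submitting the task several times,
e.g. with the withAsync() idiom (SiteCrawlTask is a placeholder for a
ComputeTask written like the example's task):

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCompute;
import org.apache.ignite.Ignition;
import org.apache.ignite.compute.ComputeTaskFuture;

try (Ignite ignite = Ignition.start()) {
    IgniteCompute async = ignite.compute().withAsync(); // Ignite 1.x async idiom

    // Each execute() gets its own session, hence its own continuous mapper.
    async.execute(SiteCrawlTask.class, "siteX");
    ComputeTaskFuture<Integer> futX = async.future();

    async.execute(SiteCrawlTask.class, "siteY");
    ComputeTaskFuture<Integer> futY = async.future();

    futX.get(); // wait for both crawls to finish
    futY.get();
}
```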

Thank you for your help, really, I appreciate it!



--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Continuously-running-jobs-tp3376.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.
