From: "Clemens Marschner"
To: "Lucene Developers List" <lucene-dev@jakarta.apache.org>
Subject: LARM Crawler: Repository
Date: Fri, 21 Jun 2002 22:06:25 +0200

Ok, I think I got your point.

> You have MySQL to hold your links.
> You have N crawler threads.
> You don't want to hit MySQL a lot, so you get links to crawl in batches
> (e.g. each crawler thread tells MySQL: give me 1000 links to crawl).

[Just to make it clear: this sounds as if the threads were "pulling" tasks. In fact the MessageHandler pushes tasks, which are then distributed to the thread pool.]

That's still one of the things I want to change next: at the moment the message processing pipeline does not work in batch mode; each URLMessage is transmitted on its own. I have wanted to change this for a while in order to reduce the number of synchronization points, but it was not a top priority because the overall process was still pretty much I/O bound. From my early experiments with the Repository, however, I can see that this is becoming more and more important.

I read the paper about the WebRace crawler (http://citeseer.nj.nec.com/zeinalipour-yazti01highperformance.html, pp. 12-15) and thought about the following way the repository should work (sketched in code below):

- For each URL put into the pipeline, check whether it has already been crawled; if so, save the lastModified date into the URLMessage.
- In the crawling task, check whether the lastModified timestamp is set. If it is, send an "If-Modified-Since" header along with the GET request.
  -> If the answer is a 304 (Not Modified), load the outbound links from the repository and put them back into the pipeline.
  -> If a 404 (or similar) status code is returned, delete the document and its links from the repository.
  -> If the document was modified, delete the old stored links, update the timestamp of the document in the repository, and continue as if the document were new.
- If the document is new, load it, parse it, save its links, and put them back into the pipeline.

(Any better ideas?)

I have already implemented a rather naive version of this today, which (unsurprisingly) turns out to be slower than crawling everything from scratch...
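To make that decision logic concrete, here is a minimal sketch of the crawling-task side. It is written against plain java.net.HttpURLConnection purely for illustration (the crawler itself uses HTTPClient), and the Repository and Pipeline interfaces are made up for this example; they are not the actual LARM classes.

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Collections;
import java.util.List;

public class CrawlTaskSketch {

    /** Hypothetical repository interface, only for this sketch. */
    interface Repository {
        long lastModified(String url);           // 0 if the URL has never been crawled
        List<String> outboundLinks(String url);  // links stored for an already crawled doc
        void delete(String url);                 // drop the doc and its links
        void update(String url, long lastModified, List<String> links);
    }

    /** Hypothetical stand-in for putting URLMessages back into the pipeline. */
    interface Pipeline {
        void putBack(List<String> urls);
    }

    void crawl(String urlString, Repository repo, Pipeline pipeline) throws Exception {
        long lastModified = repo.lastModified(urlString);

        HttpURLConnection con = (HttpURLConnection) new URL(urlString).openConnection();
        if (lastModified > 0) {
            // already crawled once: ask the server only for changes
            con.setIfModifiedSince(lastModified);
        }

        int status = con.getResponseCode();
        if (status == HttpURLConnection.HTTP_NOT_MODIFIED) {
            // 304: unchanged, so reuse the stored outbound links
            pipeline.putBack(repo.outboundLinks(urlString));
        } else if (status >= 400) {
            // 404 and friends: the doc is gone, remove it and its links
            repo.delete(urlString);
        } else {
            // new or modified: fetch, parse, and replace the stored links
            List<String> links = parseLinks(con.getInputStream());
            repo.update(urlString, con.getLastModified(), links);
            pipeline.putBack(links);
        }
        con.disconnect();
    }

    /** Link extraction omitted; the crawler's HTML parser would be plugged in here. */
    List<String> parseLinks(InputStream in) {
        return Collections.emptyList();
    }
}

The interesting part is only the three-way branch on the response code; how the repository actually stores and retrieves the links is exactly the open question below.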
What I've learned:

- The repository must load the information about already crawled documents into main memory at startup (which means main memory must be large enough to hold all these URLs plus some extra info; URLVisitedFilter already does this today), and, more importantly...
- it needs a more efficient means of accessing the links than a regular SQL table of {referer, target} pairs. The Meta-Info store described for the WebRace crawler may be a solution (a plain text file that contains all document meta-data, with its index held in main memory), but it prevents the URLs from being ordered in other ways (e.g. all inlinks to a document), which is what I need for my further studies.

> The crawler fetches all pages, and they go through your component
> pipeline and get processed.
> What happens if after fetching 100 links from this batch of 1000 the
> crawler thread dies? Do you keep track of which links in that batch
> you've crawled, so that in case the thread dies you don't recrawl
> those?

That's roughly what I meant. First of all, I have invested a lot of time in preventing threads from dying at all. That's one reason why I chose HTTPClient: it has never hung so far. A lot of exceptions are caught at the task level. I had a lot of problems with hanging threads when I still used the java.net.URLConnection classes, but no more. I have also learned that "whatever can go wrong, will go wrong, very soon". That is why I patched the HTTPClient classes to introduce a maximum size for fetched files. I can also imagine a crawler trap in which a server process sends characters very slowly, a technique some spam filters use.

That's where the ThreadMonitor comes in. Each task publishes its state (e.g. "loading data"), and the ThreadMonitor restarts it when it stays in one state for too long. That is also the place where the ThreadMonitor could save the rest of the batch. This way the ThreadMonitor could become a single point of failure, but the risk of that thread hanging is reduced by keeping it simple - just like a hardware watchdog that makes sure the traffic lights at a street crossing keep working. (A rough sketch of this follows at the end of this mail.)

Regards,

Clemens
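Since it is easier to show than to describe, here is a minimal sketch of the watchdog idea. The Task interface, its methods, and the timeout are made up for this example and are not the actual LARM ThreadMonitor API; the point is only the "restart a task that stays in one state for too long" loop.

import java.util.List;

public class ThreadMonitorSketch extends Thread {

    /** Hypothetical view of a crawler task, only for this sketch. */
    interface Task {
        String getState();          // e.g. "loading data"
        long getStateEnteredAt();   // when the current state was entered (millis)
        void restart();             // abort the hanging fetch, requeue the rest of its batch
    }

    private final List<Task> tasks;
    private final long maxMillisInState;

    public ThreadMonitorSketch(List<Task> tasks, long maxMillisInState) {
        this.tasks = tasks;
        this.maxMillisInState = maxMillisInState;
        setDaemon(true);            // the monitor must never keep the crawler alive on its own
    }

    public void run() {
        while (true) {
            long now = System.currentTimeMillis();
            for (Task t : tasks) {
                // a task sitting in one state for too long is assumed to be stuck,
                // e.g. in a slow-server trap or an oversized download
                if (now - t.getStateEnteredAt() > maxMillisInState) {
                    t.restart();
                }
            }
            try {
                Thread.sleep(1000); // the monitor itself stays as simple as possible
            } catch (InterruptedException e) {
                return;
            }
        }
    }
}

The monitor does nothing but compare timestamps and call restart(), which is what keeps the risk of the watchdog itself hanging low.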