From: "Clemens Marschner"
To: "Lucene Developers List" <lucene-dev@jakarta.apache.org>
Subject: LARM Crawler: Repository
Date: Fri, 21 Jun 2002 22:06:25 +0200

Ok, I think I got your point.

> You have MySQL to hold your links.
> You have N crawler threads.
> You don't want to hit MySQL a lot, so you get links to crawl in batches
> (e.g. each crawler thread tells MySQL: give me 1000 links to crawl).

[Just to make it clear: this sounds as if the threads were "pulling" tasks. In fact the MessageHandler pushes tasks, which are then distributed to the thread pool.]

That's still one of the things I want to change next: at the moment the message processing pipeline does not work in batch mode; each URLMessage is transmitted on its own. I have wanted to change this for a while in order to reduce the number of synchronization points, but it was not a top priority because the overall process was still pretty much I/O bound. From my early experiments with the Repository, however, I can see that this is becoming more and more important.

I read the paper about the WebRace crawler (http://citeseer.nj.nec.com/zeinalipour-yazti01highperformance.html, pp. 12-15) and thought about the following way the repository should work (sketched in code below):

- For each URL put into the pipeline, check whether it has already been crawled; if so, save the lastModified date into the URLMessage.
- In the crawling task, check whether the lastModified timestamp is set. If it is, send an "If-Modified-Since" header along with the GET request.
  -> If the answer is a 304 (Not Modified), load the outbound links from the repository and put them back into the pipeline.
  -> If a 404 (or similar) status code is returned, delete the document and its links from the repository.
  -> If the document was modified, delete the old stored links, update the timestamp of the document in the repository, and continue as if the document were new.
- If the document is new, load it, parse it, save its links, and put them back into the pipeline.

(Any better ideas?)

I have already implemented a rather naive version of this today, which (unsurprisingly) turns out to be slower than crawling everything from scratch...
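To make that decision logic concrete, here is a minimal sketch of the crawling-task side. It is written against plain java.net.HttpURLConnection purely for illustration (the crawler itself uses HTTPClient), and the Repository and Pipeline interfaces are made up for this example; they are not the actual LARM classes.

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Collections;
import java.util.List;

public class CrawlTaskSketch {

    /** Hypothetical repository interface, only for this sketch. */
    interface Repository {
        long lastModified(String url);           // 0 if the URL has never been crawled
        List<String> outboundLinks(String url);  // links stored for an already crawled doc
        void delete(String url);                 // drop the doc and its links
        void update(String url, long lastModified, List<String> links);
    }

    /** Hypothetical stand-in for putting URLMessages back into the pipeline. */
    interface Pipeline {
        void putBack(List<String> urls);
    }

    void crawl(String urlString, Repository repo, Pipeline pipeline) throws Exception {
        long lastModified = repo.lastModified(urlString);

        HttpURLConnection con = (HttpURLConnection) new URL(urlString).openConnection();
        if (lastModified > 0) {
            // already crawled once: ask the server only for changes
            con.setIfModifiedSince(lastModified);
        }

        int status = con.getResponseCode();
        if (status == HttpURLConnection.HTTP_NOT_MODIFIED) {
            // 304: unchanged, so reuse the stored outbound links
            pipeline.putBack(repo.outboundLinks(urlString));
        } else if (status >= 400) {
            // 404 and friends: the doc is gone, remove it and its links
            repo.delete(urlString);
        } else {
            // new or modified: fetch, parse, and replace the stored links
            List<String> links = parseLinks(con.getInputStream());
            repo.update(urlString, con.getLastModified(), links);
            pipeline.putBack(links);
        }
        con.disconnect();
    }

    /** Link extraction omitted; the crawler's HTML parser would be plugged in here. */
    List<String> parseLinks(InputStream in) {
        return Collections.emptyList();
    }
}

The interesting part is only the three-way branch on the response code; how the repository actually stores and retrieves the links is exactly the open question below.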
What I've learned:

- The repository must load the information about already crawled documents into main memory at startup (which means main memory must be large enough to hold all these URLs plus some extra info; URLVisitedFilter already does this today), and, more importantly...
- it needs a more efficient means of accessing the links than a regular SQL table of {referer, target} pairs. The Meta-Info store described for the WebRace crawler may be a solution (a plain text file that contains all document meta-data, with its index held in main memory), but it prevents the URLs from being ordered in other ways (e.g. all inlinks to a document), which is what I need for my further studies.

> The crawler fetches all pages, and they go through your component
> pipeline and get processed.
> What happens if after fetching 100 links from this batch of 1000 the
> crawler thread dies? Do you keep track of which links in that batch
> you've crawled, so that in case the thread dies you don't recrawl
> those?

That's roughly what I meant. First of all, I have invested a lot of time in preventing threads from dying at all. That's one reason why I chose HTTPClient: it has never hung so far. A lot of exceptions are caught at the task level. I had a lot of problems with hanging threads when I still used the java.net.URLConnection classes, but no more. I have also learned that "whatever can go wrong, will go wrong, very soon". That is why I patched the HTTPClient classes to introduce a maximum size for fetched files. I can also imagine a crawler trap in which a server process sends characters very slowly, a technique some spam filters use.

That's where the ThreadMonitor comes in. Each task publishes its state (e.g. "loading data"), and the ThreadMonitor restarts it when it stays in one state for too long. That is also the place where the ThreadMonitor could save the rest of the batch. This way the ThreadMonitor could become a single point of failure, but the risk of that thread hanging is reduced by keeping it simple - just like a hardware watchdog that makes sure the traffic lights at a street crossing keep working. (A rough sketch of this follows at the end of this mail.)

Regards,

Clemens
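Since it is easier to show than to describe, here is a minimal sketch of the watchdog idea. The Task interface, its methods, and the timeout are made up for this example and are not the actual LARM ThreadMonitor API; the point is only the "restart a task that stays in one state for too long" loop.

import java.util.List;

public class ThreadMonitorSketch extends Thread {

    /** Hypothetical view of a crawler task, only for this sketch. */
    interface Task {
        String getState();          // e.g. "loading data"
        long getStateEnteredAt();   // when the current state was entered (millis)
        void restart();             // abort the hanging fetch, requeue the rest of its batch
    }

    private final List<Task> tasks;
    private final long maxMillisInState;

    public ThreadMonitorSketch(List<Task> tasks, long maxMillisInState) {
        this.tasks = tasks;
        this.maxMillisInState = maxMillisInState;
        setDaemon(true);            // the monitor must never keep the crawler alive on its own
    }

    public void run() {
        while (true) {
            long now = System.currentTimeMillis();
            for (Task t : tasks) {
                // a task sitting in one state for too long is assumed to be stuck,
                // e.g. in a slow-server trap or an oversized download
                if (now - t.getStateEnteredAt() > maxMillisInState) {
                    t.restart();
                }
            }
            try {
                Thread.sleep(1000); // the monitor itself stays as simple as possible
            } catch (InterruptedException e) {
                return;
            }
        }
    }
}

The monitor does nothing but compare timestamps and call restart(), which is what keeps the risk of the watchdog itself hanging low.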