lucene-dev mailing list archives

From Peter Carlson <>
Subject Re: Configuration RFC
Date Mon, 15 Jul 2002 03:49:44 GMT
Hi Clemens,

I read the document you put together about this crawler. Thanks.

Below are some comments and questions from someone just getting into crawling concepts but trying to offer constructive ideas. I have not looked at the code yet; that's next on my list. I hope this is helpful and starts a good dialog.

1) The MessageQueue system seems somewhat problematic because of memory
issues. It seems like it should be an abstract class with a few potential
implementations, including your CachingQueue and a SQLQueue that would
handle many of the issues of memory and persistence for large-scale crawls.
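To sketch what I mean (class and method names here are purely illustrative, not LARM's actual API -- a SQLQueue would implement the same contract backed by a database table):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Minimal queue abstraction so the storage strategy can vary.
abstract class MessageQueue<T> {
    abstract void put(T message);
    abstract T take();          // returns null when empty
    abstract int size();
}

// Simple in-memory variant; a CachingQueue could spill to disk and a
// SQLQueue could persist to a table, all behind the same interface.
class InMemoryQueue<T> extends MessageQueue<T> {
    private final Deque<T> items = new ArrayDeque<>();
    void put(T message) { items.addLast(message); }
    T take() { return items.pollFirst(); }
    int size() { return items.size(); }
}

public class QueueDemo {
    public static void main(String[] args) {
        MessageQueue<String> q = new InMemoryQueue<>();
        q.put("http://example.com/");
        q.put("http://example.com/page2");
        System.out.println(q.take()); // prints http://example.com/
        System.out.println(q.size()); // prints 1
    }
}
```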

2) Extensible priority queue. You talked about limiting the number of
threads that access one host at a time, but this might fly in the face of
the URL reordering concept that you write about later. If the ordering were
somehow an interface with different implementations, this might be more flexible.
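A rough sketch of how the two policies could coexist as separate, swappable concerns (all names here are made up for illustration): the frontier orders URLs by a pluggable Comparator, while a per-host counter enforces politeness on top of it.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class CrawlTask {
    final String url;
    final String host;
    final int priority;          // lower value = fetch sooner
    CrawlTask(String url, String host, int priority) {
        this.url = url; this.host = host; this.priority = priority;
    }
}

class Frontier {
    private final PriorityQueue<CrawlTask> queue =
        new PriorityQueue<>(Comparator.comparingInt(t -> t.priority));
    private final Map<String, Integer> activePerHost = new HashMap<>();
    private final int maxPerHost;

    Frontier(int maxPerHost) { this.maxPerHost = maxPerHost; }

    void add(CrawlTask t) { queue.add(t); }

    // Return the best-priority task whose host is under the limit.
    CrawlTask next() {
        List<CrawlTask> deferred = new ArrayList<>();
        CrawlTask chosen;
        while ((chosen = queue.poll()) != null) {
            if (activePerHost.getOrDefault(chosen.host, 0) < maxPerHost) break;
            deferred.add(chosen);    // host is busy; set this task aside
        }
        queue.addAll(deferred);      // put deferred tasks back
        if (chosen != null) activePerHost.merge(chosen.host, 1, Integer::sum);
        return chosen;
    }

    void done(CrawlTask t) { activePerHost.merge(t.host, -1, Integer::sum); }
}

public class FrontierDemo {
    public static void main(String[] args) {
        Frontier f = new Frontier(1);   // at most one active fetch per host
        f.add(new CrawlTask("http://a.example/1", "a.example", 1));
        f.add(new CrawlTask("http://a.example/2", "a.example", 2));
        f.add(new CrawlTask("http://b.example/1", "b.example", 3));
        System.out.println(f.next().url);  // best priority: a.example/1
        System.out.println(f.next().url);  // a.example busy, so b.example/1
    }
}
```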

3) Distribution does seem like a problem to be solved (but my guess is in
the longer term). With a distributed system, it seems like it would be best to
have as little communication as possible between the different units. One
thought, as you stated, would be to partition up the work. My only addition
would be to make it possible to do this at a domain level and not just a
directory level.
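For example, each crawler node could own every host whose hash maps to it, which needs almost no coordination between nodes. The node count and hashing scheme below are just assumptions for illustration:

```java
import java.net.MalformedURLException;
import java.net.URL;

public class Partitioner {
    // Map a URL to a crawler node by hashing its host, so all URLs on
    // the same domain land on the same node.
    static int nodeFor(String url, int numNodes) {
        try {
            String host = new URL(url).getHost();
            return Math.floorMod(host.hashCode(), numNodes);  // floorMod keeps it non-negative
        } catch (MalformedURLException e) {
            return 0;   // illustrative fallback; real code would reject the URL
        }
    }

    public static void main(String[] args) {
        // Both URLs share a host, so both print the same node number.
        System.out.println(nodeFor("http://example.com/a", 4));
        System.out.println(nodeFor("http://example.com/b", 4));
    }
}
```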

4) Potentially adding a receiving pipeline. You have talked about this as a
storage pipeline, but I don't think it should be connected to storage. For
example, I think that processing should occur first and the result then go
to storage, whether that storage is file-system or SQL based. The storage
should not be tied to the post-processing. Also, the link parsing should be
part of this processing and not the fetching. This might also make it more
scalable, since you could distribute the load better.
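Something along these lines (the step names and the naive link scan are only placeholders; storage is just one more step that knows nothing about what processing ran before it):

```java
import java.util.ArrayList;
import java.util.List;

class Page {
    final String url, content;
    final List<String> links = new ArrayList<>();
    Page(String url, String content) { this.url = url; this.content = content; }
}

// Every stage of the receiving pipeline implements the same interface.
interface PageStep {
    void process(Page page);
}

class LinkExtractor implements PageStep {
    public void process(Page page) {
        // Naive href scan standing in for a real HTML parser.
        int i = page.content.indexOf("href=\"");
        while (i >= 0) {
            int end = page.content.indexOf('"', i + 6);
            page.links.add(page.content.substring(i + 6, end));
            i = page.content.indexOf("href=\"", end);
        }
    }
}

class FileStore implements PageStep {
    public void process(Page page) {
        System.out.println("storing " + page.url);  // stand-in for file/SQL storage
    }
}

public class PipelineDemo {
    public static void main(String[] args) {
        List<PageStep> pipeline = List.of(new LinkExtractor(), new FileStore());
        Page p = new Page("http://example.com/",
                          "<a href=\"http://example.com/x\">x</a>");
        for (PageStep step : pipeline) step.process(p);
        System.out.println(p.links);  // links found during processing
    }
}
```

Because fetching only hands pages into the pipeline, the processing steps (and the storage step) could run on different machines than the fetchers.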

5) Here are a few items that I see as potential bottlenecks. Are there any
others that you want to account for?
A) The time to connect to the site. (Network IO constraint)
B) The time to download the page. (Network and file system IO constraint)
C) Parsing the page. (CPU and Memory constraint)
D) Managing Multiple Threads (CPU constraint)
E) List of Visited links (Memory constraint)

6) Things I am going to try to find out from the code:

Overall class naming convention / architecture. Class Diagram.

Source types handled (HTTP, FTP, FILE, SQL?)

Authentication - How does LARM handle this, and what types are supported
(digest, SSL, form)?

Frames - Is there an encompassing reference file name, or is each frame an
individual file? What if you want to display the page?

Cookies and Headers - Support for cookies / HTTP headers

Javascript - How does it handle links made through JavaScript (error out,
ignore, or follow them)?

