lucene-dev mailing list archives

From "Clemens Marschner" <>
Subject Re: Configuration RFC
Date Mon, 15 Jul 2002 11:51:51 GMT
> Below are some comments and questions from someone just getting into the
> crawling concepts, but trying to provide constructive ideas.

Very very good. That's the kind of discussion I wanted.

> 1) The MessageQueue system seems to be somewhat problematic because of
> memory issues. This seems like it should be an abstract class with a few
> potential options, including your CachingQueue, and a SQLQueue that would
> handle many of the issues of memory and persistence for large scale crawls.

Probably another misleading name. The MessageHandler is an active component
that runs in its own thread. You can only put messages into the handler, like
into a queue, but then they are processed by the message handler thread, which
runs at high priority. The CachingQueue is just a utility class that
implements an abstract queue holding most of its contents on secondary
storage. That means capacity is constrained only by the size of the hard
drive and the number of files that can be saved in a directory. Do you think
that's not enough?
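
The disk-spilling idea can be sketched roughly like this (a minimal
illustration, not the actual LARM CachingQueue; all class and method names
here are made up): keep only the head and tail of the queue in memory and
write full blocks of entries to serialized files in between.

```java
import java.io.*;
import java.util.ArrayDeque;

/**
 * Illustrative sketch of a disk-backed FIFO queue: only a bounded number
 * of entries live on the heap; full blocks are spilled to serialized
 * files, so capacity is limited by disk space, not memory.
 */
public class DiskBackedQueue {
    private final ArrayDeque<Object> head = new ArrayDeque<>(); // next to be removed
    private final ArrayDeque<Object> tail = new ArrayDeque<>(); // most recently inserted
    private final ArrayDeque<File> blocks = new ArrayDeque<>(); // spilled middle, oldest first
    private final File dir;
    private final int blockSize;
    private int spilled = 0; // number of entries currently on disk

    public DiskBackedQueue(File dir, int blockSize) {
        this.dir = dir;
        this.blockSize = blockSize;
        dir.mkdirs();
    }

    public void insert(Object o) {
        tail.addLast(o);
        if (tail.size() >= blockSize) spillTail();
    }

    /** Returns null when the queue is empty. */
    public Object remove() {
        if (head.isEmpty()) refill();
        return head.pollFirst();
    }

    public int size() {
        return head.size() + spilled + tail.size();
    }

    private void spillTail() {
        try {
            File f = File.createTempFile("block", ".ser", dir);
            try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(f))) {
                out.writeObject(new ArrayDeque<Object>(tail));
            }
            blocks.addLast(f);
            spilled += tail.size();
            tail.clear();
        } catch (IOException e) { throw new RuntimeException(e); }
    }

    @SuppressWarnings("unchecked")
    private void refill() {
        if (!blocks.isEmpty()) {
            File f = blocks.pollFirst();
            try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(f))) {
                head.addAll((ArrayDeque<Object>) in.readObject());
            } catch (IOException | ClassNotFoundException e) { throw new RuntimeException(e); }
            spilled -= head.size();
            f.delete();
        } else { // nothing on disk: drain the in-memory tail
            head.addAll(tail);
            tail.clear();
        }
    }
}
```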

> 2) Extensible Priority Queue. You were talking about limiting the number
> of threads that access one host at a time, but this might fly in the face of
> the URL reordering concept that you write about later. So if this were
> somehow an interface which had different options, this might be more
> flexible.

Could you be more specific on that?
At the moment there's an interface Queue:

    public interface Queue {
        public Object remove();
        public void insert(Object o);
        public void insertMultiple(Collection c);
        public int size();
    }
and a base class called TaskQueue which holds the tasks for the
ThreadPool. The Fetcher uses a FetcherTaskQueue (which extends TaskQueue);
it keeps a small CachingQueue for each server and serves the fetcher
threads in a round-robin manner.

                        Queue (i)
             ^                         ^
       CachingQueue                TaskQueue         <- used in larm.threads
             ^--------uses---- FetcherTaskQueue      <- used in larm.fetcher

You're right that this could be ameliorated. Do you have a suggestion?
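
For the record, the round-robin part could be sketched like this (an
illustration only; class and method names are invented, not the real
FetcherTaskQueue): one small queue per server, with servers rotated so
that no single host monopolizes the fetcher threads.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch of a per-server round-robin task queue. Invariant: a server is
 * in the rotation ring if and only if its queue is non-empty.
 */
public class RoundRobinTaskQueue {
    private final Map<String, Deque<String>> perServer = new HashMap<>();
    private final Deque<String> ring = new ArrayDeque<>(); // rotation order

    public synchronized void insert(String server, String url) {
        Deque<String> q = perServer.get(server);
        if (q == null) {
            q = new ArrayDeque<>();
            perServer.put(server, q);
            ring.addLast(server); // new server joins the rotation
        }
        q.addLast(url);
    }

    /** Removes one task, cycling over servers; null when empty. */
    public synchronized String remove() {
        String server = ring.pollFirst();
        if (server == null) return null;
        Deque<String> q = perServer.get(server);
        String url = q.pollFirst();
        if (q.isEmpty()) perServer.remove(server); // retire drained server
        else ring.addLast(server);                 // back of the rotation
        return url;
    }

    public synchronized int size() {
        int n = 0;
        for (Deque<String> q : perServer.values()) n += q.size();
        return n;
    }
}
```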

> 3) Distribution does seem like a problem to be solved (but my guess is in
> the longer term). With a distributed system, it seems like it would be best
> to have as little communication as possible between the different units. One
> thought would be, as you stated, to partition up the work. The only thought
> I have would be to be able to do this at a domain level and not just a
> directory level.

I could imagine other ways to partition it. E.g., take the hash value of each
URL and partition the space between Integer.MIN_VALUE and Integer.MAX_VALUE.
I think partitioning by host names can become highly imbalanced.
Regarding communication, you are completely right. This has to take place in
batch mode, with a few hundred to a thousand URLs at a time.
There's a paper on the WWW2002 CD-ROM ("Parallel Crawlers") that expresses
what I have been thinking about for two years now ;-|
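
The hash-space partitioning amounts to something like this (a sketch under
assumed names; the node count and class name are illustrative):

```java
/**
 * Sketch of hash-based URL partitioning: each crawler node owns an equal
 * slice of the 32-bit hash space, so load is balanced independently of
 * how URLs are distributed over hosts.
 */
public class UrlPartitioner {
    private final int nodes;

    public UrlPartitioner(int nodes) {
        this.nodes = nodes;
    }

    /** Which crawler node is responsible for this URL? */
    public int nodeFor(String url) {
        // floorMod keeps the result in [0, nodes) even for negative
        // hash codes, which span Integer.MIN_VALUE..Integer.MAX_VALUE.
        return Math.floorMod(url.hashCode(), nodes);
    }
}
```

A node would keep URLs it owns and batch the rest to the responsible peer,
which keeps inter-node communication to bulk transfers.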

> 5) Potentially adding a receiving pipeline. You have talked about this as a
> storage pipeline, but I don't think it should be connected to storage. For
> example, I think that processing should occur and then go to storage, e.g.
> a File System or SQL based storage. The storage should not be related to
> post processing.

Well, it doesn't have to be. The difference between the message queue and the
storage pipeline is that the former is triggered by the message handler and
contains URLMessages, while the latter is triggered by the fetcher threads and
contains WebDocuments. For some reason I felt they would be similar, so both
are derived from a "Message" class. In fact WebDocument is derived from
URLMessage, which is not totally correct, since URLMessage is used for links
and contains a referer URL, and a WebDocument is not a link. The right way
would probably be

                 Message
            ^             ^
          Link       (Web)Document

The term "Web" is only correct for HTTP documents, but since it could be any
kind of document, this could be left out as well.
The reason why documents are put into a storage pipeline and not something
like a storage queue is that it doesn't make sense to queue the high amount
of data only to store it again a short time later.
But there's no reason not to use the storage pipeline for processing. If you
want to store the raw data, load it again later, and process it, the only
thing to be defined is a "document source" that reads documents and puts
them into a processing/storage pipeline.
I see that we're talking about the right names here. Rename StoragePipeline
to ProcessingPipeline, and you're done. Storage is then maybe a special kind
of processing...?
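
That renaming could look roughly like this (a sketch, not LARM's actual
API; all names here are invented): a generic pipeline of steps, with
storage implemented as just one step among others.

```java
import java.util.ArrayList;
import java.util.List;

/** A pipeline step: transform or consume a document, null drops it. */
interface ProcessingStep {
    String process(String document);
}

/** Sketch of a ProcessingPipeline that runs steps in order. */
class ProcessingPipeline {
    private final List<ProcessingStep> steps = new ArrayList<>();

    public ProcessingPipeline add(ProcessingStep s) {
        steps.add(s);
        return this;
    }

    public String run(String document) {
        for (ProcessingStep s : steps) {
            document = s.process(document);
            if (document == null) break; // document was dropped
        }
        return document;
    }
}

/** Storage as "a special kind of processing": just another step. */
class StorageStep implements ProcessingStep {
    final List<String> stored = new ArrayList<>();

    public String process(String doc) {
        stored.add(doc); // a real implementation would write to disk/SQL
        return doc;
    }
}
```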

> Also, the link parsing should be part of this processing
> and not the fetching.

True, the document retrieval has to be separated from the processing part.
The method is completely bloated and has been subject to a lot of change.
At this time it is completely limited to HTTP URLs and doesn't do intelligent
MIME type processing, etc.
The reason why it was done within the fetcher threads is that the
implementation of the storage pipeline is pretty new, and that the scope I
had in mind was limited (large intranets). Parsing could just as well be done
within the processing pipeline.

> This might also make it more scalable since you could
> distribute the load better.

Why do you think so? I don't see a lot of difference there.
Let me think about it. If you're talking about the thread of control in which
the processing is done, I think this should be the fetcher threads, at least
optionally. This could be done right now, since the storages are reentrant
(or at least should be) and are called by the fetchers. The processing
pipeline should run in parallel to take advantage of additional processors
(I mean real multiprocessor machines), if there are any.
On the other hand, the fetcher threads have to be able to get a bunch of
URLMessages, fetch them, and put them into the processing pipeline all at
once, to reduce synchronization. At this time documents are put into the
storage pipeline one after the other, which has to be synchronized in two
places: where the actual document is stored and where the links are put back
into the message handler. That's too much, and probably one of the main
reasons why the crawler doesn't scale well to 100 threads or more.
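
The batching idea is simple to sketch (illustrative names only, not LARM
code): each fetcher thread collects results locally and hands them to the
shared structure in one synchronized call, so the lock is taken once per
batch instead of once per document or link.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

/** Shared sink: one lock acquisition per batch instead of per item. */
class BatchingSink {
    private final List<String> shared = new ArrayList<>();
    private int handoffs = 0;

    public synchronized void putAll(Collection<String> batch) {
        shared.addAll(batch);
        handoffs++;
    }

    public synchronized int size() { return shared.size(); }

    public synchronized int handoffs() { return handoffs; }
}

/** A fetcher thread's local buffer: the hot path takes no lock at all. */
class FetcherWorker {
    private final List<String> local = new ArrayList<>();
    private final BatchingSink sink;
    private final int batchSize;

    FetcherWorker(BatchingSink sink, int batchSize) {
        this.sink = sink;
        this.batchSize = batchSize;
    }

    void emit(String link) {
        local.add(link); // unsynchronized, thread-local
        if (local.size() >= batchSize) flush();
    }

    void flush() {
        if (!local.isEmpty()) {
            sink.putAll(local);
            local.clear();
        }
    }
}
```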

> 5) Here are a few items that I see as potential bottlenecks. Are there
> others that you want to account for?
> A) The time to connect to the site. (Network IO constraint)
> B) The time to download the page. (Network and file system IO constraint)
> C) Parsing the page. (CPU and Memory constraint)
> D) Managing Multiple Threads (CPU constraint)
> E) List of Visited links (Memory constraint)

F) storing the page is also a constraint (network/file system IO),
especially with SQL databases or when you use the current LuceneStorage
(Otis knows what I'm talking about...).
G) logging (IO constraint). It's buffered now and flushed every second time
the ThreadMonitor wakes up (every 10 seconds at this time). Besides, the
SimpleLogger class is not thread safe, so logging is done in only one thread
whenever possible. I have used Log4J before, but never in such performance-
critical situations, so I can't tell how it would behave here.
Since logging is done so extensively, it can become a performance killer.
E.g., each time a date is logged, a new Date object is created. I don't know
whether Log4J reuses its objects well. I mostly turned date creation off
because it flooded the heap.
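
One cheap way around the per-entry Date allocation is to cache the
formatted timestamp and rebuild it only when the current second changes
(a sketch, not the real SimpleLogger; names are illustrative):

```java
import java.text.SimpleDateFormat;
import java.util.Date;

/**
 * Caches the formatted timestamp at one-second granularity, so a busy
 * logger does not allocate one Date and one String per log entry.
 * Deliberately not thread safe, like the single-threaded logger it
 * illustrates.
 */
class CachedTimestamp {
    private final SimpleDateFormat fmt =
            new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
    private long lastSecond = -1;
    private String cached = "";

    public String now() {
        long second = System.currentTimeMillis() / 1000;
        if (second != lastSecond) { // at most one allocation per second
            lastSecond = second;
            cached = fmt.format(new Date(second * 1000));
        }
        return cached;
    }
}
```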

If you want to see another source of heap drain, look at how
url.toExternalForm() works...

> 6) Things I am going to try to find out from the code:
> Overall class naming convention / architecture. Class Diagram.

Yep; can you reverse engineer that from the code?
I think the threads, storage, and parser packages are OK. The thread pool
was influenced by an article in "Java-Magazin" two years ago.
The fetcher package should probably be broken up somehow.
The util package is really bad, since it should not have any relationships
with the other packages; especially, the WebDocument needs to be moved out
of there. The gui package is outdated, and the graph package just contains a
first attempt to build a graph database of the links crawled.
I've written a much better one recently, one that compresses URLs and can
be saved and reloaded. Unfortunately, it needs a sorted list of URLs as
its input and thus cannot be used by the URLVisitedFilter. The only way I
see to circumvent constraint E) is using a BTree, but that would impose
great new IO constraints, since this class is used so often.

Other conventions: the source code is, although not intended, formatted
almost in concordance with the Avalon conventions, which I like very
much (I really hate it when opening brackets are not on their own line, so
don't make me change that :-). I use "this." instead of "m_" for members,
although I was used to that in my MFC times.

> Source types handled (HTTP, FTP, FILE, SQL?)

The only part that is limited to HTTP at this time is the FetcherTask. I had
so many problems with URLConnection that I had to replace it with
HTTPClient; but that client doesn't support other protocols either.

> Authentication - How does LARM handle this, what types are supported
> (digest, ssl, form)

Could be handled by HTTPClient (I think all of it), but I haven't tried it
out.

> Frames - Is there an encompassing reference file name or is each file
> individual? What if you want to display the page?

There's no unambiguous way to determine which frame a file was put into,
since it can be linked from a lot of sources. The links.log contains the
link type (0=normal link, 1=frame src, 2=redirect) for each link recorded.
If you want to get the frame URL but index the real contents, you have to do
an analysis of the overall graph structure, e.g. take the inbound frame link
with the highest PageRank, etc.

> Cookies and Headers - Support for cookies / HTTP headers

All supported by HTTPClient. I think cookies are saved and resent
automatically. Only a subset of the received HTTP headers is written to the
logs at this time. If the cookie storage becomes a problem, you can also
change the internal CookieModule to cater to your needs.
I personally think HTTPClient is designed very well; I have used it for more
than half a year now and it simply works. That's why I don't want to spend
my time moving to the Apache HTTP library, although it may work just as
well. If you wonder about the strange way it is treated within the build
process: I had to adapt a couple of methods of HTTPClient before I was told
to put it in the Apache CVS.

> Javascript - How does it handle links made through javascript (error out,
> ignore, handle them?)

No, and I don't see a solution for handling this in all but the most
trivial cases (document.location.href=...). Such links will lead to a
malformed URL at this time and be filtered out.

Last thing: many of the utility classes have a test method in their main()
method. This has to be moved out into JUnit test cases. I think Mehran
wanted to do something there.

