lucene-dev mailing list archives

From David Worms <da...@simpledesign.com>
Subject Avalonized WebCrawler
Date Mon, 27 Jan 2003 23:05:38 GMT

Lucene developers,

This mail follows a few threads which took place 2-3 months ago on both 
the Lucene and Avalon lists:

http://marc.theaimsgroup.com/?l=lucene-dev&m=101518595918785&w=2
http://marc.theaimsgroup.com/?l=avalon-users&m=103706452017829&w=2

They were related to porting the WebCrawler app into a component-based 
application using Avalon. During the past few days, I did just that, and 
I will be happy to share the code with the community. There is still a 
lot to do, but my goal was to contact you once the code reached a level 
of development similar to the one in CVS. I did not contact the list 
before because I wasn't sure where I was going :), and because I do not 
have CVS access at Apache.

You can download the code @ http://67.116.155.180/~wdavidw/crawler.zip

Both the sources and binaries are present. In my local environment, I 
use Maven as the build system. It isn't included in the download because 
some of the jars I used are recent CVS snapshots not present in the 
Maven remote repository (ibiblio.org). If I am not mistaken, all the 
required libraries are present in the zip file.

Overall, the code behaves just like the present crawler hosted in the 
Lucene Sandbox repository. Since I mostly did some refactoring on this 
code base, it will be quite easy for the developer(s) to find out what 
happens. All the comments, methods, etc. remain the same; I only 
changed the most relevant parts. You will find the code divided into 2 
packages, the original package "de.lanlab.*" and the new one 
"org.crawl.*". The reason behind this separation is that every time I 
created a new component, I moved its code into the second package for 
clarity.

As the Avalon container, I chose to use Fortress. It is a stable and 
almost-released container (a matter of weeks). I am seriously 
considering Merlin, but it is not a priority for now.

Here is a list of the created components/services:

fetcher-task-factory
host-manager
host-resolver
url-message-factory
web-document-factory
message-handler
message-listener-selector
  . url-length-stage
  . url-scope-stage
  . robot-exclusion-stage
  . url-visited-stage
  . known-path-stage
  . fetcher-stage
storage-pipeline
thread-monitor
fetcher-thread-factory
server-thread-factory
url-normalizer
url-visited-manager
one more to appear: thread-pool-manager
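
To give a feel for the component style, here is a minimal sketch of how 
one of the services above (the url-normalizer) could be structured in 
Avalon fashion: a work interface with a ROLE constant, plus a default 
implementation that the container selects and wires up. All names and 
the normalization rules below are illustrative, not the actual crawler 
classes.

```java
public class Main {
    // Avalon-style work interface: clients depend on this, never on the impl.
    interface UrlNormalizer {
        String ROLE = UrlNormalizer.class.getName();
        String normalize(String url);
    }

    // Default implementation the container would instantiate and manage.
    static class DefaultUrlNormalizer implements UrlNormalizer {
        public String normalize(String url) {
            String s = url.trim();
            // lower-case the scheme and host, leave the path untouched
            int schemeEnd = s.indexOf("://");
            if (schemeEnd > 0) {
                int pathStart = s.indexOf('/', schemeEnd + 3);
                String head = (pathStart < 0 ? s : s.substring(0, pathStart));
                String tail = (pathStart < 0 ? "" : s.substring(pathStart));
                s = head.toLowerCase() + tail;
            }
            // strip a single trailing slash
            if (s.endsWith("/")) s = s.substring(0, s.length() - 1);
            return s;
        }
    }

    public static void main(String[] args) {
        UrlNormalizer n = new DefaultUrlNormalizer();
        System.out.println(n.normalize("HTTP://Example.COM/Path/"));
        // prints: http://example.com/Path
    }
}
```

The point of the ROLE constant is that the container can look the 
component up by its interface name, so implementations stay swappable.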

Configuration:
At this time, every config property is hard-coded in the component 
class. It will be a fast and easy task to integrate the config file 
because the components already implement the Avalon configuration 
lifecycle.
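
Moving a hard-coded property into the configure() hook looks roughly 
like this sketch. The Configuration interface below is a simplified, 
hypothetical stand-in for Avalon's own Configuration type so the sketch 
is self-contained, and the component and property names are invented:

```java
import java.util.HashMap;
import java.util.Map;

public class Main {
    // Simplified stand-in for Avalon's Configuration interface (hypothetical).
    interface Configuration {
        String getValue(String key, String defaultValue);
    }

    // Component whose hard-coded properties move into the configure() hook,
    // mirroring the Avalon configuration lifecycle step.
    static class FetcherStage {
        private int maxDocSize = 500000;       // previously hard-coded
        private String userAgent = "crawler";  // previously hard-coded

        // Called once by the container before the component is used.
        void configure(Configuration conf) {
            maxDocSize = Integer.parseInt(
                conf.getValue("max-doc-size", String.valueOf(maxDocSize)));
            userAgent = conf.getValue("user-agent", userAgent);
        }

        int getMaxDocSize() { return maxDocSize; }
        String getUserAgent() { return userAgent; }
    }

    public static void main(String[] args) {
        Map<String, String> props = new HashMap<>();
        props.put("max-doc-size", "1000000");
        Configuration conf = (key, dflt) -> props.getOrDefault(key, dflt);

        FetcherStage stage = new FetcherStage();
        stage.configure(conf);
        System.out.println(stage.getMaxDocSize() + " " + stage.getUserAgent());
        // prints: 1000000 crawler
    }
}
```

Unset properties keep their defaults, so the config file can stay 
minimal.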

Logging:
I had a hard time using the Fortress logging service. For now, only two 
loggers are working, one for the Fortress system, the other for the 
crawler. Once I understand where the logging issue is coming from, 
each component could have its own logger without any code changes.
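
The per-component pattern would look like the sketch below. The Logger 
and LogEnabled interfaces are simplified stand-ins for Avalon's 
equivalents, and HostResolver is only an illustrative component name:

```java
public class Main {
    // Simplified stand-ins for Avalon's Logger / LogEnabled contract.
    interface Logger {
        void info(String msg);
        Logger getChildLogger(String name);
    }

    interface LogEnabled {
        void enableLogging(Logger logger);
    }

    // The component never builds its own logger; the container hands it
    // a child logger, so per-component categories need no code changes.
    static class HostResolver implements LogEnabled {
        private Logger logger;
        public void enableLogging(Logger logger) { this.logger = logger; }
        void resolve(String host) { logger.info("resolving " + host); }
    }

    // Minimal console logger that prefixes messages with its category.
    static Logger consoleLogger(final String category) {
        return new Logger() {
            public void info(String msg) {
                System.out.println("[" + category + "] " + msg);
            }
            public Logger getChildLogger(String name) {
                return consoleLogger(category + "." + name);
            }
        };
    }

    public static void main(String[] args) {
        Logger crawlerLogger = consoleLogger("crawler");
        HostResolver resolver = new HostResolver();
        resolver.enableLogging(crawlerLogger.getChildLogger("host-resolver"));
        resolver.resolve("jakarta.apache.org");
        // prints: [crawler.host-resolver] resolving jakarta.apache.org
    }
}
```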

Integration:
Fortress can easily be plugged into any type of environment or run as a 
standalone application. I am planning to write a Phoenix block soon.

Client connection:
The current Observer service will change completely. Instead of printing 
information to the console, it will export some sort of application 
state descriptor object via AltRMI, or anything else. It will be up to 
the client to render that information.
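
One possible shape for such a descriptor, with hypothetical field names: 
a small serializable bean the server exports and the client renders 
however it likes, so it could travel over AltRMI or any other transport.

```java
import java.io.Serializable;

public class Main {
    // Hypothetical state descriptor: a plain serializable bean, so the
    // transport (AltRMI or otherwise) stays interchangeable.
    static class CrawlerState implements Serializable {
        final long urlsFetched;
        final long urlsQueued;
        final double docsPerSecond;

        CrawlerState(long fetched, long queued, double rate) {
            this.urlsFetched = fetched;
            this.urlsQueued = queued;
            this.docsPerSecond = rate;
        }

        // Rendering is the client's job; toString() is just one option.
        public String toString() {
            return "fetched=" + urlsFetched + " queued=" + urlsQueued
                 + " rate=" + docsPerSecond + " docs/s";
        }
    }

    public static void main(String[] args) {
        CrawlerState state = new CrawlerState(1200, 4500, 14.5);
        System.out.println(state);
        // prints: fetched=1200 queued=4500 rate=14.5 docs/s
    }
}
```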

Speed:
When running the current code against the Avalonized one, I get very 
similar speed results. The only difference is that it takes somewhat 
longer for the new one to reach a stable speed (about 15 seconds).

Avalon:
I kept to a simplistic use of Avalon. For now, I didn't want to use 
all the tools available. There are a few domains where Avalon could 
provide more functionality:
- the lifestyle handler (both in Fortress and Merlin), which could 
replace the usage of factories, for example;
- the thread library, because I didn't want to change any of the 
current code;
- the event library, which would reinforce a SEDA architecture.
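
The SEDA idea behind the stages listed earlier (url-length, url-scope, 
and so on) can be sketched with nothing but queues. This toy example is 
not the Avalon event library; it only shows the shape: each stage pulls 
messages from its own queue and pushes survivors to the next stage's 
queue. The 40-character limit and the EOF marker are made up for the 
example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class Main {
    // Predicate the toy url-length stage applies (limit is arbitrary here).
    static boolean withinLength(String url) {
        return url.length() <= 40;
    }

    public static void main(String[] args) throws InterruptedException {
        // Two stages connected by queues. Real SEDA adds per-stage thread
        // pools and load shedding; this only demonstrates the plumbing.
        BlockingQueue<String> incoming = new LinkedBlockingQueue<>();
        BlockingQueue<String> accepted = new LinkedBlockingQueue<>();

        // Stage 1: url-length filter running on its own thread.
        Thread lengthStage = new Thread(() -> {
            try {
                String url;
                while (!(url = incoming.take()).equals("EOF")) {
                    if (withinLength(url)) accepted.put(url);
                }
                accepted.put("EOF");  // propagate shutdown downstream
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        lengthStage.start();

        incoming.put("http://example.org/ok");
        incoming.put("http://example.org/" + "x".repeat(100)); // dropped
        incoming.put("EOF");

        // Stage 2 (run inline here): collect what survived.
        List<String> out = new ArrayList<>();
        String url;
        while (!(url = accepted.take()).equals("EOF")) out.add(url);
        lengthStage.join();
        System.out.println(out);
        // prints: [http://example.org/ok]
    }
}
```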

Javadocs:
None; I kept the ones present in the past. I will describe every 
service in more detail soon, when I finish with all the refactoring.

Lucene:
I think Lucene should be separated from the crawler. One could easily 
write a service which would schedule crawling processes and export the 
results. A separate service could then use those results to create or 
update a Lucene index.
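
One way to cut the dependency, sketched with hypothetical names: the 
crawler core only emits results through a handler interface, and a 
Lucene-backed indexer becomes just one handler among others, living in 
its own service. The collecting handler below stands in for a real 
indexer so the sketch stays self-contained:

```java
import java.util.ArrayList;
import java.util.List;

public class Main {
    // Hypothetical seam: the crawler depends only on this interface,
    // never on Lucene itself.
    interface CrawlResultHandler {
        void handle(String url, String content);
    }

    // A Lucene-backed handler would build Documents from the results in a
    // separate service; here we just collect URLs to show the seam.
    static class CollectingHandler implements CrawlResultHandler {
        final List<String> indexed = new ArrayList<>();
        public void handle(String url, String content) {
            indexed.add(url);  // a real impl would call writer.addDocument(...)
        }
    }

    public static void main(String[] args) {
        CollectingHandler handler = new CollectingHandler();
        // the crawler core only ever calls handler.handle(...)
        handler.handle("http://jakarta.apache.org/lucene/", "Lucene home page");
        System.out.println(handler.indexed);
        // prints: [http://jakarta.apache.org/lucene/]
    }
}
```

With this seam in place, swapping the index backend, or exporting raw 
results without indexing at all, requires no change to the crawler.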

Future:
I am committed to pursuing the development of the crawler. I hope many 
current and future developers will join me. With your consent, I 
would likely move this project to SourceForge, but all opinions are 
welcome.

David


--
To unsubscribe, e-mail:   <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>

