lucene-dev mailing list archives

From Paul Hammant
Subject Re: Avalonized WebCrawler
Date Mon, 27 Jan 2003 23:32:00 GMT

Great work.  I sure hope the Lucene peeps can swallow (a little) pride 
and merge the best bits.  It is always difficult receiving a mountain of 

I look forward to using some of the components outside Lucene, and the 
whole thing inside Phoenix when you have it ready :-)))

- Paul H

> Lucene developers,
> This mail follows a few threads which took place 2-3 months ago on both 
> Lucene and Avalon lists:
> They were related to porting the WebCrawler app into a component based 
> application using Avalon. During the past few days, I did just that 
> and I will be happy to share the code with the community. There is 
> still a lot to do, but my goal was to contact you once the code reached 
> a similar level of development as the one in CVS. I did not contact 
> the list before because I wasn't sure where I was going :), and because 
> I do not have CVS access at Apache.
> You can download the code @
> Both the sources and binaries are present. On my local environment, I 
> use Maven as the build system. It isn't included in the download 
> because some of the jars I used are recent CVS snapshots not present on 
> the Maven remote location. If I am not mistaken, all the 
> required libraries are present in the zip file.
> Overall, the code behaves just like the present crawler hosted on the 
> Lucene Sandbox repository. Since I mostly did some refactoring on 
> this code base, it will be quite easy for the developer(s) to find out 
> what happens. All the comments, methods, etc. remain the same. I 
> only changed the most relevant parts. You will find the code divided 
> into 2 packages, the original package "de.lanlab.*" and the new one 
> "org.crawl.*". The reason behind this separation is that every time I 
> created a new component, I moved its code into the second package for 
> clarity.
> As the Avalon container, I chose to use Fortress. It is a stable and 
> almost released container (a matter of weeks). I am seriously thinking 
> about Merlin, but it is not a priority for now.
> Here is a list of the created components/services:
> fetcher-task-factory
> host-manager
> host-resolver
> url-message-factory
> web-document-factory
> message-handler
> message-listener-selector
>  . url-length-stage
>  . url-scope-stage
>  . robot-exclusion-stage
>  . url-visited-stage
>  . known-path-stage
>  . fetcher-stage
> storage-pipeline
> thread-monitor
> fetcher-thread-factory
> server-thread-factory
> url-normalizer
> url-visited-manager
> one more to appear: thread-pool-manager
> Configuration:
> At this time, every config property is hard-coded in the component 
> class. It will be a fast and easy task to integrate the config file 
> because the components already implement the Avalon configuration 
> lifecycle.
> Logging:
> I had a hard time using the Fortress logging service. For now, only two 
> loggers are working, one for the Fortress system, the other for the 
> crawler. Once I understand where the logging issue is coming from, 
> each component could have its own logger without any code changes.
> Integration:
> Fortress can easily be plugged into any type of environment or run as a 
> standalone application. I am planning to write a Phoenix block soon.
> Client connection:
> The current Observer service will change completely. Instead of 
> printing information to the console, it will export some sort of 
> application state descriptor object via AltRMI, or anything else. It 
> will be up to the client to render that information.
> Speed:
> When running the current code against the Avalonized one, I get very 
> similar speed results. The only difference is that it takes somewhat 
> longer for the new one to reach a stable speed (about 15 seconds).
> Avalon:
> I kept my use of Avalon simple. For now, I didn't want to 
> use all the tools available. There are a few areas where Avalon could 
> provide more functionality:
> - the lifestyle handler (both in Fortress and Merlin), which could 
> replace the usage of factories for example.
> - the thread library, because I didn't want to change any of the 
> current code.
> - the event library, which will reinforce a SEDA architecture.
> Javadocs:
> None are new; I kept the existing ones. I will describe every 
> service in more detail soon, when I finish all the refactoring.
> Lucene:
> I think Lucene should be separated from the crawler. One could easily 
> write a service which will schedule crawling processes and export the 
> results. Then, this service could use those results to create/update a 
> Lucene index.
> Future:
> I am committed to pursuing the development of the crawler. I hope many 
> current and future developers will follow me. With your consent, I 
> would like to move this project to SourceForge, but all opinions are 
> welcome.
> David
> -- 
> To unsubscribe, e-mail:   
> <>
> For additional commands, e-mail: 
> <>
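
The stage names in the quoted component list (url-length-stage, url-scope-stage, url-visited-stage, ...) read like a chain of filters that each candidate URL passes through before it is fetched. A minimal sketch of that idea follows. It is only an illustration: the class, interface, and method names (UrlPipelineSketch, Stage, maxLength, inScope, notYetVisited, runAll), the example.org URLs, and the length limit are all hypothetical and are not taken from the crawler or from Avalon.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: hypothetical types, not the crawler's or Avalon's API.
final class UrlPipelineSketch {

    /** One filtering step; returning false drops the URL. */
    interface Stage {
        boolean accept(String url);
    }

    /** Drops URLs longer than the given limit (compare url-length-stage). */
    static Stage maxLength(int limit) {
        return url -> url.length() <= limit;
    }

    /** Keeps only URLs under a given prefix (compare url-scope-stage). */
    static Stage inScope(String prefix) {
        return url -> url.startsWith(prefix);
    }

    /** Drops URLs already seen by this stage (compare url-visited-stage). */
    static Stage notYetVisited() {
        Set<String> seen = new HashSet<>();
        // Set.add returns false when the element was already present.
        return url -> seen.add(url);
    }

    /** Runs the stages in order and stops at the first one that rejects the URL. */
    static boolean runAll(List<Stage> stages, String url) {
        for (Stage stage : stages) {
            if (!stage.accept(url)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Stage> stages = Arrays.asList(
                maxLength(60),                      // assumed limit, for illustration
                inScope("http://example.org/"),     // assumed crawl scope
                notYetVisited());

        String[] candidates = {
            "http://example.org/index.html",
            "http://example.org/index.html",        // repeat of the first URL
            "http://other.example/index.html"       // outside the assumed scope
        };
        for (String url : candidates) {
            System.out.println(url + " -> " + (runAll(stages, url) ? "fetch" : "drop"));
        }
    }
}
```

Because the visited check comes last in this sketch, only URLs that passed the earlier stages are recorded as seen. Whether the real crawler orders its stages this way is not stated in the message.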

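On the configuration point in the quoted message: a component that starts with a hard-coded property and implements a configuration callback can pick up a config file later without changes to its callers. The sketch below shows that pattern with stand-in types so that it runs on its own. SimpleConfigurable, UrlLengthLimit, the "max-length" key, and the default of 255 are hypothetical. The real Avalon lifecycle interface has a different signature, and the message does not show how the crawler components implement it.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: stand-in types, not Avalon's configuration interfaces.
final class ConfigLifecycleSketch {

    /** Stand-in for a container-driven configuration callback (hypothetical). */
    interface SimpleConfigurable {
        void configure(Map<String, String> settings);
    }

    /** A component whose property is hard-coded until the container configures it. */
    static final class UrlLengthLimit implements SimpleConfigurable {
        private int maxLength = 255; // hypothetical hard-coded default

        @Override
        public void configure(Map<String, String> settings) {
            // Override the default only when the setting is present.
            String value = settings.get("max-length");
            if (value != null) {
                maxLength = Integer.parseInt(value.trim());
            }
        }

        int maxLength() {
            return maxLength;
        }
    }

    public static void main(String[] args) {
        UrlLengthLimit component = new UrlLengthLimit();
        System.out.println("before configure: " + component.maxLength());

        // In a container these values would come from a config file.
        Map<String, String> settings = new HashMap<>();
        settings.put("max-length", "100");
        component.configure(settings);
        System.out.println("after configure: " + component.maxLength());
    }
}
```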