lucene-dev mailing list archives

From "Clemens Marschner" <c...@lanlab.de>
Subject Re: Avalonized WebCrawler
Date Tue, 28 Jan 2003 11:24:05 GMT
Great news, this will push us forward!

Will have a look at it immediately (after breakfast, of course! :-)

Clemens

----- Original Message -----
From: "Otis Gospodnetic" <otis_gospodnetic@yahoo.com>
To: "Lucene Developers List" <lucene-dev@jakarta.apache.org>; "Avalon
framework users" <avalon-users@jakarta.apache.org>
Sent: Tuesday, January 28, 2003 12:55 AM
Subject: Re: Avalonized WebCrawler


> Oh, no need to swallow any pride - some of us have been meaning to do
> this.....when we have more time...hah.
> So just a big thank you from us!
>
> Otis
>
>
> --- Paul Hammant <Paul_Hammant@yahoo.com> wrote:
> > David,
> >
> > Great work.  I sure hope the Lucene peeps can swallow (a little) pride
> > and merge the best bits.  It is always difficult receiving a mountain
> > of changes...
> >
> > I look forward to using some of the components outside Lucene, and the
> > whole thing inside Phoenix when you have it ready :-)))
> >
> > - Paul H
> > (hammant@apache)
> >
> > >
> > > Lucene developers,
> > >
> > > This mail follows a few threads that took place 2-3 months ago on
> > > both the Lucene and Avalon lists:
> > >
> > > http://marc.theaimsgroup.com/?l=lucene-dev&m=101518595918785&w=2
> > > http://marc.theaimsgroup.com/?l=avalon-users&m=103706452017829&w=2
> > >
> > > They were related to porting the WebCrawler app into a
> > > component-based application using Avalon. During the past few days,
> > > I did just that, and I will be happy to share the code with the
> > > community. There is still a lot to do, but my goal was to contact
> > > you once the code reached a level of development similar to the one
> > > in CVS. I did not contact the list before because I wasn't sure
> > > where I was going :), and because I do not have CVS access at
> > > Apache.
> > >
> > > You can download the code at
> > > http://67.116.155.180/~wdavidw/crawler.zip
> > >
> > > Both the sources and binaries are present. In my local environment,
> > > I use Maven as the build system. It isn't included in the download
> > > because some of the jars I used are recent CVS snapshots not present
> > > in the remote Maven repository (ibiblio.org). If I am not mistaken,
> > > all the required libraries are present in the zip file.
> > >
> > > Overall, the code behaves just like the present crawler hosted in
> > > the Lucene Sandbox repository. Since I mostly refactored this code
> > > base, it will be quite easy for the developer(s) to find out what
> > > happens. All the comments, methods, etc. remain the same; I only
> > > changed the most relevant parts. You will find the code divided
> > > into 2 packages, the original package "de.lanlab.*" and the new one
> > > "org.crawl.*". The reason behind this separation is that every time
> > > I created a new component, I moved its code into the second package
> > > for clarity.
> > >
> > > As the Avalon container, I chose to use Fortress. It is stable and
> > > almost released (a matter of weeks). I am seriously thinking about
> > > Merlin, but it is not a priority for now.
> > >
> > > Here is a list of the created components/services:
> > >
> > > fetcher-task-factory
> > > host-manager
> > > host-resolver
> > > url-message-factory
> > > web-document-factory
> > > message-handler
> > > message-listener-selector
> > >  . url-length-stage
> > >  . url-scope-stage
> > >  . robot-exclusion-stage
> > >  . url-visited-stage
> > >  . known-path-stage
> > >  . fetcher-stage
> > > storage-pipeline
> > > thread-monitor
> > > fetcher-thread-factory
> > > server-thread-factory
> > > url-normalizer
> > > url-visited-manager
> > > one more to appear: thread-pool-manager
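
The stage entries listed under message-listener-selector suggest a chain in which each stage either passes a URL on or drops it. A minimal sketch of that idea in plain Java follows. The `Stage` interface, the class names, and the length limit are assumptions made for illustration; they are not taken from the crawler sources and do not use the Avalon API.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StagePipelineSketch {

    /** Hypothetical stage contract: return the URL to pass it on, or null to drop it. */
    interface Stage {
        String handle(String url);
    }

    /** Drops URLs longer than a fixed limit (stands in for url-length-stage). */
    static class UrlLengthStage implements Stage {
        private final int maxLength;

        UrlLengthStage(int maxLength) {
            this.maxLength = maxLength;
        }

        public String handle(String url) {
            return url.length() <= maxLength ? url : null;
        }
    }

    /** Drops URLs that were already seen (stands in for url-visited-stage). */
    static class UrlVisitedStage implements Stage {
        private final Set<String> visited = new HashSet<>();

        public String handle(String url) {
            // Set.add returns false when the URL was already present.
            return visited.add(url) ? url : null;
        }
    }

    /** Runs each URL through the stages in order; a null result stops the chain. */
    static List<String> run(List<Stage> stages, List<String> urls) {
        List<String> accepted = new ArrayList<>();
        for (String url : urls) {
            String current = url;
            for (Stage stage : stages) {
                current = stage.handle(current);
                if (current == null) {
                    break;
                }
            }
            if (current != null) {
                accepted.add(current);
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        List<Stage> stages = new ArrayList<>();
        stages.add(new UrlLengthStage(30)); // assumed limit, for the demo only
        stages.add(new UrlVisitedStage());

        List<String> urls = new ArrayList<>();
        urls.add("http://example.org/");
        urls.add("http://example.org/"); // duplicate, dropped by the visited stage
        urls.add("http://example.org/a/very/long/path/that/exceeds/the/limit");

        System.out.println(run(stages, urls)); // prints [http://example.org/]
    }
}
```

Keeping every stage behind one small interface is what would let a container wire the stages up by name and reorder them without touching their code.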
> > >
> > > Configuration:
> > > At this time, every config property is hard-coded in the component
> > > classes. It will be a fast and easy task to integrate the config file
> > > because the components already implement the Avalon configuration
> > > lifecycle.
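
To illustrate the move from hard-coded properties to a config file, here is a small sketch in which the hard-coded value becomes a default that a configure step can override. `SimpleConfig` is a hypothetical stand-in for the configuration object a container would supply; it is not the Avalon `Configuration` interface, and the property name `max-length` and the default of 255 are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

public class ConfigSketch {

    /** Hypothetical stand-in for a container-supplied configuration object. */
    static class SimpleConfig {
        private final Map<String, String> values = new HashMap<>();

        SimpleConfig set(String key, String value) {
            values.put(key, value);
            return this;
        }

        /** Returns the configured integer, or the default when the key is absent. */
        int getInt(String key, int defaultValue) {
            String raw = values.get(key);
            return raw == null ? defaultValue : Integer.parseInt(raw);
        }
    }

    /** Hypothetical component; the property name and default are assumptions. */
    static class UrlLengthStage {
        private int maxLength = 255; // the formerly hard-coded value serves as the default

        /** Called once by the container before the component is used. */
        void configure(SimpleConfig config) {
            maxLength = config.getInt("max-length", maxLength);
        }

        int getMaxLength() {
            return maxLength;
        }
    }

    public static void main(String[] args) {
        UrlLengthStage configured = new UrlLengthStage();
        configured.configure(new SimpleConfig().set("max-length", "512"));
        System.out.println(configured.getMaxLength()); // prints 512

        UrlLengthStage untouched = new UrlLengthStage();
        untouched.configure(new SimpleConfig());
        System.out.println(untouched.getMaxLength()); // prints 255
    }
}
```

Because the default stays in the class, a component keeps working when the config file omits the property.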
> > >
> > > Logging:
> > > I had a hard time using the Fortress logging service. For now, only
> > > two loggers are working, one for the Fortress system and the other
> > > for the crawler. Once I understand where the logging issue is coming
> > > from, each component could have its own logger without any code
> > > changes.
> > >
> > > Integration:
> > > Fortress can easily be plugged into any type of environment or run
> > > as a standalone application. I am planning to write a Phoenix block
> > > soon.
> > >
> > > Client connection:
> > > The current Observer service will change completely. Instead of
> > > printing information to the console, it will export some sort of
> > > application state descriptor object via AltRMI, or anything else.
> > > It will be up to the client to render that information.
> > >
> > > Speed:
> > > When running the current code against the Avalonized one, I get very
> > > similar speed results. The only difference is that it takes somewhat
> > > longer for the new one to reach a stable speed (about 15 seconds).
> > >
> > > Avalon:
> > > I kept my use of Avalon simple. For now, I didn't want to use all
> > > the tools available. There are a few areas where Avalon could
> > > provide more functionality:
> > > - the lifestyle handler (both in Fortress and Merlin), which could
> > > replace the usage of factories for example.
> > > - the thread library, which I have not used because I didn't want
> > > to change any of the current code.
> > > - the event library, which will reinforce a SEDA architecture.
> > >
> > > Javadocs:
> > > None are new; I kept the existing ones. I will describe every
> > > service in more detail soon, when I finish all the refactoring.
> > >
> > > Lucene:
> > > I think Lucene should be separated from the crawler. One could easily
> > > write a service which schedules the crawling process and exports the
> > > results. Then, this service could use those results to create/update
> > > a Lucene index.
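
One way to picture this separation is a crawl service that only knows about a result sink, with the indexer supplied from outside. The sketch below uses plain Java; `ResultSink`, `CrawlService`, and `RecordingSink` are hypothetical names, and the recording sink merely stands in for code that would write to a Lucene index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CrawlExportSketch {

    /** Hypothetical consumer of crawl results; an indexer would implement this. */
    interface ResultSink {
        void accept(String url, String content);
    }

    /** Hypothetical crawl service: it knows nothing about Lucene, only about the sink. */
    static class CrawlService {
        private final ResultSink sink;

        CrawlService(ResultSink sink) {
            this.sink = sink;
        }

        /** Stands in for a real crawl; each URL yields placeholder content. */
        void crawl(List<String> urls) {
            for (String url : urls) {
                sink.accept(url, "content of " + url);
            }
        }
    }

    /** Stand-in for an index writer; it only records the URLs it was handed. */
    static class RecordingSink implements ResultSink {
        final List<String> indexed = new ArrayList<>();

        public void accept(String url, String content) {
            indexed.add(url);
        }
    }

    public static void main(String[] args) {
        RecordingSink sink = new RecordingSink();
        new CrawlService(sink).crawl(
                Arrays.asList("http://example.org/a", "http://example.org/b"));
        System.out.println(sink.indexed.size()); // prints 2
    }
}
```

With this shape, swapping the recording sink for a Lucene-backed one would not require any change to the crawl service.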
> > >
> > > Future:
> > > I am committed to pursuing the development of the crawler. I hope
> > > many current and future developers will follow me. With your consent,
> > > I would like to move this project to SourceForge, but all opinions
> > > are welcome.
> > >
> > > David
> > >
> > >
> > > --
> > > To unsubscribe, e-mail:
> > > <mailto:avalon-users-unsubscribe@jakarta.apache.org>
> > > For additional commands, e-mail:
> > > <mailto:avalon-users-help@jakarta.apache.org>
> > >
> > >
> > >
> >
> >
> >
> > --
> > To unsubscribe, e-mail:
> > <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
> > For additional commands, e-mail:
> > <mailto:lucene-dev-help@jakarta.apache.org>
> >
>
>

