lucene-dev mailing list archives

From Peter Carlson <>
Subject Re: Configuration RFC
Date Sat, 13 Jul 2002 19:37:34 GMT

This looks very exciting. I have not had a chance to look at your LARM code,
but this overview is fairly informative.

Question 1.

I think you may have mentioned it, but how do the Sources fit in?

For example, one of the goals I am trying to achieve is to take a URL
(xyz.html) and then change the URL based on some pattern: change xyz.html
to xyz.xml and fetch the .xml as a request.
I think you mentioned below that the source would send another request
message to get the data and then eat the message if it didn't also want to
get the .html file? Or would this be the message handler?

Also, would the source be able to say, I want to get all files which meet
this pattern and blindly attempt to get the set of files with changing
parameters? A message generator?

I know that these are specifics, but I would like to know how these fit into
the architecture? It seems like there could be a URL/message generator which
put potential URLs into the queue and if they didn't exist then it would
just log that. Is this what you have architected?
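
To make the question concrete, here is a rough sketch of what I mean by a URL/message generator. It is only an illustration: the class name `PatternUrlSource` is hypothetical, and a plain `Queue<String>` stands in for the crawler's message queue, which I have not looked at.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.regex.Pattern;

// Hypothetical sketch, not a LARM class. A plain Queue<String> stands in
// for the crawler's message queue.
class PatternUrlSource {
    private final Pattern pattern;
    private final String replacement;
    private final Queue<String> queue;

    PatternUrlSource(String regex, String replacement, Queue<String> queue) {
        this.pattern = Pattern.compile(regex);
        this.replacement = replacement;
        this.queue = queue;
    }

    // Derives a candidate URL from a seen URL and enqueues it.
    // Returns false (and enqueues nothing) if the pattern did not apply.
    boolean offer(String seenUrl) {
        String candidate = pattern.matcher(seenUrl).replaceFirst(replacement);
        if (candidate.equals(seenUrl)) {
            return false;
        }
        return queue.add(candidate);
    }
}

class PatternUrlSourceDemo {
    public static void main(String[] args) {
        Queue<String> queue = new ArrayDeque<String>();
        PatternUrlSource source = new PatternUrlSource("\\.html$", ".xml", queue);
        source.offer("http://example.com/xyz.html");
        System.out.println(queue); // prints [http://example.com/xyz.xml]
    }
}
```

Whether the generated URL actually exists would only be known at fetch time, which is where the logging would come in.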

Question 2.

Also, should there be a layer for post-processing, or is that a filter? So if
you got an XML file and wanted to transform it, could you use a filter?

This sourcing might also work really well with the Cocoon project.

Question 3.
Is there any built in support for only getting new files (or updating
changed URLs) or is that the job of the storage layer?

What would be the unique ID for each retrieved document in the storage: the
URL, or the dynamic URL? The same URL could have different data.

Question 4.
URLScopeFilter - Is this just a wildcard based system, or does it also
handle full regex?
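
For what it is worth, a wildcard syntax could be layered on top of regex fairly cheaply. The sketch below is only an assumption about how that might look: `ScopePatterns` is a hypothetical helper, not something in LARM, and it uses `java.util.regex`, which may not be the regex package the crawler uses.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not a LARM class: translates a '*' / '?' wildcard
// into an equivalent regular expression.
class ScopePatterns {
    static Pattern fromWildcard(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");   // any run of characters
            } else if (c == '?') {
                regex.append('.');    // exactly one character
            } else {
                // Quote everything else so '.' or '+' in a URL stay literal.
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    // A URL is in scope only if the whole URL matches the pattern.
    static boolean inScope(Pattern scope, String url) {
        return scope.matcher(url).matches();
    }
}

class ScopePatternsDemo {
    public static void main(String[] args) {
        Pattern scope = ScopePatterns.fromWildcard("http://*myHost*");
        System.out.println(ScopePatterns.inScope(scope, "http://www.myHost.com/index.html"));
    }
}
```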

Question 5.
How do you define the pipeline? Right now you have Standardpipeline, but I
don't see a configuration for what the standardpipeline is.

Question 6.
Content validator. Is there anywhere one could program in a rule such as:
only the date has changed on this web page, so I don't want to consider it
changed or update it?

Question 7.
Notification - Is there some way to notify (via email) that a change has
occurred to a given file, or that a new file is available? Is the thought that
this would be part of logging?

I know that some of these questions are very specific, but I think it might
provide a validation of good architecture to see how these fit in.  I think
that the idea of a message queue that either gets filled by someone's own
generator or based on other links embedded in a web page provides a very
flexible architecture.

I like the idea of the overwriting config files, but I am personally a fan
of one big file that the user can configure. This seems to lend itself to
less debugging. So instead of multiple config files based on the URL,
maybe use a master set of config options with the overwriting parameters
based on the URL pattern.
Something like

    <site urlMatch="*">

So here the default would follow all links, but if it were in
then it would only follow .html links. I don't know if this is a real
parameter, just an example. This is how the apache web server works. Tomcat
works in a similar way, but there are different files (web.xml), although
this is mostly because they are completely different applications
potentially written by different people.

Thanks for sharing.


On 7/13/02 11:53 AM, "Clemens Marschner" <> wrote:

> ok, this is my proposal for the crawler configuration. And you tell me if
> I'm reinventing the wheel:
> Overview
> --------
> I distinguish (logically, not necessarily on a class level) between 5
> different types of components:
> - "filters" are parts of the message pipeline. They get a message and either
> pass it on or not. They are put into a messageHandler pipeline and are
> notified about their insertion. Filters don't know about each other. If they
> share common data, this has to be kept on the Service level.
> - "services" are things like the host manager, probably a logfile manager,
> and other things that the other components share. The other components
> should be able to access these services, but the services should not know
> about them.
> - "storages" (or sinks) are where the documents go after they have been
> fetched
> - "sources" are sources of messages (i.e. URLMessages). They typically run
> within their own thread of control and know the messageHandler.
> - then there are some "steering" components that monitor the pipeline and
> probably reconfigure it. They build the infrastructure. The ThreadMonitor
> gathers runtime information. If we want to have this information displayed
> the way we do it now, we need it to know all the other components. I'd leave
> that as it is at the moment, we could change it later. But I'd like the
> configuration component to know as little as possible about the other
> components. See below how I'd like to achieve that.
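
A minimal sketch of the filter idea as I read it: each filter either passes a message on or drops it, and filters do not know about each other. The names `UrlFilter`, `FilterPipeline` and `MaxLengthFilter` are hypothetical, and a plain `String` stands in for a URLMessage; the real LARM types may look quite different.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical interface for illustration; a String stands in for a URLMessage.
interface UrlFilter {
    // Returns the message to pass on, or null to drop it.
    String handle(String urlMessage);
}

class FilterPipeline {
    private final List<UrlFilter> filters = new ArrayList<UrlFilter>();

    void add(UrlFilter filter) {
        filters.add(filter);
    }

    // Runs the message through each filter in order and stops at the
    // first filter that drops it.
    String process(String urlMessage) {
        String current = urlMessage;
        for (UrlFilter filter : filters) {
            current = filter.handle(current);
            if (current == null) {
                return null;
            }
        }
        return current;
    }
}

// Example filter: drops URLs longer than a configured maximum.
class MaxLengthFilter implements UrlFilter {
    private final int maxLength;

    MaxLengthFilter(int maxLength) {
        this.maxLength = maxLength;
    }

    public String handle(String urlMessage) {
        return urlMessage.length() <= maxLength ? urlMessage : null;
    }
}

class FilterPipelineDemo {
    public static void main(String[] args) {
        FilterPipeline pipeline = new FilterPipeline();
        pipeline.add(new MaxLengthFilter(255));
        System.out.println(pipeline.process("http://a.example/"));
    }
}
```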
> Layer Diagram
> -------------
> ---------------|-------------------------------------
>              |          MessageHandler(?)
>              |-   -   -   -   -  -  -  -   -  -  -
> ThreadMon->    |source | filter | filter... | storage
>              |
>              |--------------|----------------------
> Configurator-> |              v
>              |          Services
> ---------------|-------------------------------------
> I'm not quite sure where the MessageHandler fits in here. Is it also a
> service? I like a layered model better.
> The other possibility would be to regard all components as being independent
> and on the same level. But the configurator keeps track of the interactions
> between them.
> Configuration
> -------------
> Prerequisite: All the components mentioned are implemented as JavaBeans
> (tada, the main idea today!)
> Then we can use bean utility classes to set their properties. I've had a
> look at jakarta-commons which contains a BeanUtils package which should
> contain whatever we need.
> Since every service/filter is a singleton, we can distinguish it in the
> property file by its class name. If we ever need two instances of a class,
> we'd have to change that. But for simplicity, I think this will do well at
> this time.
> Then I think we can use a syntax of the property file like
> <ClassName>.<propertyName>=<PropertyValue>
> "ClassName" can be fully qualified (i.e. with package) or we could assume a
> default package like "de.lanlab.larm.fetcher". This could serve us well if
> the package name changes.
> [If the class name is fully qualified, however, we'd have a problem with
> nested property names like ""]
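
A sketch of the key-splitting step under the simpler of the two options: class names are unqualified, so the first dot ends the class name and everything after it is the (possibly nested) property path. `PropertyKey` is a hypothetical name, and only the default package string comes from the proposal.

```java
// Hypothetical sketch of splitting "<ClassName>.<propertyName>" keys.
// Assumes unqualified class names, so the first dot ends the class name.
class PropertyKey {
    static final String DEFAULT_PACKAGE = "de.lanlab.larm.fetcher";

    final String className;    // fully qualified using the default package
    final String propertyPath; // may itself contain dots (nested property)

    PropertyKey(String className, String propertyPath) {
        this.className = className;
        this.propertyPath = propertyPath;
    }

    static PropertyKey parse(String key) {
        int dot = key.indexOf('.');
        if (dot <= 0 || dot == key.length() - 1) {
            throw new IllegalArgumentException(
                    "Expected <ClassName>.<propertyName>: " + key);
        }
        return new PropertyKey(DEFAULT_PACKAGE + "." + key.substring(0, dot),
                key.substring(dot + 1));
    }
}

class PropertyKeyDemo {
    public static void main(String[] args) {
        PropertyKey key = PropertyKey.parse("LoggerService.logs.links.fileName");
        System.out.println(key.className + " -> " + key.propertyPath);
    }
}
```

With fully qualified class names the same split no longer works, because nothing marks where the package ends and the property path begins.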
> The Configurator
> ----------------
> The configurator should be capable of the following:
> - divide class names from property names.
> - initialize the classes found
> - register the instances in some kind of naming service (i.e. a global
> HashMap)
> - find and resolve dependencies among the different components
> - set the properties according to the props file (using BeanUtil's
> PropertyUtils.set(|Mapped|Indexed)Property())
> - provide a decent error handling (i.e. return line numbers if exceptions
> are thrown)
> Connecting different components:
> --------------------------------
> I don't want components to create other components or services. This should
> be done by the configurator. I can imagine two ways how components may be
> connected:
> - They tell the configurator that they need this or that service. I.e. the
> VisitedFilter needs the HostManager.
> - A property contains Service Names. Then these services have to be set up
> before the property is set.
> Therefore, the config process needs to be at least twofold: In a first step
> the components are set up and initialized, and in a second step, connections
> between components are set up.
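
The two-step process could be sketched roughly as follows. This is an assumption-laden illustration: `TwoPhaseConfigurator` and its methods are hypothetical, the two bean classes are empty stand-ins for the components named above, and plain reflection is used in place of BeanUtils so the example stays self-contained.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Empty stand-ins for the real components.
class HostManager {
}

class VisitedFilter {
    private HostManager hostManager;

    public HostManager getHostManager() {
        return hostManager;
    }

    public void setHostManager(HostManager hostManager) {
        this.hostManager = hostManager;
    }
}

// Hypothetical configurator: step 1 registers components by name,
// step 2 connects them through their bean setters.
class TwoPhaseConfigurator {
    private final Map<String, Object> registry = new HashMap<String, Object>();
    // Each entry is {componentName, propertyName, targetComponentName}.
    private final List<String[]> pendingLinks = new ArrayList<String[]>();

    void register(String name, Object component) {
        registry.put(name, component);
    }

    // Only records the connection; nothing is resolved yet.
    void link(String componentName, String property, String targetName) {
        pendingLinks.add(new String[] {componentName, property, targetName});
    }

    // Step 2: resolve every recorded connection once all components exist.
    void connect() {
        for (String[] link : pendingLinks) {
            Object component = lookup(link[0]);
            Object target = lookup(link[2]);
            try {
                findSetter(component.getClass(), link[1], target)
                        .invoke(component, target);
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException(
                        "Cannot set " + link[0] + "." + link[1], e);
            }
        }
    }

    Object lookup(String name) {
        Object component = registry.get(name);
        if (component == null) {
            throw new IllegalStateException("Unknown component: " + name);
        }
        return component;
    }

    private static Method findSetter(Class<?> type, String property, Object value)
            throws NoSuchMethodException {
        String setter = "set" + Character.toUpperCase(property.charAt(0))
                + property.substring(1);
        for (Method m : type.getMethods()) {
            if (m.getName().equals(setter) && m.getParameterTypes().length == 1
                    && m.getParameterTypes()[0].isInstance(value)) {
                return m;
            }
        }
        throw new NoSuchMethodException(type.getName() + "." + setter);
    }
}
```

Because links are only recorded in the first step, the order in which components appear in the config file does not matter: a filter can name a service that is registered later.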
> Config File Overlays
> --------------------
> I had the same idea as Andrew about how config files should be able to
> overwrite each other.
> Internally all properties are treated equally. But the user has to be able
> to distinguish several layers of configurations: I.e. a specific setup of
> the components that is reused every time, but different domains to be
> crawled.
> Therefore I propose that different config files can be specified which are
> loaded subsequently, probably overwriting properties already specified. I.e.
> java ...
> de.lanlab... -DFetcher.threads=50
> which means: is loaded first, then is
> included and probably overwrites some of the settings in,
> and at last the property Fetcher.threads is set manually.
> I know that the JRun server uses a similar method: There you have one
> and a for each server process instance. I
> always found this very useful.
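
The overlay behaviour maps directly onto `java.util.Properties`: loading several sources into the same object lets later ones overwrite earlier ones. In the sketch below, `ConfigOverlay` is a hypothetical name and the layers are in-memory strings rather than files, purely to keep the example self-contained.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical helper illustrating config overlays with java.util.Properties.
class ConfigOverlay {
    // Loads each layer in order into one Properties object; later layers win.
    static Properties merge(String... layers) {
        Properties merged = new Properties();
        for (String layer : layers) {
            try {
                merged.load(new StringReader(layer));
            } catch (IOException e) {
                // Not expected when reading from an in-memory string.
                throw new IllegalStateException(e);
            }
        }
        return merged;
    }

    // Applies manual overrides (for example -D system properties) last.
    static Properties withOverrides(Properties base, Properties overrides) {
        Properties result = new Properties();
        result.putAll(base);
        result.putAll(overrides);
        return result;
    }
}

class ConfigOverlayDemo {
    public static void main(String[] args) {
        Properties merged = ConfigOverlay.merge(
                "Fetcher.threads=25\nURLLengthFilter.maxURLLength=255\n",
                "Fetcher.threads=10\n");
        Properties manual = new Properties();
        manual.setProperty("Fetcher.threads", "50");
        Properties effective = ConfigOverlay.withOverrides(merged, manual);
        System.out.println(effective.getProperty("Fetcher.threads")); // prints 50
    }
}
```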
> Example Property File:
> ---------------------
> # do we need this?
> # MessageHandler is initialized first and gets the filters property set.
> # those filters have to be initialized in a second step, when all is set up.
> MessageHandler.filters=URLLengthFilter,URLScopeFilter,RobotExclusionFilter,URLVisitedFilter,KnownPathsFilter
> # configurator knows here that we need a MessageHandler, so the
> # line above is redundant in this case
> #
> LoggerService.baseDir=logs/
> LoggerService.logs=store,links  # defines property names used below
> #
> LoggerService.logs.links.fileName=links.log
> StoragePipeline.docStorages=LogStorage
> StoragePipeline.linkStorages=LinkLogStorage,MessageHandler
> LogStorage.log=store   # the log name from the logger service
> LinkLogStorage.log=links
> # LuceneStorage.analyzer=org.apache.lucene.analysis.StandardAnalyzer
> # LuceneStorage.createIndex=true
> # LuceneStorage.indexName=luceneIndex
> # LuceneStorage.fieldInfos=url,content
> # LuceneStorage.fieldInfos.url = Index,Store
> # LuceneStorage.fieldInfos.content = Index,Store,Tokenize
> # manually define host synonyms. I don't know if there's a better way than
> # the following, and if the method used here is possible anyway (one property
> # two times)
> # or
> #,
> #,
> # coded as void setAddSynonym(String) - not so nice
> # alternative:
> HostManager.synonyms[0],
> HostManager.synonyms[1],
> # but this would prevent adding further synonyms in other config files
> URLScopeFilter.inScope=http://.*myHost.*
> # or additionally URLScopeFilter.outOfScope=... ?
> # RobotExclusionFilter doesn't have properties. It just needs to know the
> # host manager. MessageHandler should
> # make clear that the filter has to be initialized. I think both have to
> # provide a method like
> # 'String[] componentsNeeded()' that returns the component names to set up.
> # MessageHandler would return the value as specified in
> # "MessageHandler.filters", REFilter would return
> # HostManager
> HTTPProtocol.extractGZippedFiles=false
> URLLengthFilter.maxURLLength=255
> Fetcher.threadNumber=25
> Fetcher.docStorage=StoragePipeline
> Fetcher.linkStorage=StoragePipeline
> # here comes the MIME type stuff, which is not yet implemented. Only HTML is
> # parsed, the rest is stored as-is.
> # this is an example of another storage:
> # SQLStorage.driver=com.ashna.JTurbo.driver.Driver
> # SQLStorage.url=jdbc:JTurbo://host/parameters
> # SQLStorage.user=...
> # SQLStorage.password=...
> Some Closing Remarks
> --------------------
> OK, you made it this far. Very good.
> I think with this configuration LARM can be much more than just a crawler.
> With a few changes it can also be used as a processor for documents that
> come over the network, i.e. in a JMS topic.
> I haven't mentioned what I call "Sources" or message producers. These are
> active components that run in their own thread and put messages into the
> queue.
> If we have a JMSStorage and a JMSSource, then the crawler can be divided
> into two pieces just from the config file.
> part one:   Fetcher -> JMSStorage
> part two:   JMSSource -> LuceneStorage
> with all the possibilities for distribution.
> Given a different source, one could also imagine feeding the crawler with
> files or with URLs from a web frontend.
> Clemens
> --------------------------------------
> --
> To unsubscribe, e-mail:   <>
> For additional commands, e-mail: <>
