lucene-dev mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: LARM: Configuration RFC
Date Thu, 18 Jul 2002 06:09:35 GMT
Clemens,

My overall impression is that this is overly complicated.
My brain is probably tired (past 1 AM), but I can't help but think that
there must be a simpler way....

Also, I believe this thread morphed into a thread about whether LARM
could/should be built as an Avalon component.  I don't know enough
about Avalon, but I think configuring your components and using Avalon
overlap only partially.  That is, Avalon can, I think, provide a good
infrastructure, a good container for your code, lifecycle methods and
such, but I'm not sure if it helps with component and system
configuration.

Some comments inlined...

--- Clemens Marschner <Clemens.Marschner@internet.lmu.de> wrote:
> 
> ok, this is my proposal for the crawler configuration. And you tell
> me if
> I'm reinventing the wheel:
> 
> Overview
> --------
> 
> I distinguish (logically, not necessarily on a class level) between 5
> different types of components:
> - "filters" are parts of the message pipeline. They get a message and
> either
> pass it on or not. They are put into a messageHandler pipeline and
> are notified about their insertion.

Who/what is 'their' here?
Messages are put in the pipeline and filters are notified of their
insertion?

> Filters don't know about each other.
> If they
> share common data, this has to be kept on the Service level
> - "services" are things like the host manager, probably a logfile
> manager,
> and other things that the other components share. The other
> components
> should be able to access these services, but the services should not
> know
> about them.
> - "storages" (or sinks) are where the documents go after they have
> been fetched

Maybe this is just a confusing term to me (storages).
When you fetch a link, what do you do with it?
Do you store the page (HTML and all)?
If so, where do you store it?  File system?

Or do you parse it with one of the filters, extract links with another
filter, and send extracted links to URL queue, and extracted text to
LuceneStorage?

> - "sources" . are sources of messages (i.e. URLMessages). They
> typically run
> within their own thread of control and know the messageHandler.
> - then there are some "steering" components that monitor the pipeline
> and
> probably reconfigure it. They build the infrastructure. The
> ThreadMonitor
> gathers runtime information. If we want to have this information
> displayed
> the way we do it now, we need it to know all the other components.
> I'd leave
> that as it is at the moment, we could change it later. But I'd like
> the
> configuration component to know as little as possible about the other
> components. See below how I'd like to achieve that.
> 
> 
> Layer Diagram
> -------------
> 
> 
> ---------------|-------------------------------------
>                |          MessageHandler(?)
>                |-   -   -   -   -  -  -  -   -  -  -
> ThreadMon->    |source | filter | filter... | storage
>                |
>                |--------------|----------------------
> Configurator-> |              v
>                |          Services
> ---------------|-------------------------------------
> 
> I'm not quite sure where the MessageHandler fits in here. Is it also
> a
> service? I like a layered model better.

I'm also reading your PDF document (version 0.5) now.
One thing seems 'wrong' to me.
If I understand things correctly, you have:
[url/message queue] -> [filter1] ... [filterN] -> [fetcher]

This is the pipeline, correct?

This sounds like it is reversed to me.
Wouldn't this be better:

[url/message queue] -> [fetcher] -> [filter1] ... [filterN]

In English:
- get the next URL (or batch of URLs) to fetch, from the queue
- fetch the URL
- pass the fetched page through the different filters in the pipeline,
  e.g.:
  - a filter to extract links
  - a filter to check each link against the restrictto pattern
  - a filter to check each link against the 'Visited' list
  - put any remaining links (not filtered out) into the URL queue
  - a filter to extract text for indexing (e.g. an HTML parser)
  - a filter to store the extracted text (e.g. LuceneStorage)
  - a filter to mark the fetched URL as fetched, set the last-fetched
    date, etc.

Wouldn't that be better?
If I understand things correctly, you do this in the opposite order,
which, I think, means that you store all the extracted links in the URL
queue and filter 'bad' ones out only right before fetching.
If that is so, your URL queue is going to be unnecessarily large.

What am I missing? (other than sleep)
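To make the fetch-first ordering concrete, here is a minimal sketch of what I mean. All the class and method names (PipelineSketch, PageFilter, fetch) are made up for illustration, not LARM's actual API:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the fetch-first pipeline: the fetcher runs once
// per URL, and each filter then sees the already-fetched page.
public class PipelineSketch {

    interface PageFilter {
        // Returns the (possibly modified) page, or null to drop it.
        String process(String url, String page);
    }

    static String fetch(String url) {
        // Stand-in for the real HTTP fetch.
        return "<html><a href=\"http://example.com/next\">next</a></html>";
    }

    public static void main(String[] args) {
        List<PageFilter> filters = new ArrayList<PageFilter>();
        // e.g. a link-extraction step, then a storage step
        filters.add((url, page) -> {
            System.out.println("extracting links from " + url);
            return page;
        });
        filters.add((url, page) -> {
            System.out.println("storing " + url);
            return page;
        });

        String url = "http://example.com/";
        String page = fetch(url);          // 1. fetch first
        for (PageFilter f : filters) {     // 2. then run the filter chain
            page = f.process(url, page);
            if (page == null) break;       // a filter dropped the page
        }
    }
}
```

The point is just that the expensive step (the fetch) happens once, before the cheap per-page filters, and links only enter the queue after filtering.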

> The other possibility would be to regard all components as being
> independent
> and on the same level. But the configurator keeps track of the
> interactions between them.

Do you really need an external Configurator component to configure
other components?
Why not have each component configure itself?
Each component can get its own properties and set its own attributes.
You would need only 1 place to glue them all together.
This would be in Java, and may look something like this:

    fetcher      = new UrlFetcher();
    indexer      = new UrlIndexer();
    persister    = new UrlPersister();
    sweeper      = new Sweeper();
    errorHandler = new ErrorHandler();
    ...
    mds.addServerCommand(fetcher);
    mds.addServerCommand(indexer);
    mds.addServerCommand(persister);
    mds.addServerCommand(sweeper);
    mds.start();
    scheduler.setOutQueue(mds.getInQueue());
    try
    {
        scheduler.start();
    }
    ...

You get the idea.
Here, Scheduler is the component that talks to the URL queue and puts
messages containing URLs in the processing queue (the pipeline).

The 'mds' instance that you see above knows to pass messages from one
component to the next in the above order.

So only the file where the above code is entered needs to know about
different components (filters, URL and page processors, storages).

A while ago you mentioned you wanted to provide different sets of
components, different pipelines (pipelines with different sets of
filters, etc.).
To do that you would either need to create (hard-code) a few common
sets in Java, like above example for 1 set of components, or you could
come up with a way to read the components from a file (properties or
XML or custom format) which will tell your 'configurator' component
which components to instantiate and how to wire them together into a
pipeline.
.....which is, I guess, what you are asking further down.
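The "read the components from a file" variant needs surprisingly little code with plain reflection. A rough sketch, assuming a property key like the MessageHandler.filters line from your example (the buildPipeline name and the use of JDK classes as stand-in filters are just for the demo):

```java
import java.io.StringReader;
import java.util.Properties;

// Sketch: instantiate the pipeline components listed in a properties file.
public class ConfiguratorSketch {

    public static Object[] buildPipeline(Properties props, String defaultPackage)
            throws Exception {
        String[] names = props.getProperty("MessageHandler.filters").split(",");
        Object[] filters = new Object[names.length];
        for (int i = 0; i < names.length; i++) {
            String cls = names[i].trim();
            if (cls.indexOf('.') < 0) {
                cls = defaultPackage + "." + cls;  // assume the default package
            }
            // each component needs a public no-arg constructor
            filters[i] = Class.forName(cls).getDeclaredConstructor().newInstance();
        }
        return filters;
    }

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // JDK classes stand in for real filters so the example runs standalone
        props.load(new StringReader(
            "MessageHandler.filters=java.lang.StringBuilder,java.lang.Object\n"));
        Object[] pipeline = buildPipeline(props, "de.lanlab.larm.fetcher");
        System.out.println(pipeline.length + " components instantiated");
    }
}
```

Wiring the instances together (step two of your twofold process) would then happen after all of them exist.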

> Configuration
> -------------
> 
> Prerequisite: All the components mentioned are implemented as
> JavaBeans
> (tada, the main idea today!)
> 
> Then we can use bean utility classes to set their properties. I've
> had a
> look at jakarta-commons which contains a BeanUtils package which
> should
> contain whatever we need.
> 
> since every service/filter is a singleton, we can distinguish it in
> the
> property file by its class name. If we ever need two instances of a
> class,
> we'd have to change that. But for simplicity, I think this will do
> well at this time.

You will want to change that or you'll be sorry when one of your
filters becomes a bottleneck and you can't instantiate more of them :)

> Then I think we can use a syntax of the property file like
> 
> <ClassName>.<propertyName>=<PropertyValue>
> 
> "ClassName" can be fully qualified (i.e. with package) or we could
> assume a
> default package like "de.lanlab.larm.fetcher". This could serve us
> well if
> the package name changes.
> [If the class name is fully qualified, however, we'd have a problem
> with
> nested property names like "package.class.foo.bar", however]

Aren't there projects under Jakarta Commons that can eliminate the need
for custom code to translate properties to java beans attributes?
Digester maybe?
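Even without Commons, what such a utility does for simple String properties is just a reflective setter lookup. A much-simplified sketch (the Fetcher bean and the userAgent property are invented for the demo; BeanUtils handles type conversion, nesting, etc. on top of this):

```java
import java.lang.reflect.Method;

// Simplified sketch of what a ClassName.propertyName=value mapping does:
// find the matching setter via reflection (String-typed properties only).
public class BeanSetterSketch {

    public static void setProperty(Object bean, String name, String value)
            throws Exception {
        String setter = "set" + Character.toUpperCase(name.charAt(0))
                + name.substring(1);
        Method m = bean.getClass().getMethod(setter, String.class);
        m.invoke(bean, value);
    }

    // A throwaway bean, just for the demo.
    public static class Fetcher {
        private String userAgent;
        public void setUserAgent(String ua) { this.userAgent = ua; }
        public String getUserAgent() { return userAgent; }
    }

    public static void main(String[] args) throws Exception {
        Fetcher f = new Fetcher();
        // corresponds to a line like "Fetcher.userAgent=LARM" in the props file
        setProperty(f, "userAgent", "LARM");
        System.out.println(f.getUserAgent());
    }
}
```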

> The Configurator
> ----------------
> 
> The configurator should be capable of the following:
> - divide class names from property names.
> - initialize the classes found
> - register the instances in some kind of naming service (i.e. a
> global
> HashMap)
> - find and resolve dependencies among the different components
> - set the properties according to the props file (using BeanUtil's
> PropertyUtils.set(|Mapped|Indexed)Property())
> - provide a decent error handling (i.e. return line numbers if
> exceptions
> are thrown)

The first 3 points (dashes) can be taken care of by using k2d2.org
framework, or, I assume, Avalon (hm, that's a big guess, I don't really
know).

> Connecting different components:
> --------------------------------
> 
> I don't want components to create other components or services. This
> should
> be done by the configurator. I can imagine two ways how components
> may be
> connected:
> - They tell the configurator that they need this or that service.
> I.e. the
> VisitedFilter needs the HostManager.
> - A property contains Service Names. Than these services have to be
> set up
> before the property is set.
> Therefore, the config process needs to be at least twofold: In a
> first step
> the components are set up and initialized, and in a second step,
> connections
> between components are set up.

Ok, I guess that is what I was talking about earlier.
I'm not a fan of using XML for config files when simple name=value
properties will do, but this sounds kind of 'structured', so an XML
config file may be helpful.  (Or you can tokenize those property values
on commas, but that has always looked like a hack to me.)


> Config File Overlays
> --------------------
> 
> I had the same idea as Andrew about how config files should be able
> to
> overwrite each other.
> Internally all properties are treated equally. But the user has to be
> able
> to distinguish several layers of configurations: I.e. a specific
> setup of
> the components that is reused every time, but different domains to be
> crawled.
> Therefore I propose that different config files can be specified
> which are
> loaded subsequently, probably overwriting properties already
> specified. I.e.
> 
> java ...
> de.lanlab... -Iglobal.properties -Imycrawl.properties
> -DFetcher.threads=50
> 
> which means: global.properties is loaded first, then
> mycrawl.properties is
> included and probably overwrites some of the settings in
> global.properties,
> and at last the property Fetcher.threads is set manually.
> 
> I know that the JRun server uses a similar method: There you have one
> global.properties and a local.properties for each server process
> instance. I
> always found this very useful.

Yes, me too.
If you want, I can send you a class that does it, so you don't have to
type it up :).  Simple stuff.
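For the record, the overlay scheme is only a few lines with java.util.Properties, since load() simply overwrites keys it has already seen. A sketch (ConfigOverlay is a made-up name; a real version would read FileInputStreams for -Iglobal.properties etc. and apply -D overrides last):

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Properties;

// Sketch: load several property sources in order; later ones overwrite
// earlier values, JRun-style (global.properties, then mycrawl.properties).
public class ConfigOverlay {

    public static Properties load(Reader[] sources) throws IOException {
        Properties props = new Properties();
        for (Reader r : sources) {
            props.load(r);   // load() overwrites keys seen before
        }
        return props;
    }

    public static void main(String[] args) throws IOException {
        Properties p = load(new Reader[] {
            new StringReader("Fetcher.threads=10\nLoggerService.baseDir=logs/"),
            new StringReader("Fetcher.threads=50")   // the overlay wins
        });
        System.out.println(p.getProperty("Fetcher.threads"));
    }
}
```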


> Example Property File:
> ---------------------
> 
> Configurator.services=HostManager,MessageHandler,LoggerService,HTTPProtocol
> # do we need this?
> 
> # MessageHandler is initialized first and gets the filters property
> set.
> # those filters have to be initialized in a second step, when all is
> set up.
> MessageHandler.filters=URLLengthFilter,URLScopeFilter,RobotExclusionFilter,URLVisitedFilter,KnownPathsFilter
> # configurator knows here that we need a MessageHandler, so the
> Configurator.services line above is redundant in this case
> 
> #
> LoggerService.baseDir=logs/
> LoggerService.logs=store,links  # defines property names used below
> # LoggerService.logs.store.class=SimpleLogger
> LoggerService.logs.store.fileName=store.log
> LoggerService.logs.links.fileName=links.log
> LoggerService.logs.store.fileName=store.log
> 
> 
> StoragePipeline.docStorages=LogStorage
> StoragePipeline.linkStorages=LinkLogStorage,MessageHandler
> 
> LogStorage.log=store   # the log name from the logger service
> LinkLogStorage.log=links
> 
> # LuceneStorage.analyzer=org.apache.lucene.analysis.StandardAnalyzer
> # LuceneStorage.createIndex=true
> # LuceneStorage.indexName=luceneIndex
> # LuceneStorage.fieldInfos=url,content
> # LuceneStorage.fieldInfos.url = Index,Store
> # LuceneStorage.fieldInfos.content = Index,Store,Tokenize

I like this Lucene part - specifying fields' characteristics via
properties.
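Parsing those flag lists is straightforward, too. A sketch of just the flag-parsing step (the class name is invented; the actual Lucene Field construction, which would consume the three booleans, is left out so this runs standalone):

```java
import java.util.Arrays;
import java.util.List;

// Sketch: turn a fieldInfos value like "Index,Store,Tokenize" into the
// store/index/tokenize booleans a Lucene Field constructor would take.
public class FieldInfoSketch {

    public static boolean[] parseFlags(String value) {
        List<String> flags = Arrays.asList(value.split("\\s*,\\s*"));
        return new boolean[] {
            flags.contains("Store"),
            flags.contains("Index"),
            flags.contains("Tokenize")
        };
    }

    public static void main(String[] args) {
        boolean[] f = parseFlags("Index,Store,Tokenize");
        System.out.println("store=" + f[0] + " index=" + f[1] + " token=" + f[2]);
    }
}
```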

> # manually define host synonyms. I don't know if there's a better way
> than
> the following, and if the method used here is possible anyway (one
> property
> two times)

No, you can't do that, I'm pretty sure.  The Properties class is a
subclass of Hashtable, so duplicate keys (prop names) would clash and
only the last one would survive.
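A two-line demonstration of why "one property two times" can't work (using your synonym lines verbatim):

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Properties is a Hashtable, so the second occurrence of a key
// silently replaces the first when the file is loaded.
public class DuplicateKeyDemo {
    public static void main(String[] args) throws IOException {
        Properties p = new Properties();
        p.load(new StringReader(
            "HostManager.synonym=www.foo1.bar.com,www.foo2.bar.com\n" +
            "HostManager.synonym=www1.foo.com,www2.foo.com\n"));
        // Only the second line survives:
        System.out.println(p.getProperty("HostManager.synonym"));
    }
}
```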

> HostManager.synonym=www.foo1.bar.com,www.foo2.bar.com
> HostManager.synonym=www1.foo.com,www2.foo.com
> # or
> # HostManager.addSynonym=www.foo1.bar.com,www.foo2.bar.com
> # HostManager.addSynonym=www1.foo.com,www2.foo.com
> # coded as void setAddSynonym(String) - not so nice
> 
> # alternative:
> HostManager.synonyms[0]=www.foo1.bar.com,www.foo2.bar.com
> HostManager.synonyms[1]=www1.foo.com,www2.foo.com
> # but this would prevent adding further synonyms in other config
> files

This is also where XML may help:
<HostSynonym>
  <name>www.example.com</name>
  <syn>www1.example.com</syn>
  <syn>www2.example.com</syn>

  <name>www.porkchop.com</name>
  <syn>www1.porkchop.com</syn>
  <syn>www2.porkchop.com</syn>
</HostSynonym>
   
> URLScopeFilter.inScope=http://.*myHost.*
> # or additionally URLScopeFilter.outOfScope=... ?

Yes, I was going to tell you the other day.
You need 'include pattern' as well as 'exclude pattern'.
Include pattern may be something like *.de, exclude pattern may be
things like /cgi-bin/ or '?' or /wpoison/ or '.cgi' or ...
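An include/exclude check along those lines could be as simple as the following java.util.regex sketch (the class name and the example patterns are mine, not LARM's):

```java
import java.util.regex.Pattern;

// Sketch of a scope filter with both an include and an exclude pattern:
// a URL passes only if it matches inScope and does not match outOfScope.
public class UrlScopeFilter {
    private final Pattern inScope;
    private final Pattern outOfScope;

    public UrlScopeFilter(String inScope, String outOfScope) {
        this.inScope = Pattern.compile(inScope);
        this.outOfScope = Pattern.compile(outOfScope);
    }

    public boolean accepts(String url) {
        return inScope.matcher(url).find()
            && !outOfScope.matcher(url).find();
    }

    public static void main(String[] args) {
        UrlScopeFilter f = new UrlScopeFilter(
            "^http://[^/]*\\.de/",                // include: *.de hosts
            "/cgi-bin/|\\.cgi|\\?|/wpoison/");    // exclude: dynamic pages
        System.out.println(f.accepts("http://www.example.de/index.html"));
        System.out.println(f.accepts("http://www.example.de/cgi-bin/x"));
    }
}
```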

> # RobotExclusionFilter doesn't have properties. It just needs to know
> the
> host manager. MessageHandler should
> # make clear that the filter has to be initialized. I think both have
> to
> provide a method like
> # 'String[] componentsNeeded()' that return the component names to
> set up.
> # MessageHandler would return the value as specified in
> "MessageHandler.filters", REFilter would return
> # HostManager
> 
> HTTPProtocol.extractGZippedFiles=false
> 
> URLLengthFilter.maxURLLength=255
> 
> Fetcher.threadNumber=25
> Fetcher.docStorage=StoragePipeline
> Fetcher.linkStorage=StoragePipeline
> # here comes the MIME type stuff, which is not yet implemented. Only
> HTML is
> parsed, the rest is stored as-is.
> 
> 
> # this is an example of another storage:
> 
> # SQLStorage.driver=com.ashna.JTurbo.driver.Driver
> # SQLStorage.url=jdbc:JTurbo://host/parameters
> # SQLStorage.user=...
> # SQLStorage.password=...
> 
> 
> Some Closing Remarks
> --------------------
> 
> Ok you made it until here. Very good.

2 AM. :(

> I think with this configuration LARM can be much more than just a
> crawler.
> With a few changes it can also be used as a processor for documents
> that
> come over the network, i.e. in a JMS topic.
> I haven't mentioned what I call "Sources" or message producers. These
> are
> active components that run in their own thread and put messages into
> the queue.

Like Fetcher threads?

> If we have a JMSStorage and a JMSSource, then the crawler can be
> divided
> into two pieces just from the config file.
>  part one:   Fetcher -> JMSStorage
>  part two:   JMSSource -> LuceneStorage
> with all the possibilities for distribution.
> Given a different source, one could also imagine feeding the crawler
> with files or with URLs from a web frontend.

This sounds like plans of failed 'dot coms' :)
We could do this, and this, and this..... in the end there is no focus,
no product, no company, no jobs.  At least there were good parties,
free food, limos, nice SGI screens, Aeron chairs, and a nice office in
Chinatown ;)

Otis




