lucene-dev mailing list archives

From "Clemens Marschner" <c...@lanlab.de>
Subject Re: LARM: Configuration RFC
Date Mon, 12 Aug 2002 13:12:16 GMT
> My overall impression is that this is overly complicated.
> My brain is probably tired (past 1 AM), but I can't help but think that
> there must be a simpler way....

Hi Otis,
sorry for the delay. Because of that I will repeat most of my original message:


> > I distinguish (logically, not necessarily on a class level) between 5
> > different types of components:
> > - "filters" are parts of the message pipeline. They get a message and
> > either
> > pass it on or not. They are put into a messageHandler pipeline and
> > are notified about their insertion.
>
> Who/what is 'their' here?
> Messages are put in the pipeline and filters are notified of their
> insertion?

No, filters are installed in the pipeline and get a notification that they
were inserted.

> > Filters don't know about each other.
> > If they
> > share common data, this has to be kept on the Service level
> > - "services" are things like the host manager, probably a logfile
> > manager,
> > and other things that the other components share. The other
> > components
> > should be able to access these services, but the services should not
> > know
> > about them.
> > - "storages" (or sinks) are where the documents go after they have
> > been fetched
>
> Maybe this is just a confusing term to me (storages).
> When you fetch a link, what do you do with it?
> Do you store the page (HTML and all)?
> If so, where do you store it?  File system?

Call it a processor or sink. It's "where the meat goes" after being fetched.
(Sorry, watched too much Seinfeld lately) ["But where do you turn it on?"]

> Or do you parse it with one of the filters, extract links with another
> filter, and send extracted links to URL queue, and extracted text to
> LuceneStorage?

Ok, once more:

  URLMsg.       URLMsg.            WebDoc
  -----> FILTER* -----> FETCHER{1} -----> PROCESSOR+

Right now:
Processing = part of the Fetcher -> bad
PROCESSOR = Storage/StoragePipeline -> just a special kind of processor
I'm still not entirely comfortable with the word "processor", though; it should
be something like drain or sink. There's no "is-a" relationship between
"Storage" and "Processor".

> > - "sources" . are sources of messages (i.e. URLMessages). They
> > typically run
> > within their own thread of control and know the messageHandler.
> > - then there are some "steering" components that monitor the pipeline
> > and
> > probably reconfigure it. They build the infrastructure. The
> > ThreadMonitor
> > gathers runtime information. If we want to have this information
> > displayed
> > the way we do it now, we need it to know all the other components.
> > I'd leave
> > that as it is at the moment, we could change it later. But I'd like
> > the
> > configuration component to know as little as possible about the other
> > components. See below how I'd like to achieve that.
> >
> >
> > Layer Diagram
> > -------------
> >
> >
> > ---------------|-------------------------------------
> >                |          MessageHandler(?)
> >                |-   -   -   -   -  -  -  -   -  -  -
> > ThreadMon->    |source | filter | filter... | storage
> >                |
> >                |--------------|----------------------
> > Configurator-> |              v
> >                |          Services
> > ---------------|-------------------------------------
> >
> > I'm not quite sure where the MessageHandler fits in here. Is it also
> > a
> > service? I like a layered model better.
>
> I'm also reading your PDF document (version 0.5) now.
> One thing seems 'wrong' to me.
> If I understand things correctly, you have:
> [url/message queue] -> [filter1] ... [filterN] -> [fetcher]
>
> This is the pipeline, correct?
>
> This sounds like it is reversed to me.
> Wouldn't this be better:
>
> [url/message queue] -> [fetcher] -> [filter1] ... [filterN]

No, since the filters are applied to the URLs. The idea is that URLs are
filtered or changed, e.g. because of robot exclusion etc.

> In English:
> - get the next URL (or batch of URLs) to fetch, from the queue
> - fetch the URL
> - Pass the fetched page through different filters in the pipeline
>   (e.g. filter to extract links
>         filter to check each link against restrictto pattern
>         filter to check each link against 'Visited' list
>         put any remaining links (not filtered out) to URL queue
>         filter to extract text for indexing (e.g. HTML parser)
>         filter to store the extracted text (e.g. Lucene Storage)
>         filter to mark the fetched URL as fetched, set last fetched
> date, etc.
>   )
>
> Wouldn't that be better?

It's all the same, since the whole thing is a CIRCLE. If you draw it from
left to right and put the point where you would insert a URL from the outside
at the left, we're pretty close together. I put it that way because I think
that every URL should walk through the filter pipeline.
It's also a threading issue: the filter pipeline runs in ONE thread, since it
uses a lot of resources that would otherwise have to be shared among the
threads, which would slow things down a lot. The processing (or storage)
pipeline, on the other hand, runs within the fetcher threads, since processing
can be done in parallel while storage _has_ to be synchronized. Otherwise
there would have to be a document queue in front of the storage mechanism,
which doesn't make sense.
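
To illustrate that split (my sketch, not the actual LARM code; it reuses the
stub types from the interface sketch above):

class Storage {
    // storage _has_ to be serialized: only one thread at a time may write
    // to the index (or whatever the sink is)
    public synchronized void store(WebDocument doc) {
        // ... add the document to the Lucene index, write a file, etc.
    }
}

class FetcherThread extends Thread {
    private final Storage storage;
    FetcherThread(Storage storage) { this.storage = storage; }

    public void run() {
        while (true) {
            URLMessage msg = nextMessage();  // from the queue the single filter thread fills
            WebDocument doc = fetch(msg);    // network I/O -- runs in parallel
            // processing (parsing, link extraction) also happens here,
            // in parallel, inside each fetcher thread
            storage.store(doc);              // serialized by the lock above
        }
    }

    private URLMessage nextMessage() { /* ... */ return null; }
    private WebDocument fetch(URLMessage m) { /* ... */ return null; }
}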

> If I understand you right, you do this in the opposite order,
> which, I think, means that you store all the extracted links in the URL
> queue and filter 'bad' ones out only right before fetching.
> If that is so, your URL queue is going to be unnecessarily large.

No, since URL processing is pretty fast, this queue is usually empty. That's
not a problem.

> > The other possibility would be to regard all components as being
> > independent
> > and on the same level. But the configurator keeps track of the
> > interactions between them.
>
> Do you really need an external Configurator component to configure
> other components?
> Why not have each component configure itself?
> Each component can get its own properties, set its own attributes.
> You would need only 1 place to glue them all together.
> This would be in Java, and may look something like this:
>
> fetcher      = new UrlFetcher();
> indexer      = new UrlIndexer();
> persister    = new UrlPersister();
> sweeper      = new Sweeper();
> errorHandler = new ErrorHandler();
> ...
> ...
> mds.addServerCommand(fetcher);
> mds.addServerCommand(indexer);
> mds.addServerCommand(persister);
> mds.addServerCommand(sweeper);
> mds.start();
> scheduler.setOutQueue(mds.getInQueue());
> try
> {
>     scheduler.start();
> }
> ...
>
> You get the idea.
> Here, Scheduler is the component that talks to the URL queue and puts
> messages containing URLs in the processing queue (the pipeline).
>
> The 'mds' instance that you see above knows to pass messages from one
> component to the next in the above order.

Have you taken a look at FetcherMain.java? It will look familiar to you,
because the pattern is exactly the same. What I attempted was to push this
towards a more generic approach.

> A while ago you mentioned you wanted to provide different sets of
> components, different pipelines (pipelines with different sets of
> filters, etc.).
> To do that you would either need to create (hard-code) a few common
> sets in Java, like above example for 1 set of components, or you could
> come up with a way to read the components from a file (properties or
> XML or custom format) which will tell your 'configurator' component
> which components to instantiate and how to wire them together into a
> pipeline.
> .....which is, I guess, what you are asking further down.
yep
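
Just to make it concrete, such a file could look roughly like this (the
Crawler.* keys are invented for illustration; only the component names are
taken from the examples further down):

# which components to instantiate, and in which order (class name = singleton)
Crawler.filters    = URLScopeFilter,KnownPathsFilter
Crawler.processors = LuceneStorage

# per-component properties, set via the bean property mechanism
URLScopeFilter.inScope  = http://.*myHost.*
LuceneStorage.indexName = luceneIndex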

> > since every service/filter is a singleton, we can distinguish it in
> > the
> > property file by its class name. If we ever need two instances of a
> > class,
> > we'd have to change that. But for simplicity, I think this will do
> > well at this time.
>
> You will want to change that or you'll be sorry when one of your
> filters becomes a bottleneck and you can't instantiate more of them :)

I don't know. "Do the simplest thing you possibly can." I think keeping it in
mind will be enough to avoid driving into a dead end.

> Aren't there projects under Jakarta Commons that can eliminate the need
> for custom code to translate properties to java beans attributes?
> Digester maybe?

> > The Configurator
> > ----------------
> >
> > The configurator should be capable of the following:
> > - divide class names from property names.
> > - initialize the classes found
> > - register the instances in some kind of naming service (i.e. a
> > global
> > HashMap)
> > - find and resolve dependencies among the different components
> > - set the properties according to the props file (using BeanUtil's
> > PropertyUtils.set(|Mapped|Indexed)Property())
> > - provide a decent error handling (i.e. return line numbers if
> > exceptions
> > are thrown)
>
> The first 3 points (dashes) can be taken care of by using k2d2.org
> framework, or, I assume, Avalon (hm, that's a big guess, I don't really
> know).

Mehran wanted to have a look at Avalon. I hope he'll find out whether it
provides what we want here. Otherwise, k2d2 is a good hint.
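
For the property-setting part itself, the core could look roughly like this
(a sketch, not LARM code; it assumes commons-beanutils on the classpath and
glosses over error handling and package prefixes):

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Properties;
import org.apache.commons.beanutils.PropertyUtils;

// Rough sketch of the configurator's inner loop: instantiate each class once,
// register it under its class name, and push the remaining properties onto
// the bean via BeanUtils.
public class SimpleConfigurator {

    private final Map registry = new HashMap();   // the "naming service"

    public void configure(Properties props) throws Exception {
        for (Iterator it = props.keySet().iterator(); it.hasNext();) {
            String key = (String) it.next();          // e.g. "URLScopeFilter.inScope"
            int dot = key.indexOf('.');
            if (dot < 0) continue;                    // not a component property
            String className = key.substring(0, dot);
            String property  = key.substring(dot + 1);

            Object component = registry.get(className);
            if (component == null) {                  // first occurrence -> create the singleton
                // a real version would prepend the package or use the full class name
                component = Class.forName(className).newInstance();
                registry.put(className, component);
            }
            // handles simple, indexed and mapped property expressions
            PropertyUtils.setProperty(component, property, props.getProperty(key));
        }
    }

    public Object lookup(String className) { return registry.get(className); }
}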

> I'm not a fan of using XML for config files if you can use simpler
> name=value properties, but this sounds kind of 'structured', so an XML
> config file may be helpful (or you can tokenize those property values
> on commas, but that always looked like a hack to me)

The only thing I want to avoid is parsing the XML by hand. I've done that
with Castor XML before (which I suppose no one would want to use here), and I
think there are already gazillions of frameworks out there that do exactly
that. I still hope that Avalon can get us through this.

> > # LuceneStorage.analyzer=org.apache.lucene.analysis.StandardAnalyzer
> > # LuceneStorage.createIndex=true
> > # LuceneStorage.indexName=luceneIndex
> > # LuceneStorage.fieldInfos=url,content
> > # LuceneStorage.fieldInfos.url = Index,Store
> > # LuceneStorage.fieldInfos.content = Index,Store,Tokenize
>
> I like this Lucene part - specifying fields' characteristics via
> properties.
>
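
Thanks. Just to show what I have in mind: those flags would more or less
translate into the Lucene Field constructor like this (a sketch; the parsing
code doesn't exist yet):

import org.apache.lucene.document.Field;

// Sketch: turn a flag string like "Index,Store,Tokenize" into a Lucene Field.
public class FieldInfo {

    public static Field makeField(String name, String value, String flags) {
        boolean store    = flags.indexOf("Store") != -1;
        boolean index    = flags.indexOf("Index") != -1;
        boolean tokenize = flags.indexOf("Tokenize") != -1;
        return new Field(name, value, store, index, tokenize);
    }
}

So "LuceneStorage.fieldInfos.url = Index,Store" would become an indexed,
stored, untokenized field, and "content" an indexed, stored, tokenized one.
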
> > # manually define host synonyms. I don't know if there's a better way
> > than
> > the following, and if the method used here is possible anyway (one
> > property
> > two times)
>
> No, you can't do that, I'm pretty sure.  Properties class is subclass
> of Hash(table?) I think, so keys (prop names) would clash.

ok.

>
> > HostManager.synonym=www.foo1.bar.com,www.foo2.bar.com
> > HostManager.synonym=www1.foo.com,www2.foo.com
> > # or
> > # HostManager.addSynonym=www.foo1.bar.com,www.foo2.bar.com
> > # HostManager.addSynonym=www1.foo.com,www2.foo.com
> > # coded as void setAddSynonym(String) - not so nice
> >
> > # alternative:
> > HostManager.synonyms[0]=www.foo1.bar.com,www.foo2.bar.com
> > HostManager.synonyms[1]=www1.foo.com,www2.foo.com
> > # but this would prevent adding further synonyms in other config
> > files
>
> This is also where XML may help:
> <HostSynonym>
>   <name>www.example.com</name>
>   <syn>www1.example.com</syn>
>   <syn>www2.example.com</syn>
>
>   <name>www.porkchop.com</name>
>   <syn>www1.porkchop.com</syn>
>   <syn>www2.porkchop.com</syn>
> </HostSynonym>
sure
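
However the config file side turns out, inside the HostManager I'd picture it
as nothing more than a map from every synonym to one canonical host (a sketch,
the method names are invented):

import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

// Sketch: map every host in a synonym group onto the first one, so that
// duplicate detection treats them as the same host.
public class HostSynonyms {

    private final Map synonymToCanonical = new HashMap();

    // called once per synonym line, e.g. "www.foo1.bar.com,www.foo2.bar.com"
    public void addSynonyms(String commaSeparatedHosts) {
        StringTokenizer st = new StringTokenizer(commaSeparatedHosts, ",");
        String canonical = st.nextToken().trim();
        synonymToCanonical.put(canonical, canonical);
        while (st.hasMoreTokens()) {
            synonymToCanonical.put(st.nextToken().trim(), canonical);
        }
    }

    // every URL's host would be normalized through this before visited checks
    public String normalize(String host) {
        String canonical = (String) synonymToCanonical.get(host);
        return canonical != null ? canonical : host;
    }
}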

>
> > URLScopeFilter.inScope=http://.*myHost.*
> > # or additionally URLScopeFilter.outOfScope=... ?
>
> Yes, I was going to tell you the other day.
> You need 'include pattern' as well as 'exclude pattern'.
> Include pattern may be something like *.de, exclude pattern may be
> things like /cgi-bin/ or '?' or /wpoison/ or '.cgi' or ...

There's something like that in the KnownPathsFilter class, which I should
rename to SuckyFilter at some point because it's so horrible. You can define
what URL.path.startsWith() and URL.query.startsWith() have to look like, as
far as I recall.
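
An explicit inScope/outOfScope pair would be cleaner, though. Roughly like
this (a sketch using java.util.regex for brevity -- the real filter may well
use a different regexp package, and the names are illustrative):

import java.util.regex.Pattern;

// Sketch of a scope filter with both an include and an exclude pattern.
public class ScopeFilter {

    private Pattern inScope;     // e.g. "http://.*\.de/.*"
    private Pattern outOfScope;  // e.g. ".*(/cgi-bin/|\.cgi|\?|/wpoison/).*"

    // bean-style setters, so "URLScopeFilter.inScope=..." maps straight onto them
    public void setInScope(String regex)    { inScope = Pattern.compile(regex); }
    public void setOutOfScope(String regex) { outOfScope = Pattern.compile(regex); }

    // returns the URL if it passes, null if it is filtered out
    public String handle(String url) {
        if (inScope != null && !inScope.matcher(url).matches())      return null;
        if (outOfScope != null && outOfScope.matcher(url).matches()) return null;
        return url;
    }
}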

> > I think with this configuration LARM can be much more than just a
> > crawler.
> > With a few changes it can also be used as a processor for documents
> > that
> > come over the network, i.e. in a JMS topic.
> > I haven't mentioned what I call "Sources" or message producers. These
> > are
> > active components that run in their own thread and put messages into
> > the queue.
>
> Like Fetcher threads?

Yes. But I think the other cases mentioned are a lot more interesting.

>
> > If we have a JMSStorage and a JMSSource, then the crawler can be
> > divided
> > into two pieces just from the config file.
> >  part one:   Fetcher -> JMSStorage
> >  part two:   JMSSource -> LuceneStorage
> > with all the possibilities for distribution.
> > Given a different source, one could also imagine feeding the crawler
> > with files or with URLs from a web frontend.
>
> This sounds like plans of failed 'dot coms' :)
> We could do this, and this, and this..... in the end there is no focus,
> no product, no company, no jobs.  At least there were good parties,
> free food, limos, nice SGI screens, Aeron chairs, and a nice office in
> Chinatown ;)

I understand your sarcasm at 2:30 AM, but to me this part is _very_ simple.
Distributing via JMS can (and will) be done in two seconds. To me the config
part is much more complicated, since people will start writing config files
and will get angry if the format is changed a short time later...
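
To illustrate how little there is to it, a JMSStorage would essentially be
just this (a sketch against the plain javax.jms topic API, not existing LARM
code):

import javax.jms.*;

// Sketch: a "storage" that does nothing but publish the fetched document to a
// JMS topic. A matching JMSSource on another machine subscribes to the same
// topic and feeds the messages into its local pipeline.
public class JMSStorage {

    private TopicConnection connection;
    private TopicSession session;
    private TopicPublisher publisher;

    public void open(TopicConnectionFactory factory, Topic topic) throws JMSException {
        connection = factory.createTopicConnection();
        session    = connection.createTopicSession(false, Session.AUTO_ACKNOWLEDGE);
        publisher  = session.createPublisher(topic);
        connection.start();
    }

    // called by the fetcher threads instead of (or in addition to) LuceneStorage.store()
    public void store(java.io.Serializable doc) throws JMSException {
        ObjectMessage msg = session.createObjectMessage(doc);
        publisher.publish(msg);
    }
}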


Clemens




