lucene-dev mailing list archives

From "Clemens Marschner" <c...@lanlab.de>
Subject Re: Configuration RFC
Date Sat, 13 Jul 2002 20:47:33 GMT

> I think you may have mentioned it, but how do the Sources fit in?
>
> For example, one of the goals I am trying to reach would be to get a URL
> (xyz.html) and then change the URL based on some pattern. So change xyz.html
> to xyz.xml and get the .xml as a request.
> I think you mentioned below that the source would send another request
> message to get the data and then eat the message if it didn't also want to
> get the .html file? Or would this be the message handler?

Hm, I think most of your questions could be answered by the documents under
http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/contributions/webcrawler-LARM/doc/

Changing the URL would be very easy with a URL rewrite filter, e.g.

class RewriteFilter implements Filter {
    public Message handleMessage(Message m) {
        URLMessage um = (URLMessage) m;
        URL u = um.getURL();
        // do something with the URL, then put it back into the message
        um.setURL(u);
        return m;
    }
}
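
For the concrete xyz.html -> xyz.xml case from your question, a rough sketch
(class name invented; the URLMessage accessors are taken from the snippet
above) could be:

class HtmlToXmlRewriteFilter implements Filter {
    public Message handleMessage(Message m) {
        URLMessage um = (URLMessage) m;
        String s = um.getURL().toExternalForm();
        if (s.endsWith(".html")) {
            try {
                // replace the .html suffix so the crawler requests the .xml instead
                um.setURL(new java.net.URL(s.substring(0, s.length() - 5) + ".xml"));
            } catch (java.net.MalformedURLException e) {
                // should not happen, the original URL was already valid
            }
        }
        return m;
    }
}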

> Also, would the source be able to say, I want to get all files which meet
> this pattern and blindly attempt to get the set of files with changing
> parameters? A message generator?

I don't really know if I get your point. What do you want to accomplish?

> I know that these are specifics, but I would like to know how these fit into
> the architecture? It seems like there could be a URL/message generator which
> puts potential URLs into the queue, and if they didn't exist then it would
> just log that. Is this what you have architected?

Yes, from what you're writing I think this would be an application.

> Also, should there be a layer for post processing, or is that a filter? So if
> you got an XML file and wanted to transform it, then you could use a filter?

At this point, this would be a storage in the storage pipeline, although the
name is a little misleading.
Filters are only used to process links (so-called URLMessages) before they
get into the crawler threads. The output of these threads is put into the
"storage", which can be a storage pipeline that works just like the filter
pipeline.
In this storage pipeline you can do whatever you want with the document,
even post-processing. The object put into the storage is a "WebDocument",
which contains the URL, the document's title, MIME type, size, date, and a
set of name-value pairs that includes the raw document by default.
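
Just as a rough illustration of what a post-processing step in that pipeline
might look like (the Storage interface and the WebDocument accessors below are
assumed names for illustration, not necessarily the real API):

class XmlTransformStorage implements Storage {
    public WebDocument store(WebDocument doc) {
        if ("text/xml".equals(doc.getMimeType())) {
            byte[] raw = (byte[]) doc.get("document");  // assumed name of the raw-document field
            doc.put("document", transform(raw));        // e.g. apply an XSLT here
        }
        return doc;  // hand the document on to the next storage in the pipeline
    }

    private byte[] transform(byte[] raw) {
        return raw;  // placeholder for the real transformation
    }
}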

> This sourcing might also work really well with the Cocoon project.

Yes, probably.

> Is there any built in support for only getting new files (or updating
> changed URLs) or is that the job of the storage layer?

I have written an experimental repository that registers itself as both a
storage and a filter. From the storage point of view, it puts all URLs it gets
into a MySQL database. When acting as a filter, it reads them from the database
and adds the date when each URL was last crawled to the URLMessage. The
crawling task then sends an "If-Modified-Since" header and stops crawling the
document if it was not modified.
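
To illustrate the HTTP side of that mechanism with plain java.net (this is not
the crawler's own HTTP code, it only shows the header/304 handshake):

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class ModifiedCheck {
    /** true if the server reports the document changed since lastCrawled (millis). */
    static boolean modifiedSince(URL url, long lastCrawled) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setIfModifiedSince(lastCrawled);  // adds the If-Modified-Since header
        // the server answers 304 Not Modified if nothing changed; otherwise the body follows
        return conn.getResponseCode() != HttpURLConnection.HTTP_NOT_MODIFIED;
    }
}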
Unfortunately it turned out that the storage itself is way too slow; slower
than simply crawling all documents from scratch.
I haven't checked it in yet; please let me know if you're interested. The
problem I have with it is that some configuration (such as the database URLs)
is still included in the source code, and I'd like to move those URLs out of
the source code first. That's why I put so much emphasis on the configuration
issue at this point.

> What would be the unique ID in the storage for each retrieved URL / dynamic
> URL? The same URL could have different data.

Hm, at the moment the URL itself is the unique ID. What parameters could
cause the data to be different? I can only think of the URL, a cookie, and the
time of the crawl. Cookies are managed by the HTTP layer at this time; I
don't even know exactly how cookies are treated at the moment.
To be more specific, I haven't expected a single URL to point to different
kinds of pages, but rather different URLs to point to the same page. Therefore
the URLs are "normalized" to lower the chance that a URL is ambiguous, e.g.
http://host/path1/./path2 is normalized to http://host/path1/path2
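
A minimal sketch of that kind of normalization (just the idea, not the actual
normalizer code):

import java.net.MalformedURLException;
import java.net.URL;

public class UrlNormalizer {
    static URL normalize(URL u) throws MalformedURLException {
        String path = u.getPath();
        int i;
        while ((i = path.indexOf("/./")) != -1) {
            // collapse "/./" segments: /path1/./path2 -> /path1/path2
            path = path.substring(0, i) + path.substring(i + 2);
        }
        // lower-case the host as well; query strings are left out for brevity
        return new URL(u.getProtocol(), u.getHost().toLowerCase(), u.getPort(), path);
    }
}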

> Question 4.
> URLScopeFilter - Is this just a wildcard based system, or does it also
> handle full regex?

Full Perl5 regexes, provided by the Apache ORO library. For example, I'm using
the regex
http://[^/]*\(uni-muenchen\.de\|lmu\.de\|lrz-muenchen\.de\|leo\.org\|studentenwerk\.mhn\.de\|zhs-muenchen\.de\).*
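
For reference, a scope check with ORO looks roughly like this (shortened
pattern, written with plain Perl5 grouping; not the actual URLScopeFilter code):

import org.apache.oro.text.regex.MalformedPatternException;
import org.apache.oro.text.regex.Pattern;
import org.apache.oro.text.regex.Perl5Compiler;
import org.apache.oro.text.regex.Perl5Matcher;

public class ScopeCheck {
    public static void main(String[] args) throws MalformedPatternException {
        Pattern scope = new Perl5Compiler().compile(
                "http://[^/]*(uni-muenchen\\.de|lmu\\.de|leo\\.org).*");
        Perl5Matcher matcher = new Perl5Matcher();  // not thread-safe, use one per thread
        System.out.println(matcher.matches("http://www.leo.org/index.html", scope));  // true
    }
}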

> Question 5.
> How do you define the pipeline. Right now you have Standardpipeline, but I
> don't see a configuration for what the standardpipeline is.

Please refer to the document I mentioned above. I don't know what you mean by
Standardpipeline.

> Question 6
> Content validator. Is there anything which could be programmed in to say:
> only the date has changed on this web page, so I don't want to consider it
> changed or update it.

You mean computing some sort of checksum (like the "Nilsimsa" mentioned in a
thread some days ago)? This could probably be done within a storage. But you
need fast access to a repository to accomplish the comparison you mentioned,
and you'd have to download the file to compute that checksum.
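
For the exact-checksum part, the core library is enough (sketch below);
catching "only the date changed" would need a similarity hash like the
Nilsimsa mentioned above rather than an exact digest:

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DocumentChecksum {
    /** exact MD5 digest of the raw document; any changed byte changes the result */
    static byte[] checksum(byte[] documentBytes) throws NoSuchAlgorithmException {
        return MessageDigest.getInstance("MD5").digest(documentBytes);
    }
}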

> Question 7.
> Notification - Is there some way to notify (via email) that a change has
> occurred to a given file, or that a new file is available? Is the thought
> that this would be part of logging?

This could probably be done within the logging stuff. You could replace the
standard logger (which is not thread-safe and thus very fast) with Log4J and
use an appender that suits your needs.
A prerequisite would again be the repository I mentioned; perhaps you could
spend some time making it really fast...
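
For example, with Log4J's SMTPAppender such a notification could be configured
along these lines (host and addresses are placeholders; by default the
appender only mails its buffer when an ERROR-level event arrives):

log4j.rootLogger=INFO, mail
log4j.appender.mail=org.apache.log4j.net.SMTPAppender
log4j.appender.mail.SMTPHost=smtp.example.org
log4j.appender.mail.From=crawler@example.org
log4j.appender.mail.To=admin@example.org
log4j.appender.mail.Subject=Crawler notification
log4j.appender.mail.BufferSize=32
log4j.appender.mail.layout=org.apache.log4j.PatternLayout
log4j.appender.mail.layout.ConversionPattern=%d %p %m%n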

> I like the idea of the overwriting config files, but I am personally a fan
> of one big file that the user can configure. This seems to lend itself to
> less debugging. So maybe instead of multiple config files based on the URL,
> maybe a master set of config options with the overwriting parameters based
> on the URL pattern.
> Something like
>
> <config>
>     <global>
>         <follow-linkMatchPattern>*</follow-linkMatchPattern>
>     </global>
>     <site urlMatch="*.apache.org">
>         <follow-linkMatchPattern>*.html</follow-linkMatchPattern>
>     </site>
> </config>
>
> So here the default would follow all links, but if it were in apache.org
> then it would only follow .html links. I don't know if this is a real
> parameter, just an example. This is how the Apache web server works. Tomcat
> works in a similar way, but there are different files (web.xml), although
> this is mostly because they are completely different applications
> potentially written by different people.

I don't think this contradicts the approach I outlined.
I think the separation I mentioned is necessary to divide the crawler's
overall configuration (i.e. which filters are used and how they are put
together) from the specific parameters for a crawl (like the example you
mentioned).
What you mean is overriding general crawl parameters with specific crawl
parameters for specific domains. This is a new issue that I haven't
addressed in the RFC yet.
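
Purely as an illustration, in the property file format such overrides might
end up looking like this (the keys are made up, nothing like this exists yet):

# default: follow every link
filter.followLinks=.*
# override for one domain (hypothetical key syntax; backslash doubled for the properties format)
filter.followLinks.apache.org=.*\\.html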

Besides: you have used an XML format for the configuration. I personally think
XML is often over-engineering. The current proposal comes with Java property
files. I have found property files to be more straightforward, easier to
write, easier to read, and you don't need a 2 MB XML parser. I'm a fan of XP's
"implement it in the simplest way you could possibly imagine".
The times I have used XML, I've used an XML-to-Java converter (Castor XML)
that spared me from parsing it "manually"; that's another tool I don't want to
use in this project. What do you think? If I had to use XML, I'd probably have
to delve into the Tomcat sources to find out how they cope with their config
stuff.
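
Loading such a property file needs nothing beyond the core library, e.g.:

import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

public class ConfigLoader {
    static Properties load(String fileName) throws IOException {
        Properties props = new Properties();
        FileInputStream in = new FileInputStream(fileName);
        try {
            props.load(in);  // plain key=value parsing, no external parser needed
        } finally {
            in.close();
        }
        return props;
    }
}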


Clemens





