lucene-dev mailing list archives

From Peter Carlson <carl...@bookandhammer.com>
Subject Re: Configuration RFC
Date Sun, 14 Jul 2002 01:28:28 GMT
On 7/13/02 1:47 PM, "Clemens Marschner" <cmad@lanlab.de> wrote:

> 
>> I think you may have mentioned it, but how do the Sources fit in?
>> 
>> For example, one of the goals I am trying to achieve is to get a URL
>> (xyz.html) and then change the URL based on some pattern, e.g. change
>> xyz.html to xyz.xml and get the .xml as a request.
>> I think you mentioned below that the source would send another request
>> message to get the data and then eat the message if it didn't also want to
>> get the .html file? Or would this be the message handler?
> 
> Hm, I think most of your questions can be answered by the files in
> http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/contributions/webcrawler-LARM/doc/
> 
> Changing the URL would be very easy with a URLRewriteFilter, e.g.
> 
> class RewriteFilter implements Filter {
>     public Message handleMessage(Message m) {
>         URL u = ((URLMessage) m).getURL();
>         // do something with the URL
>         ((URLMessage) m).setURL(u);
>         return m;
>     }
> }
> 

Great
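
For my xyz.html-to-xyz.xml case I imagine handleMessage would look roughly
like this (just a sketch on my side, assuming URLMessage exposes
getURL()/setURL() as in your example):

    import java.net.MalformedURLException;
    import java.net.URL;

    public class HtmlToXmlRewriteFilter implements Filter {
        public Message handleMessage(Message m) {
            URLMessage um = (URLMessage) m;
            String s = um.getURL().toExternalForm();
            if (s.endsWith(".html")) {
                try {
                    // request the .xml variant instead of the .html page
                    um.setURL(new URL(s.substring(0, s.length() - 5) + ".xml"));
                } catch (MalformedURLException e) {
                    // keep the original URL if the rewrite fails
                }
            }
            return m;
        }
    }

Eating the message (however the pipeline drops messages) would then only be
needed when I don't want the .html fetched at all.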

>> Also, would the source be able to say, I want to get all files which meet
>> this pattern and blindly attempt to get the set of files with changing
>> parameters? A message generator?
> 
> I don't really know if I get your point. What do you want to accomplish?

So there are some sites whose content matches a given pattern, like
article20020701-1.html.

It is much easier to crawl those if you just fetch the articles based on a
pattern of article[date]-[sequence].html rather than reaching them through
links. This is what I would like to be able to accomplish.

> 
>> I know that these are specifics, but I would like to know how they fit into
>> the architecture. It seems like there could be a URL/message generator which
>> puts potential URLs into the queue, and if they didn't exist it would
>> just log that. Is this what you have architected?
> 
> Yes, from what you're writing I think this would be an application.
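
To make that concrete, I was picturing a small seeding application along
these lines (purely hypothetical; ArticleUrlGenerator, MessageQueue and the
URLMessage constructor are just my guesses at the shape of the API):

    import java.net.URL;
    import java.text.SimpleDateFormat;
    import java.util.Calendar;

    // Hypothetical seeder: puts article[date]-[sequence].html URLs into the queue.
    public class ArticleUrlGenerator {
        public static void seed(MessageQueue queue, Calendar day, int maxSequence)
                throws Exception {
            String date = new SimpleDateFormat("yyyyMMdd").format(day.getTime());
            for (int seq = 1; seq <= maxSequence; seq++) {
                URL u = new URL("http://www.example.com/article" + date + "-" + seq + ".html");
                // URLs that turn out not to exist would simply be logged and dropped
                queue.putMessage(new URLMessage(u));
            }
        }
    }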
> 
>> Also, should there be a layer for post-processing, or is that a filter? So if
>> you got an XML file and wanted to transform it, you could use a filter?
> 
> At this point this would be a storage in the storage pipeline, although the
> name is a little misleading.
> Filters are only used to process links (so-called URLMessages) before they
> get into the crawler threads. The output of these threads is put into the
> "storage", which can be a storage pipeline that works just like the filter
> pipeline.
> In this storage pipeline you can do with the document whatever you want,
> even post-processing. The object put into the storage is a "WebDocument"
> which contains the URL, the document's title, mime type, size, date, and a
> set of name-value pairs which include the raw document by default.
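
So a post-processing step for XML files would just be another storage in
that pipeline, something like this (my sketch; I'm guessing at the Storage
interface and the WebDocument accessors beyond what you've described):

    // Hypothetical post-processing storage: rewrites the raw content of XML
    // documents before the next storage in the pipeline sees them.
    // Storage/WebDocument are your types; the method names here are guesses.
    public class XmlTransformStorage implements Storage {
        public void store(WebDocument doc) {
            if ("text/xml".equals(doc.getMimeType())) {
                byte[] raw = (byte[]) doc.getField("content");  // raw document bytes
                doc.setField("content", transform(raw));        // e.g. an XSLT step
            }
            // the pipeline would then pass doc on to the next storage
        }

        private byte[] transform(byte[] raw) {
            // ... XSLT or any other post-processing ...
            return raw;
        }
    }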
> 
>> This sourcing might also work really well with the Cocoon project.
> 
> Yes, probably.
> 
>> Is there any built-in support for only getting new files (or updating
>> changed URLs) or is that the job of the storage layer?
> 
> I have written an experimental repository that registers itself as a storage
> and a filter. From the storage point of view, it puts all URLs it gets in a
> MySQL database. When acting as a filter, it reads them from the database and
> adds the date when it was last crawled to the URLMessage. The crawling task
> then sends an "If-Modified-Since" header and stops crawling the document if
> it was not modified.
> Unfortunately it turned out that the storage itself is way too slow, slower
> than crawling all documents from scratch.
> I haven't checked it in yet; please let me know if you're interested. The
> problem I have with it is that some config stuff is included in the source
> code, and I'd like to move these URLs out of the source code first. That's
> why I put so much emphasis on the configuration issue at this point.
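
Even if the repository itself is slow, the HTTP side sounds simple enough. I
imagine the conditional fetch looks roughly like this (a sketch with plain
HttpURLConnection, not your actual crawling task):

    import java.net.HttpURLConnection;
    import java.net.URL;

    // Sketch of a conditional GET: skip the document if it hasn't changed
    // since the last-crawled date the repository attached to the URLMessage.
    public class ConditionalFetch {
        public static boolean modifiedSince(URL url, long lastCrawled) throws Exception {
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setIfModifiedSince(lastCrawled);  // sends the If-Modified-Since header
            boolean modified =
                conn.getResponseCode() != HttpURLConnection.HTTP_NOT_MODIFIED;
            conn.disconnect();
            return modified;
        }
    }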
> 
>> What would be the unique ID for each URL/dynamic URL retrieved into the
>> storage? The same URL could have different data.
> 
> Hm, at the moment the URL itself is the unique ID. What parameters could
> cause the data to be different? I can only imagine the URL, a cookie, and the
> time of the crawl. Cookies are managed by the HTTP layer at this time; I
> don't even know exactly how cookies are treated at the moment.
> To be more specific, I didn't expect a single URL to point to different
> kinds of pages, but rather different URLs to point to the same page. Therefore
> the URLs are "normalized" to lower the chance that a URL is ambiguous, e.g.
> http://host/path1/./path2 is normalized to http://host/path1/path2
> 

This comes up when there is an MVC URL methodology or a URL with POST
parameters. So /app1/ShowResults could show lots of different results
depending on which parameters were passed.
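
To illustrate, with java.net.URI as a stand-in for whatever normalization
LARM actually does:

    import java.net.URI;

    public class NormalizeDemo {
        public static void main(String[] args) throws Exception {
            // path normalization collapses "." segments, as in your example
            System.out.println(new URI("http://host/path1/./path2").normalize());
            // -> http://host/path1/path2

            // but the same path with different parameters is still a different key,
            // and POST parameters never show up in the URL at all
            URI a = new URI("http://host/app1/ShowResults?q=lucene");
            URI b = new URI("http://host/app1/ShowResults?q=larm");
            System.out.println(a.normalize().equals(b.normalize()));  // false
        }
    }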




>> Question 4.
>> URLScopeFilter - Is this just a wildcard-based system, or does it also
>> handle full regex?
> 
> full Perl5 regex, provided by the Apache ORO library. I.e. I'm using the
> regex
> http://[^/]*\(uni-muenchen\.de\|lmu\.de\|lrz-muenchen\.de\|leo\.org\|studentenwerk\.mhn\.de\|zhs-muenchen\.de\).*

Great
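
Just to check my understanding of how such a pattern gets applied, I'd
picture the scope check roughly like this (a sketch using the ORO classes,
not the actual URLScopeFilter code):

    import org.apache.oro.text.regex.MalformedPatternException;
    import org.apache.oro.text.regex.Pattern;
    import org.apache.oro.text.regex.Perl5Compiler;
    import org.apache.oro.text.regex.Perl5Matcher;

    // Sketch: keep only URLs whose string form matches the configured regex.
    public class ScopeCheck {
        private final Pattern pattern;
        private final Perl5Matcher matcher = new Perl5Matcher();

        public ScopeCheck(String regex) throws MalformedPatternException {
            pattern = new Perl5Compiler().compile(regex);
        }

        public boolean inScope(String url) {
            return matcher.matches(url, pattern);
        }
    }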


> 
>> Question 6
>> Content validator. Is there anywhere one could program in a rule to say,
>> only the date has changed on this web page, so I don't want to consider it
>> changed or update it.
> 
> You mean computing some sort of checksum (like the "Nilsimsa" mentioned in a
> thread some days ago)? This could probably be done within a storage. But you
> need fast access to a repository to accomplish the comparison you mentioned.
> And you'd have to download the file to compute that checksum.

What I was thinking was being able to do a diff and then say: if the only
thing that changed matches this pattern, don't treat the document as changed.
The idea would be to ignore items like dates or counters which change
dynamically.
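
Concretely, something like stripping the volatile parts before checksumming
(hypothetical sketch; the patterns and the MD5 choice are only for
illustration):

    import java.security.MessageDigest;

    // Sketch: ignore fragments that change on every visit (dates, hit counters)
    // before deciding whether a page has really changed.
    public class StableChecksum {
        private static final String DATE_PATTERN    = "\\d{4}-\\d{2}-\\d{2}";
        private static final String COUNTER_PATTERN = "hits:\\s*\\d+";

        public static byte[] checksum(String content) throws Exception {
            String stable = content.replaceAll(DATE_PATTERN, "")
                                   .replaceAll(COUNTER_PATTERN, "");
            return MessageDigest.getInstance("MD5").digest(stable.getBytes("UTF-8"));
        }
    }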



> 
>> I like the idea of the overwriting config files, but I am personally a fan
>> of one big file that the user can configure. This seems to lend itself to
>> less debugging. So maybe instead of multiple config files based on the URL,
>> maybe a master set of config options with the overwriting parameters based
>> on the URL pattern.
>> Something like
>> 
>> <config>
>>     <global>
>>         <follow-linkMatchPattern>*</follow-linkMatchPattern>
>>     </global>
>>     <site urlMatch="*.apache.org">
>>         <follow-linkMatchPattern>*.html</follow-linkMatchPattern>
>>     </site>
>> </config>
>> 
>> So here the default would follow all links, but if it were in apache.org
>> then it would only follow .html links. I don't know if this is a real
>> parameter, just an example. This is how the Apache web server works. Tomcat
>> works in a similar way, but there are different files (web.xml), although
>> this is mostly because they are completely different applications
>> potentially written by different people.
> 
> I don't think this contradicts what I outlined.
> I think the separation I mentioned is necessary to divide the crawler's
> overall configuration (i.e. which filters are used and how they are put
> together) from the specific parameters for a crawl (like the example you
> mentioned).
> What you mean is overwriting general crawl parameters with specific crawl
> parameters for specific domains. This is a new issue that I haven't
> addressed in the RFC yet.
> 



> Besides:
> you have used an XML format for the configuration. I personally think XML is
> often overengineering. The current proposal comes with Java property files.
> I have found property files to be more straightforward, easier to write,
> easier to read, and you don't need a 2 MB XML parser. I'm a fan of XP's
> "implement it the most simple way you could possibly imagine".
> And the times I have used XML I've used an XML->Java converter (Castor XML)
> that spared me from parsing it "manually"; another tool I don't want to use
> in this project. What do you think? If I had to use XML I'd probably have to
> delve into the Tomcat sources to find out how they cope with config stuff.
> 

I don't care if it's XML format; XML tends to be clearer about the
relationships between parameters. Large property files can get confusing
without good comments.
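
For comparison, my overriding example above would presumably look something
like this as property files (just my guess at the property-file style, not
your actual format):

    # global defaults (e.g. crawl.properties)
    follow.linkMatchPattern = .*

    # per-site override (e.g. a second file selected by a URL pattern)
    # note the doubled backslashes: property files eat single backslashes
    site.urlMatch           = .*\\.apache\\.org
    follow.linkMatchPattern = .*\\.html

Either way is fine with me as long as the override relationship stays easy
to see.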




I'll go and read more about what you have already done and try to provide
more constructive comments.

Thanks again for providing this infrastructure. 



