lucene-dev mailing list archives

From "Clemens Marschner" <>
Subject Re: Configuration RFC
Date Sun, 14 Jul 2002 09:22:03 GMT
> >> Also, would the source be able to say, "I want to get all files which
> >> match this pattern," and blindly attempt to get the set of files with
> >> changing parameters? A message generator?
> >
> > I don't really know if I get your point. What do you want to accomplish?
> So there are some sites which have content which matches a given pattern,
> like article20020701-1.html.
> It is much easier to crawl if you just get the article based on a pattern
> article[date]-[sequence].html than getting to it through links. This is
> what I would like to be able to accomplish.

This could be done in two ways:
Either you discover such a pattern in a filter. Then this filter could
generate new messages and put them in front of the message handler queue.
I was already thinking of a very "greedy" crawling mechanism where each
URL found leads to one message for each directory it is contained in, i.e.
     http://host/my/little/path/page.html ->
     ->   http://host/my/little/path/
     ->   http://host/my/little/
     ->   http://host/my/
     ->   http://host/
Most of these messages will be filtered out by the VisitedFilter, but this
can also discover "hidden" directories... probably more than the webmaster
would like...
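A minimal sketch of that directory expansion (the class name and the use of java.net.URI are my own illustration, not the crawler's actual API):

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

public class DirectoryExpander {
    /** Expands a URL into one candidate URL per enclosing directory,
     *  from the deepest path up to the host root. */
    public static List<String> expand(String url) {
        URI uri = URI.create(url);
        List<String> out = new ArrayList<>();
        String path = uri.getPath();              // e.g. /my/little/path/page.html
        int slash = path.lastIndexOf('/');
        while (slash >= 0) {
            path = path.substring(0, slash + 1);  // keep the trailing slash
            out.add(uri.getScheme() + "://" + uri.getAuthority() + path);
            // move up one directory level
            slash = path.lastIndexOf('/', path.length() - 2);
        }
        return out;
    }
}
```

Each generated URL would then be fed back as a new message, and the VisitedFilter would drop the ones already seen.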

> > hm at the moment the URL itself is the unique ID. What parameters could
> > cause the data to be different? I can only imagine the URL, a cookie and
> > time of the crawl. Cookies are managed by the HTTP layer at this time. I
> > don't even know exactly how cookies are treated at the moment.
> > To be more specific, I haven't expected a single URL to point to different
> > kinds of pages, but different URLs to point to the same page. Therefore
> > URLs are "normalized" to lower the chance that a URL is ambiguous. E.g.
> > http://host/path1/./path2 is normalized to http://host/path1/path2
> >
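For illustration, that normalization step could be as simple as java.net.URI's built-in normalize(), which already collapses "." and ".." path segments (a sketch, not the code actually used):

```java
import java.net.URI;

public class UrlNormalizer {
    /** Collapses "." and ".." path segments so that equivalent URLs
     *  map to the same unique ID, e.g.
     *  http://host/path1/./path2 -> http://host/path1/path2 */
    public static String normalize(String url) {
        return URI.create(url).normalize().toString();
    }
}
```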
> This comes up when there is an MVC URL methodology or a URL with POST
> parameters.
> So /app1/ShowResults
> could show lots of different results depending on what parameters were
> passed.

POST operations are not supported at this time. I don't see an application
for that. POST is only used in forms, where it doesn't make sense for a
crawler to enter "some" information, or possibly with Javascript, for which
there is no suitable parser that detects location.hrefs (that will not be
easy in any but the most trivial cases). I also don't know of any crawler
that does this.

> > You mean computing some sort of checksum (like the "Nilsimsa" mentioned
> > in a thread some days ago)? This could probably be done within a storage.
> > But you'd need fast access to a repository to accomplish the comparison
> > you describe. And you'd have to download the file to compute that
> > checksum.
> What I was thinking was being able to do a difference and then to say: if
> the only thing that changed matches this pattern, then don't count it as
> changed. The idea would be to ignore items like dates or counters which
> change dynamically.

I think this is similar to what I said. There's also a paper by
Garcia-Molina et al. on this topic (see Citeseer, "Finding near replicas
of documents on the web").
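One simple way to approximate that idea is to blank out volatile fragments before checksumming, so a page whose only change is a date or counter keeps the same fingerprint. This is only a sketch under assumed patterns; the class name, regexes, and choice of MD5 are mine, not anything decided on the list:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.regex.Pattern;

public class ContentFingerprint {
    // Volatile fragments to ignore before hashing: ISO dates and hit counters.
    // These patterns are purely illustrative; a real crawler would make them
    // configurable per site.
    private static final Pattern VOLATILE =
        Pattern.compile("\\d{4}-\\d{2}-\\d{2}|\\b\\d+ hits\\b");

    /** Returns a hex MD5 checksum of the page with volatile fragments removed,
     *  so a change in a date or counter alone does not count as a change. */
    public static String fingerprint(String page) {
        String stable = VOLATILE.matcher(page).replaceAll("");
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(stable.getBytes());
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is always available
        }
    }
}
```

A pattern-based fingerprint like this is much cheaper than a full diff against a stored copy, though it still requires downloading the page first, as noted above.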

