cocoon-dev mailing list archives

From Giacomo Pati <pati_giac...@yahoo.com>
Subject Re: [C2] Link filtering and Content aggregation
Date Thu, 05 Oct 2000 12:37:40 GMT

--- Ross Burton <ross.burton@mail.com> wrote:
> > Has this something to do with the known "robots.txt" file used to
> > prevent spiders from stepping into specific URIs?
> > 
> > Shouldn't we express the crawl attribute to the outside by a request
> > URI to "robots.txt"? Or is crawling from the command line different
> > from crawling by a spider? The sitemap can check that URI if it fails
> > to select a resource in a pipeline (falling through all matches).
> 
> Good point.  Does anyone have the robots.txt spec so we can decide
> this?

You can find a spec at
http://info.webcrawler.com/mak/projects/robots/norobots.html
It states that the local URL "/robots.txt" must list every URL that is
not to be crawled. In C2 this means that the root sitemap has to ask all
its sub sitemaps (and they, in turn, their own) to deliver their
non-crawlable URIs.
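
To make the idea concrete, here is a rough sketch of how the root sitemap
could collect the non-crawlable URIs from its sub sitemaps and render them
as a /robots.txt body. SitemapNode, its fields and robotsTxt() are
hypothetical names made up for this illustration; they are not part of C2,
and the real sitemap engine may need a different shape.

```java
import java.util.ArrayList;
import java.util.List;

public class RobotsTxtSketch {

    /** Hypothetical stand-in for a sitemap; not an actual Cocoon class. */
    static class SitemapNode {
        final String prefix;                                  // mount point relative to the parent
        final List<String> notCrawlable = new ArrayList<>();  // URIs relative to this sitemap
        final List<SitemapNode> subSitemaps = new ArrayList<>();

        SitemapNode(String prefix) {
            this.prefix = prefix;
        }

        /** Adds this sitemap's non-crawlable URIs, then asks each sub sitemap for its own. */
        void collect(String base, List<String> out) {
            String here = base + prefix;
            for (String uri : notCrawlable) {
                out.add(here + uri);
            }
            for (SitemapNode sub : subSitemaps) {
                sub.collect(here, out);
            }
        }
    }

    /** Renders the collected URIs as Disallow lines that apply to every robot. */
    static String robotsTxt(SitemapNode root) {
        List<String> uris = new ArrayList<>();
        root.collect("", uris);
        StringBuilder sb = new StringBuilder("User-agent: *\n");
        for (String uri : uris) {
            sb.append("Disallow: ").append(uri).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        SitemapNode root = new SitemapNode("/");
        root.notCrawlable.add("private/");
        SitemapNode docs = new SitemapNode("docs/");
        docs.notCrawlable.add("drafts/");
        root.subSitemaps.add(docs);
        System.out.print(robotsTxt(root));
    }
}
```

With the example tree in main(), the root contributes /private/ and the
mounted sub sitemap contributes /docs/drafts/, each on its own Disallow
line.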

> > Anybody in touch with those error numbers used? Are there any free to
> > use to implement custom needs?
> 
> Assuming by error numbers you mean HTTP errors (which is sensible), I
> just happen to have RFC 2616 (also known as HTTP/1.1) to hand, so I'll
> extract it here:

If we implement the /robots.txt URI, no status code is necessary
(except 404 "Not Found").

> > > I take it that in this example the resource /sitebar returns the XML:
> > > 
> > >   <sitebar>
> > >     <item xlink:href=".."/>
> > >     ....
> > >   </sitebar>
> > 
> > Are you sure this should return the XML? Is this an implicit
> > "cocoon-view=first" parameter?
> 
> I was assuming that the /sitebar resource used the XML serializer, and
> the included XML was that output from the serializer.
> 
> > > This requires a custom URL handler, doesn't it?  How is this going
> > > to be handled?  org.apache.cocoon.utils.URL?
> > 
> > I don't know if this is possible. Does such a custom URL handler have
> > all the information necessary to fulfill that need? Wouldn't it be
> > better if the sitemap engine itself checked this and somehow
> > recursively called itself?
> 
> As I see it, the cocoon: URL handler would be part of the C2 package, so
> it has access to the internal workings of the C2 engine.

Yes, you're right. We have to pass the necessary information to it.
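
As a minimal sketch of what such a handler might look like, assuming the
standard java.net.URLStreamHandler mechanism is used: the handler below is
given a reference to "the engine" when it is created, which is how it gets
at the internal information. CocoonHandler, fetch() and the Function-based
engine are hypothetical stand-ins for this illustration, not C2 classes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

public class CocoonUrlSketch {

    /** Handler for a hypothetical "cocoon:" scheme; the function stands in for the sitemap engine. */
    static class CocoonHandler extends URLStreamHandler {
        private final Function<String, String> engine;

        CocoonHandler(Function<String, String> engine) {
            this.engine = engine;
        }

        @Override
        protected URLConnection openConnection(URL url) {
            return new URLConnection(url) {
                @Override
                public void connect() {
                    connected = true; // nothing to open: the request stays in-process
                }

                @Override
                public InputStream getInputStream() throws IOException {
                    // Ask the engine for the resource instead of going over the network.
                    String xml = engine.apply(getURL().getPath());
                    if (xml == null) {
                        throw new IOException("no pipeline matched " + getURL());
                    }
                    return new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
                }
            };
        }
    }

    /** Resolves a cocoon: URL through the given engine and returns the content as a string. */
    static String fetch(String spec, Function<String, String> engine) throws IOException {
        URL url = new URL(null, spec, new CocoonHandler(engine));
        try (InputStream in = url.openConnection().getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        Function<String, String> engine =
            path -> "/sitebar".equals(path) ? "<sitebar><item/></sitebar>" : null;
        System.out.println(fetch("cocoon:/sitebar", engine));
    }
}
```

Passing the handler directly to the URL constructor avoids registering a
JVM-wide protocol handler, so only code that holds the engine reference
can resolve cocoon: URLs.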

Giacomo

=====
--
PWR GmbH, Organisation & Entwicklung      Tel:   +41 (0)1 856 2202
Giacomo Pati, CTO/CEO                     Fax:   +41 (0)1 856 2201
Hintereichenstrasse 7                     Mailto:Giacomo.Pati@pwr.ch
CH-8166 Niederweningen                    Web:   http://www.pwr.ch
