cocoon-dev mailing list archives

From Giacomo Pati <Giacomo.P...@pwr.ch>
Subject Re: [C2] Link filtering and Content aggregation
Date Thu, 05 Oct 2000 20:14:16 GMT
Stefano Mazzocchi wrote:
> 
> Giacomo Pati wrote:
> >
> > --- Ross Burton <ross.burton@mail.com> wrote:
> > > > If you only used stylebook, I know you love it and don't see any
> > > > problems with it: it's a magic tool that does the job for you (more or
> > > > less an autodoc)... but if you ever tried to write a skin for it...
> > > > well, you know what I mean when I say there are problems.
> > >
> > > Do _not_ remind me of the time I wrote a new skin for Stylebook!
> > >
> > >
> > > > link filtering
> > > > --------------
> > > >
> > > > IMO, we need to expand the sitemap semantics to allow resources to be
> > > > blocked from CLI crawling. The best way, IMO, is to add a specific
> > > > attribute to the resource indicating elements... these elements are
> > > >
> > > >  - match
> > > >  - mount
> >
> > There is the "select" as well because someone can write a uri-selector
> > based on the selector interface (if you want to apply the crawl
> > attribute deep down the pipeline tree). We have decided that a pipeline
> > element only has "match" elements as direct children, so that we could
> > say a crawl attribute can be applied to those immediate "match"
> > elements only.
> 
> Hmmmm, I think that crawling should be applied to "match" only, even if
> the matcher is not using URI to resolve the match. Selection happens
> only after the matching has taken place so this covers them all. Don't
> you think?

Yes, you are right. We decided that the map:pipeline element has only
map:match elements. 
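To illustrate the point (the attribute name and exact syntax are still open, so this is only a sketch), the crawl attribute would sit only on the immediate map:match children of a pipeline:

```xml
<map:pipeline>
 <!-- crawlable by the CLI -->
 <map:match pattern="docs/**">
  ..
 </map:match>
 <!-- blocked from CLI crawling; attribute name still undecided -->
 <map:match pattern="download/**" crawl="no">
  ..
 </map:match>
</map:pipeline>
```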

> > > > and we just have to define an attribute name between
> > > >
> > > >  - crawl
> > > >  - crawlable
> > > >  - walk
> > > >  - walkable
> > > >  - ???
> >
> > Has this something to do with the well-known "robots.txt" file used to
> > prevent spiders from stepping into specific URIs?
> 
> more or less.... but it's pretty easy to have an implicit "robots.txt"
> resource created directly by Cocoon, based on sitemap parameters, even
> if the file is not present.

Yes, but (after reading it) the robots.txt spec says that there is only
one robots.txt, and its request URI is "/robots.txt" for the whole site
(not a sub-context like "/cocoon/robots.txt").

> > Shouldn't we express the crawl attribute to the outside via a request
> > URI to "/robots.txt"?
> 
> exactly

I must disagree after reading the robots.txt spec. It's not possible for
Cocoon.

> > Or is crawling from the commandline and crawling by
> > a spider different?
> 
> good point, didn't think of that. what do you think?

Using /robots.txt means writing the robots.txt by hand, deploying it
into the root context, and not specifying it in the sitemap. If we can't
exactly simulate a command line environment (like the http environment)
we need to distinguish between them, because in fact there is no
difference between a spider and a browser.
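For reference, the exclusion standard expects exactly one such file, served from the server root; a hand-written /robots.txt blocking a Cocoon sub-context would look something like this (the path is made up for illustration):

```
User-agent: *
Disallow: /cocoon/internal/
```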

> 
> > The sitemap can check that uri if it fails to
> > select a resource in a pipeline (falling through all matches).
> 
> right.
> 
> > > >
> > > > for example
> > > >
> > > >  <map:match pattern="someuri" crawl="no">
> > > >   ..
> > > >  </map:match>
> > > >
> > > > will return a specific error number to the CLI requesting the page.
> >
> > Is anybody familiar with the error numbers in use? Are there any free
> > for implementing custom needs?
> >
> > > >
> > > > What do you think?
> > >
> > > The sitemap needs this sort of flexibility, there could be a section of
> > > the URI space which could potentially return gigabytes of files (for an
> > > example, see rpmfind.net).  I'm +1 on... crawl="yes|no".
> >
> > I suggest that the Environment interface needs to be expanded for that
> > to make the sitemap engine able to query if a crawl is taking place (if
> > we don't choose the "robot.txt" mentioned above). I still don't want
> > the sitemap engine to deal with the Request/Response/Context objects
> > for that (until someone convinces me with a good reason). All the sitemap
> > engine needs must be expressed by the Environment object passed in
> > (even if it may duplicate information available in the
> > Request/Response/Context objects).
> 
> I agree. +1

Ok. 
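As a sketch of what that Environment extension could look like (the method and class names here are invented for illustration, not the actual Cocoon API):

```java
// Hypothetical sketch only -- names are invented, not actual Cocoon API.
interface Environment {
    /** True when the request originates from the command-line crawler. */
    boolean isCrawlRequest();
}

// A CLI environment would answer true ...
class CommandLineEnvironment implements Environment {
    public boolean isCrawlRequest() { return true; }
}

// ... while the servlet/http environment answers false.
class HttpEnvironment implements Environment {
    public boolean isCrawlRequest() { return false; }
}

public class CrawlCheckDemo {
    public static void main(String[] args) {
        // The sitemap engine only ever sees the Environment interface,
        // never the Request/Response/Context objects.
        Environment env = new CommandLineEnvironment();
        System.out.println(env.isCrawlRequest());
    }
}
```

This keeps the sitemap engine decoupled from the Request/Response/Context objects, as argued above.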

> 
> > > > Content Aggregation
> > > > -------------------
> > > >
> > >
> > > > It was already proposed to use the "cocoon:" protocol and to access them
> > >
> > > And I'm a big +20 on this.
> > >
> > > > so
> > > >
> > > >  <sitebar xinclude:href="cocoon:/sitebar"/>
> > > >
> > > > is expanded at runtime as
> > > >
> > > >  <sitebar>
> > > >   <item xlink:href=".."/>
> > > >   <item xlink:href="index"/>
> > > >   <item xlink:href="user-guide"/>
> > > >  </sitebar>
> > >
> > > I take it that in this example the resource /sitebar returns the XML:
> > >
> > >   <sitebar>
> > >     <item xlink:href=".."/>
> > >     ....
> > >   </sitebar>
> >
> > Are you sure this should return the XML? Is this an implicit
> > "cocoon-view=first" parameter?
> 
> no, no, this is not an XInclude, but an XLink, it will simply be
> transformed to <a href=""> and passed to the client, no aggregation
> takes place here.
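In other words, for the expanded sitebar above, a stylesheet in a later pipeline step would just serialize the xlink items as ordinary links; the output might look something like this (hypothetical markup, only for illustration):

```xml
<!-- the xlink:href items are not aggregated; a later stylesheet
     simply turns them into plain HTML anchors for the client -->
<div class="sitebar">
 <a href="..">..</a>
 <a href="index">index</a>
 <a href="user-guide">user-guide</a>
</div>
```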
> 
> > > And _replaces_ the original <sitebar> element.  The behaviour would
> > > be the same if the original element was, for example: <foo
> > > xinclude:href="cocoon:/sitebar"/>, right? I'd feel safer using just
> > > <xinclude:include href="cocoon://sitebar"/>, as I think the syntax is
> > > clearer.
> >
> > This calls for the XIncludeTransformer and it seems clearer to me too.
> > Is this where "content aggregation" takes place, for example? And
> > where else?
> 
> I think my RT answered this. If not, say so.
> 
> > >
> > > Oh, IIRC the URI RFC states that the format is protocol://host/path, so
> > > the resource should be cocoon://sitebar or cocoon:///sitebar
> >
> > True!
> 
> false! :)
> 
> cocoon://sitebar is wrong (as Peter correctly stated)... but I'm sure
> *many* will get it wrong, so it's no big deal to ignore the purity of
> the URI spec and allow this to work as well. I already picture "tons" of
> user emails about this :// not working. :(
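The purity argument is easy to check with any generic URI parser. Here is a small illustration using java.net.URI (which postdates this thread and is used purely to show how "//" shifts the name into the authority component):

```java
import java.net.URI;

public class CocoonUriDemo {
    public static void main(String[] args) {
        // With "//", a generic parser reads "sitebar" as the authority
        // (host) component -- the path comes out empty.
        URI doubleSlash = URI.create("cocoon://sitebar");
        System.out.println(doubleSlash.getAuthority()); // sitebar
        System.out.println(doubleSlash.getPath());      // (empty)

        // With a single slash, "sitebar" stays in the path, which is
        // what the sitemap needs in order to match the resource.
        URI singleSlash = URI.create("cocoon:/sitebar");
        System.out.println(singleSlash.getPath());      // /sitebar
    }
}
```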
> 
> > > depending
> > > on the sitemap.
> > >
> > > This requires a custom URL handler, doesn't it?  How is this going to
> > > be
> > > handled?  org.apache.cocoon.utils.URL?
> >
> > I don't know if this is possible. Does such a custom URL handler have
> > all the information necessary to fulfill that need? Wouldn't it be
> > better if the sitemap engine itself checks this and somehow recursively
> > calls itself?
> 
> Totally. +1000 to this until we have a better URL handling package...
> and it will take a while given current Avalon status and my time :(
> 
> > >
> > > Ross Burton
> >
> > Giacomo
> >
> > =====
> > --
> > PWR GmbH, Organisation & Entwicklung      Tel:   +41 (0)1 856 2202
> > Giacomo Pati, CTO/CEO                     Fax:   +41 (0)1 856 2201
> > Hintereichenstrasse 7                     Mailto:Giacomo.Pati@pwr.ch
> > CH-8166 Niederweningen                    Web:   http://www.pwr.ch
> >
> 
> --
> Stefano Mazzocchi      One must still have chaos in oneself to be
>                           able to give birth to a dancing star.
> <stefano@apache.org>                             Friedrich Nietzsche
> --------------------------------------------------------------------
>  Missed us in Orlando? Make it up with ApacheCON Europe in London!
> ------------------------- http://ApacheCon.Com ---------------------

-- 
PWR GmbH, Organisation & Entwicklung      Tel:   +41 (0)1  856 2202
Giacomo Pati, CTO/CEO                     Fax:   +41 (0)1  856 2201
Hintereichenstrasse 7                     Mobil: +41 (0)78 759 7703
CH-8166 Niederweningen                    Mailto:Giacomo.Pati@pwr.ch
                                          Web:   http://www.pwr.ch
