cocoon-dev mailing list archives

From Stefano Mazzocchi <stef...@apache.org>
Subject Re: [C2] Link filtering and Content aggregation
Date Thu, 05 Oct 2000 10:42:15 GMT
Ross Burton wrote:
> 
> > Does this have something to do with the well-known "robots.txt" file
> > used to prevent spiders from stepping into specific URIs?
> >
> > Shouldn't we express the crawl attribute to the outside by a request
> > URI to "robots.txt"? Or is crawling from the command line different
> > from crawling by a spider? The sitemap can check that URI if it fails
> > to select a resource in a pipeline (falling through all matches).
> 
> Good point.  Does anyone have the robots.txt spec so we can decide this?
> 
> > Anybody in touch with those error numbers used? Are there any free to
> > use to implement custom needs?
> 
> Assuming by error numbers you mean HTTP errors (which is sensible), I
> just happen to have RFC 2616 (also known as HTTP/1.1) to hand, so I'll
> extract it here:
> 
> --------
> 
> 6.1.1 Status Code and Reason Phrase
> 
>    The Status-Code element is a 3-digit integer result code of the
>    attempt to understand and satisfy the request. These codes are fully
>    defined in section 10. The Reason-Phrase is intended to give a short
>    textual description of the Status-Code. The Status-Code is intended
>    for use by automata and the Reason-Phrase is intended for the human
>    user. The client is not required to examine or display the Reason-
>    Phrase.
> 
>    The first digit of the Status-Code defines the class of response. The
>    last two digits do not have any categorization role. There are 5
>    values for the first digit:
> 
>       - 1xx: Informational - Request received, continuing process
> 
>       - 2xx: Success - The action was successfully received,
>         understood, and accepted
> 
>       - 3xx: Redirection - Further action must be taken in order to
>         complete the request
> 
>       - 4xx: Client Error - The request contains bad syntax or cannot
>         be fulfilled
> 
>       - 5xx: Server Error - The server failed to fulfill an apparently
>         valid request
> 
>    The individual values of the numeric status codes defined for
>    HTTP/1.1, and an example set of corresponding Reason-Phrase's, are
>    presented below. The reason phrases listed here are only
>    recommendations -- they MAY be replaced by local equivalents without
>    affecting the protocol.
> 
>       Status-Code    =
>             "100"  ; Section 10.1.1: Continue
>           | "101"  ; Section 10.1.2: Switching Protocols
>           | "200"  ; Section 10.2.1: OK
>           | "201"  ; Section 10.2.2: Created
>           | "202"  ; Section 10.2.3: Accepted
>           | "203"  ; Section 10.2.4: Non-Authoritative Information
>           | "204"  ; Section 10.2.5: No Content
>           | "205"  ; Section 10.2.6: Reset Content
>           | "206"  ; Section 10.2.7: Partial Content
>           | "300"  ; Section 10.3.1: Multiple Choices
>           | "301"  ; Section 10.3.2: Moved Permanently
>           | "302"  ; Section 10.3.3: Found
>           | "303"  ; Section 10.3.4: See Other
>           | "304"  ; Section 10.3.5: Not Modified
>           | "305"  ; Section 10.3.6: Use Proxy
>           | "307"  ; Section 10.3.8: Temporary Redirect
>           | "400"  ; Section 10.4.1: Bad Request
>           | "401"  ; Section 10.4.2: Unauthorized
>           | "402"  ; Section 10.4.3: Payment Required
>           | "403"  ; Section 10.4.4: Forbidden
>           | "404"  ; Section 10.4.5: Not Found
>           | "405"  ; Section 10.4.6: Method Not Allowed
>           | "406"  ; Section 10.4.7: Not Acceptable
>           | "407"  ; Section 10.4.8: Proxy Authentication Required
>           | "408"  ; Section 10.4.9: Request Time-out
>           | "409"  ; Section 10.4.10: Conflict
>           | "410"  ; Section 10.4.11: Gone
>           | "411"  ; Section 10.4.12: Length Required
>           | "412"  ; Section 10.4.13: Precondition Failed
>           | "413"  ; Section 10.4.14: Request Entity Too Large
>           | "414"  ; Section 10.4.15: Request-URI Too Large
>           | "415"  ; Section 10.4.16: Unsupported Media Type
>           | "416"  ; Section 10.4.17: Requested range not satisfiable
>           | "417"  ; Section 10.4.18: Expectation Failed
>           | "500"  ; Section 10.5.1: Internal Server Error
>           | "501"  ; Section 10.5.2: Not Implemented
>           | "502"  ; Section 10.5.3: Bad Gateway
>           | "503"  ; Section 10.5.4: Service Unavailable
>           | "504"  ; Section 10.5.5: Gateway Time-out
>           | "505"  ; Section 10.5.6: HTTP Version not supported
> 
>    [snip]
> 
>    HTTP status codes are extensible. HTTP applications are not required
>    to understand the meaning of all registered status codes, though such
>    understanding is obviously desirable. However, applications MUST
>    understand the class of any status code, as indicated by the first
>    digit, and treat any unrecognized response as being equivalent to the
>    x00 status code of that class, with the exception that an
>    unrecognized response MUST NOT be cached. For example, if an
>    unrecognized status code of 431 is received by the client, it can
>    safely assume that there was something wrong with its request and
>    treat the response as if it had received a 400 status code. In such
>    cases, user agents SHOULD present to the user the entity returned
>    with the response, since that entity is likely to include human-
>    readable information which will explain the unusual status.

I don't think we need to create a special one... any 4xx error will
make the CLI skip the resource and write a "resource not available
offline" placeholder page.
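
A minimal sketch of that rule in Java, following the RFC text quoted
above: an unrecognized code is reduced to the x00 code of its class, and
any 4xx result means "skip and write a placeholder". The class and
method names (OfflineCrawlPolicy, classCode, skipWithPlaceholder) are
hypothetical and are not part of the C2 code base.

```java
// Hypothetical helper for illustration only; not Cocoon's actual API.
final class OfflineCrawlPolicy {

    private OfflineCrawlPolicy() {}

    /**
     * Reduces a status code to the x00 code of its class, as RFC 2616
     * section 6.1.1 asks for unrecognized codes (431 becomes 400).
     */
    static int classCode(int status) {
        if (status < 100 || status > 599) {
            throw new IllegalArgumentException("not an HTTP status code: " + status);
        }
        return (status / 100) * 100;
    }

    /** True when the CLI should skip the resource and write a placeholder page. */
    static boolean skipWithPlaceholder(int status) {
        return classCode(status) == 400;
    }
}
```

With this, a custom code such as 431 is handled exactly like 400 or 404,
so no dedicated "not available offline" status is needed.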

> > > I take it that in this example the resource /sitebar returns the XML:
> > >
> > >   <sitebar>
> > >     <item xlink:href=".."/>
> > >     ....
> > >   </sitebar>
> >
> > Are you sure this should return the XML? Is this an implicit
> > "cocoon-view=first" parameter?
> 
> I was assuming that the /sitebar resource used the XML serializer, and
> the included XML was that output from the serializer.

exactly.

More generally, one could select the serializer based on whether the
call is internal or external. So such a sitebar could be HTML if
requested by a frame (externally) or plain XML if called internally.
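
As a rough illustration of that choice, assuming the call origin is
already known to the pipeline. The Origin enum and the serializer names
"xml" and "html" are placeholders invented for this sketch, not
identifiers from the sitemap.

```java
// Hypothetical sketch; names are illustrative, not taken from C2.
final class SerializerChoice {

    /** Where the request for the resource came from. */
    enum Origin { INTERNAL, EXTERNAL }

    private SerializerChoice() {}

    /** Internal calls get raw XML for inclusion; external calls get HTML. */
    static String serializerFor(Origin origin) {
        return origin == Origin.INTERNAL ? "xml" : "html";
    }
}
```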
 
> > > This requires a custom URL handler, doesn't it?  How is this going to
> > > be
> > > handled?  org.apache.cocoon.utils.URL?
> >
> > I don't know if this is possible. Does such a custom URL handler have
> > all the information necessary to fulfill that need? Wouldn't it be
> > better the sitemap engine itself checks this and somehow recursively
> > calls itself?
> 
> As I see it, the cocoon: URL handler would be part of the C2 package, so
> has access to the internal workings of the C2 engine.

You both are right. But Giacomo was proposing to avoid using the URL
handling framework (which sucks for server side operation, BTW) and make
the sitemap (at compilation time, I would guess) parse the URI itself
and place calls to the other resources directly.

Speaking of which, we must add to the environment a way to specify
whether the call was internal or external, so that we can create
matchers/selectors that build on this and keep internal URIs from being
reachable from the outside.
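
A minimal sketch of such a matcher, assuming the environment exposes the
request URI and an internal/external flag. The Environment interface
here is a hypothetical stand-in, as are InternalOnlyMatcher and the
env() factory; none of them is the real C2 interface.

```java
// Hypothetical sketch; the real environment and matcher interfaces may differ.
final class InternalOnlyMatcher {

    /** Stand-in for an environment that carries the call origin. */
    interface Environment {
        String uri();
        boolean isInternal();
    }

    /** Small factory so the example can be exercised without a container. */
    static Environment env(final String uri, final boolean internal) {
        return new Environment() {
            public String uri() { return uri; }
            public boolean isInternal() { return internal; }
        };
    }

    private final String prefix;

    InternalOnlyMatcher(String prefix) {
        this.prefix = prefix;
    }

    /** Simple prefix match that succeeds only for internal calls. */
    boolean matches(Environment env) {
        return env.isInternal() && env.uri().startsWith(prefix);
    }
}
```

An external request for /sitebar then falls through this matcher, while
the same URI requested from inside the sitemap is served.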

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<stefano@apache.org>                             Friedrich Nietzsche
--------------------------------------------------------------------
 Missed us in Orlando? Make it up with ApacheCON Europe in London!
------------------------- http://ApacheCon.Com ---------------------


