cocoon-dev mailing list archives

From Jeff Turner <je...@apache.org>
Subject cli.xconf questions
Date Sat, 02 Aug 2003 12:08:21 GMT
Hi,

I'm tinkering around with the CLI, thinking how to add
don't-crawl-this-page support, and have some questions on how cli.xconf
currently works.  The following block in cli.xconf has me confused.


  |  The old behaviour - appends uri to the specified destination
  |  directory (as specified in <dest-dir>):
  |
  |   <uri>documents/index.html</uri>

Do we still want this <uri>...</uri> behaviour?  Currently the CLI only
accepts <uri src="..."/>.  Come to think of it, the attribute name 'src'
doesn't really make sense.  What is the "source" of a Cocoon URI?  It
would be the XML (documents/index.xml), which is not what we're
specifying in @src.

  |  Append: append the generated page's URI to the end of the 
  |  source URI:
  |
  |   <uri type="append" src-prefix="documents/" src="index.html"
  |   dest="build/dest/"/>

What is a 'source URI' here, and why would we want to append another URI
(URIs are not additive)?  Does this mean documents/index.html would be
written to build/dest/?  If so, why separate @src-prefix and @src?

  |
  |  Replace: Completely ignore the generated page's URI - just 
  |  use the destination URI:
  |
  |   <uri type="replace" src-prefix="documents/" src="index.html" 
  |   dest="build/dest/docs.html"/>

Sounds fine, but again, since we know the whole URI
(documents/index.html), why separate into @src-prefix and @src?

  |
  |  Insert: Insert generated page's URI into the destination 
  |  URI at the point marked with a * (example uses fictional 
  |  zip protocol)
  |
  |   <uri type="insert" src-prefix="documents/" src="index.html" 
  |   dest="zip://*.zip/page.html"/>

This one leaves me very confused.  What would the result be here?  An index.zip
file, containing the bytes from documents/index.html saved as page.html?
Is there a non-fictional scenario where this makes more sense? :)
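
To make the questions above concrete, here is a rough Python sketch of one
literal reading of the three modes.  It is only a guess at the intended
semantics, not the CLI's actual code: resolve_dest is a hypothetical helper,
and it assumes the requested URI is always @src-prefix + @src while only @src
takes part in building the destination.

```python
def resolve_dest(mode, src_prefix, src, dest):
    """Return (request_uri, destination) under one reading of cli.xconf.

    Hypothetical helper: the real CLI may resolve these differently.
    """
    # Assumption: the page requested from Cocoon is always prefix + src.
    request_uri = src_prefix + src

    if mode == "append":
        # Assumption: only @src (not @src-prefix) is appended to @dest.
        destination = dest + src
    elif mode == "replace":
        # The generated page's URI is ignored; @dest is used verbatim.
        destination = dest
    elif mode == "insert":
        # @src is substituted at the first '*' in @dest.
        destination = dest.replace("*", src, 1)
    else:
        raise ValueError(f"unknown mode: {mode!r}")

    return request_uri, destination


if __name__ == "__main__":
    examples = [
        ("append", "build/dest/"),
        ("replace", "build/dest/docs.html"),
        ("insert", "zip://*.zip/page.html"),
    ]
    for mode, dest in examples:
        print(mode, resolve_dest(mode, "documents/", "index.html", dest))
```

Under this reading, splitting @src-prefix from @src would only matter for
'append' and 'insert', where the prefix is kept out of the destination name.
For 'replace' the split would make no difference.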


Anyway, on to the subject of excluding certain URIs: are there any
preferred ways of doing it?  I've currently got:

  <ignore-uri>....</ignore-uri>

working, which seems crude but effective.  Ideally I'd like to:
 - Use wildcards ("don't crawl '*.xml' URLs")
 - Be able to exclude links based on which page they originate from
   ("ignore broken links from sitemap-ref.html")

I was thinking of some sort of nesting notation for indicating links from
a certain page:

  <!-- Ignore *.xml links from sitemap-ref.* -->
  <ignore from-uri="sitemap-ref.*"> 
      <uri>*.xml</uri>   
  </ignore>
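
The matching behind such a rule could be sketched roughly as below, in Python
for brevity.  Everything here is an assumption for illustration: the
(from-pattern, uri-pattern) rule shape, the should_ignore name, and the use of
shell-style globbing from the standard fnmatch module.  None of it is existing
CLI code.

```python
from fnmatch import fnmatchcase

# Hypothetical in-memory form of the <ignore> blocks:
# (from_uri pattern or None for "any page", link uri pattern).
IGNORE_RULES = [
    ("sitemap-ref.*", "*.xml"),  # ignore *.xml links found on sitemap-ref.*
]


def should_ignore(link_uri, from_uri, rules=IGNORE_RULES):
    """True if a link found on page from_uri should not be crawled."""
    for from_pattern, uri_pattern in rules:
        # A rule with no from-pattern applies to links from every page.
        from_ok = from_pattern is None or fnmatchcase(from_uri, from_pattern)
        if from_ok and fnmatchcase(link_uri, uri_pattern):
            return True
    return False


if __name__ == "__main__":
    print(should_ignore("sitemap.xml", "sitemap-ref.html"))
    print(should_ignore("sitemap.xml", "index.html"))
```

One thing to note: with fnmatch-style globbing, '*' also matches '/', so
'*.xml' would catch 'docs/a.xml' as well.  Whether that is desirable, or
whether sitemap-style '*' versus '**' would be a better fit, is an open
question.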

Sorry I don't have any answers or even particularly coherent questions ;)
I have the feeling that cli.xconf's job, mapping URIs to the filesystem,
could potentially be quite intricate.  It is roughly an inverse of what
the sitemap does.  Perhaps we need an analogous syntax?


--Jeff

