cocoon-users mailing list archives

From Upayavira ...@upaya.co.uk>
Subject Re: 2.1: Neither LinkSerializer nor LinkGatherer producing a complete link list
Date Sun, 31 Aug 2003 19:27:26 GMT
Florian G. Haas wrote:

>Hi again.
>
>On Saturday 30 August 2003 20:08, Upayavira wrote:
>| This is in fact a problem initialising the Deli block. I don't know much
>| about it, and I can't really explain why it should fail in the CLI but
>| not in the servlet, but I'm pretty sure I've seen this before. It is a
>| workaround, not a solution, but if you rebuild Cocoon excluding the Deli
>| block, you'll get rid of this exception. Maybe I should add 'avoid deli'
>| to the CLI docs :-(
>
>OK. I'll ignore this error for now.
>
>| >Setting the logkit level to DEBUG yields these interesting results in
>| >sitemap.log:
>| >
>| >DEBUG   (2003-08-30) 14:03.21:692   [sitemap.generator.file] (Unknown-URI)
>| >Unknown-thread/FileGenerator: processing file src/fgh.xtm
>| >DEBUG   (2003-08-30) 14:03.21:692   [sitemap.generator.file] (Unknown-URI)
>| >Unknown-thread/FileGenerator: file resolved to
>| >file:/home/fgh/public_html/src/fgh.xtm
>| >DEBUG   (2003-08-30) 14:03.22:713   [sitemap] (Unknown-URI)
>| >Unknown-thread/ExtendedXLinkPipe: Transforming to XLink:
>| >URI=http://www.w3.org/1999/xhtml NAME=link RAW=link ATT=href
>| >NS=http://www.w3.org/1999/xhtml VALUE=../css/tm4web.css
>| >DEBUG   (2003-08-30) 14:03.23:928   [sitemap] (Unknown-URI)
>| >Unknown-thread/ExtendedXLinkPipe: Transforming to XLink:
>| >URI=http://www.w3.org/1999/xhtml NAME=a RAW=a ATT=href
>| >NS=http://www.w3.org/1999/xhtml VALUE=mailto:f.g.haas@gmx.net
>| >DEBUG   (2003-08-30) 14:03.23:937   [sitemap] (Unknown-URI)
>| >Unknown-thread/ExtendedXLinkPipe: Transforming to XLink:
>| >URI=http://www.w3.org/1999/xhtml NAME=a RAW=a ATT=href
>| >NS=http://www.w3.org/1999/xhtml VALUE=http://validator.w3.org/check/referer
>| >DEBUG   (2003-08-30) 14:03.23:938   [sitemap] (Unknown-URI)
>| >Unknown-thread/ExtendedXLinkPipe: Transforming to XLink:
>| >URI=http://www.w3.org/1999/xhtml NAME=img RAW=img ATT=src
>| >NS=http://www.w3.org/1999/xhtml VALUE=http://www.w3.org/Icons/valid-xhtml10
>| >DEBUG   (2003-08-30) 14:03.23:939   [sitemap] (Unknown-URI)
>| >Unknown-thread/ExtendedXLinkPipe: Transforming to XLink:
>| >URI=http://www.w3.org/1999/xhtml NAME=a RAW=a ATT=href
>| >NS=http://www.w3.org/1999/xhtml VALUE=http://jigsaw.w3.org/css-validator
>| >DEBUG   (2003-08-30) 14:03.23:940   [sitemap] (Unknown-URI)
>| >Unknown-thread/ExtendedXLinkPipe: Transforming to XLink:
>| >URI=http://www.w3.org/1999/xhtml NAME=img RAW=img ATT=src
>| >NS=http://www.w3.org/1999/xhtml
>| >VALUE=http://jigsaw.w3.org/css-validator/images/vcss.gif
>| >
>| >As pointed out in my earlier reply to Jeff, the result document contains
>| > 12 links. Why is ExtendedXLinkPipe apparently resolving only 6?
>|
>| Can you post some of the document that you're scanning? Some with links
>| that are found and some with links that aren't?
>
>Getting back to the output above for a minute, please check if I'm correct on 
>the following points:
>* Link #1 should be crawled as it's the CSS reference, which is available 
>locally.
>* Link #2 should not be crawled since it's a mailto: URI.
>* Links #3,4,5,6 should not be crawled since they are remote links.
>
>The question remains, why is it not crawling all the other links, and also not 
>requesting the locally available CSS? 
>
Are you using a src-prefix when specifying the URL to start at? If so, 
the CLI will not scan up above the level of this prefix.
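
To make the src-prefix point concrete, here is a minimal Python sketch of
the kind of check involved. It is an illustration only, not Cocoon's actual
code: the function name is_within_prefix and the prefix /~fgh/en/ are
assumptions made up for the example.

```python
from urllib.parse import urljoin, urlsplit


def is_within_prefix(page_uri, href, src_prefix):
    """Return True if href is a local link that resolves under src_prefix.

    page_uri   -- URI of the page the link was found on
    href       -- the raw href/src value from the page
    src_prefix -- hypothetical prefix the crawl was started with
    """
    # Anything with a scheme (mailto:, http:, ...) is not a local link.
    if urlsplit(href).scheme:
        return False
    # Resolve the relative link against the page it came from.
    resolved = urljoin(page_uri, href)
    return resolved.startswith(src_prefix)
```

With an assumed prefix of /~fgh/en/, person.html stays inside the prefix,
while ../css/tm4web.css resolves to /~fgh/css/tm4web.css and falls
outside it.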

>BTW, is there a way to tell the CLI to 
>retrieve remote *images* referenced via img src="http://...", even though 
>remote *links* (<a href="http://...">) are omitted? 
>
Interesting idea. I'd first want to rework the format of the cli.xconf 
file, which I don't think is ready for that sort of thing yet. But it is 
certainly possible, and probably not hard to add.
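
As a rough picture of the behaviour being asked for, here is a minimal
Python sketch of such a policy. Nothing like this exists in the CLI today;
the function should_fetch and the flag fetch_remote_images are
hypothetical names used only for this example.

```python
from urllib.parse import urlsplit


def should_fetch(element, uri, fetch_remote_images=False):
    """Decide whether a crawler would retrieve uri found on <element>.

    element             -- local name of the element, e.g. "a" or "img"
    uri                 -- the href/src value
    fetch_remote_images -- hypothetical switch for the requested feature
    """
    scheme = urlsplit(uri).scheme.lower()
    if scheme == "":
        # No scheme: a relative, locally served link. Always follow.
        return True
    if scheme in ("http", "https"):
        # Remote: only images, and only when the switch is on.
        return fetch_remote_images and element == "img"
    # mailto: and anything else is never retrieved.
    return False
```

With the switch on, the two validator images would be retrieved while the
<a href="http://..."> links around them would still be skipped.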

>Now, getting to your question, here is the XHTML output of the document I used 
>as an example (this is found at ~fgh/en/index.html in my Cocoon setup). This 
>is somewhat abbreviated, but I'm sure it gets the point across nonetheless:
>
><?xml version="1.0" encoding="ISO-8859-1"?>
><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" 
>"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
><html xmlns="http://www.w3.org/1999/xhtml">
>  <head>
>    <meta content="text/html;charset=ISO-8859-1" http-equiv="content-type" />
>    <link href="../css/tm4web.css" type="text/css" rel="stylesheet" />
>    <meta content="The dynamic TM4J web application" name="generator" />
>    <title>Florian G. Haas</title>
>  </head>
>  <body>
>    <div id="main">
>      <div class="topicinfo">
>        <div class="about">
>          <h1 class="topictitle">Florian G. Haas</h1>
>          <h3 class="topictype"><a title="person" 
>href="person.html">person</a></h3>
>        </div>
>        <div>
>          <div class="navigation">
>            <div class="navbox">
>              <h4 class="assoctitle"> is interested in, is of interest to</h4>
>              <ul class="assocmembers">
>                <li class="assocmember"><a title="Topic Maps" 
>href="topicmaps.html">Topic Maps</a> <span class="memberrole"> (object of 
>interest)</span></li>
>                <li class="assocmember"><a title="Java" 
>href="java.html">Java</a> <span class="memberrole"> (object of 
>interest)</span></li>
>              </ul>
>              <h4 class="assoctitle">#subject</h4>
>              <ul class="assocmembers">
>                <li class="assocmember"><a title="Austria" 
>href="austria.html">Austria</a> <span class="memberrole"> 
>(origin)</span></li>
>              </ul>
>              <!-- ... -->
>            </div>
>          </div>
>          <div class="topicoccurs">
>	    <!-- ... --->
>            <ul class="resources">
>              <li>email address: <a 
>href="mailto:f.g.haas@gmx.net">f.g.haas@gmx.net</a></li>
>              <li>PGP key: TODO.</li>
>            </ul>
>          </div>
>        </div>
>      </div>
>    </div>
>    <div id="validation-buttons">
>      <a href="http://validator.w3.org/check/referer" id="valid-xhtml-anchor">
>        <img alt="Valid XHTML 1.0!" 
>src="http://www.w3.org/Icons/valid-xhtml10" id="valid-xhtml-image" />
>      </a>
>      <a href="http://jigsaw.w3.org/css-validator" id="valid-css-anchor">
>        <img alt="Valid CSS!" 
>src="http://jigsaw.w3.org/css-validator/images/vcss.gif" id="valid-css-image" 
>/>
>      </a>
>    </div>
>  </body>
></html>
>
>The only difference between the links that leave their footprints in 
>sitemap.log and the ones that don't is that the latter are more deeply nested 
>inside a bunch of <div>'s, but that surely shouldn't make a difference, should 
>it?
>
>| What you can do is improve your link view to use an XSL that simplifies
>| your page down to just the links you want and then run that into the
>| linkSerializer. A hack, but it might at least get you going.
>
>Watch out. Now it's getting really bizarre.
>
>I use this stylesheet in order to extract the links:
>
><?xml version="1.0" encoding="utf-8"?>
><xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>                xmlns:xlink="http://www.w3.org/1999/xlink/"
>                xmlns:xhtml="http://www.w3.org/1999/xhtml"
>                version="1.0"
>                exclude-result-prefixes="xhtml">
>  <xsl:output indent="yes"/>
>
>  <xsl:template match="/">
>    <output>
>      <xsl:apply-templates/>
>    </output>
>  </xsl:template>
>
>  <xsl:template match="//*[@href|@src|@xlink:href]">
>    <xsl:copy-of select="."/>
>  </xsl:template>
>
>  <xsl:template match="*">
>    <xsl:apply-templates/>
>  </xsl:template>
>
>  <xsl:template match="text()">
>    <!-- ignore -->
>  </xsl:template>
></xsl:stylesheet>
>
>Now, when I run this through Cocoon, whether in a separate pipeline or in a 
>view, I get this output:
>
><?xml version="1.0" encoding="ISO-8859-1"?>
><output xmlns:xlink="http://www.w3.org/1999/xlink/">
>  <link xmlns="http://www.w3.org/1999/xhtml" href="../css/tm4web.css" 
>type="text/css" rel="stylesheet"/>
>  <a xmlns="http://www.w3.org/1999/xhtml" 
>href="mailto:f.g.haas@gmx.net">f.g.haas@gmx.net</a>
>  <a xmlns="http://www.w3.org/1999/xhtml" 
>href="http://validator.w3.org/check/referer" id="valid-xhtml-anchor">
>  <img alt="Valid XHTML 1.0!" src="http://www.w3.org/Icons/valid-xhtml10" 
>id="valid-xhtml-image"/>
>  </a>
>  <a xmlns="http://www.w3.org/1999/xhtml" 
>href="http://jigsaw.w3.org/css-validator" id="valid-css-anchor">
>  <img alt="Valid CSS!" 
>src="http://jigsaw.w3.org/css-validator/images/vcss.gif" 
>id="valid-css-image"/>
>  </a>
></output>
>
>This corresponds precisely to the sitemap.log entries cited earlier: some 
>links get recognized, some don't.
>
>Now, I open my document, ~fgh/en/index.html, in a browser (it still displays 
>nicely). I download it to disk. Then, I run it through Xalan (I use 2.5.1), 
>from the command line, using the same XSL stylesheet I reference in my Cocoon 
>setup. Here's the result:
>
><?xml version="1.0" encoding="UTF-8"?>
>  <output xmlns:xlink="http://www.w3.org/1999/xlink/">
>  <link xmlns="http://www.w3.org/1999/xhtml" href="../css/tm4web.css" 
>type="text/css" rel="stylesheet"/>
>  <a xmlns="http://www.w3.org/1999/xhtml" href="person.html" 
>shape="rect">person</a>
>  <a xmlns="http://www.w3.org/1999/xhtml" href="topicmaps.html" 
>shape="rect">Topic Maps</a>
>  <a xmlns="http://www.w3.org/1999/xhtml" href="java.html" 
>shape="rect">Java</a>
>  <a xmlns="http://www.w3.org/1999/xhtml" href="austria.html" 
>shape="rect">Austria</a>
>  <a xmlns="http://www.w3.org/1999/xhtml" href="arno.html" shape="rect">Arno 
>Sosna</a>
>  <a xmlns="http://www.w3.org/1999/xhtml" href="tm4j.html" shape="rect">the 
>TM4J project</a>
>  <a xmlns="http://www.w3.org/1999/xhtml" href="fhib.html" shape="rect">FHIB, 
>Fachhochschul-Studiengang Informationsberufe</a>
>  <a xmlns="http://www.w3.org/1999/xhtml" href="hico.html" 
>shape="rect">hico</a>
>  <a xmlns="http://www.w3.org/1999/xhtml" href="this-website.html" 
>shape="rect">this website</a>
>  <a xmlns="http://www.w3.org/1999/xhtml" href="my-writings.html" 
>shape="rect">my writings</a>
>  <a xmlns="http://www.w3.org/1999/xhtml" href="mailto:f.g.haas@gmx.net" 
>shape="rect">f.g.haas@gmx.net</a>
>  <a xmlns="http://www.w3.org/1999/xhtml" 
>href="http://validator.w3.org/check/referer" id="valid-xhtml-anchor" 
>shape="rect">
>  <img alt="Valid XHTML 1.0!" src="http://www.w3.org/Icons/valid-xhtml10" 
>id="valid-xhtml-image"/>
>  </a>
>  <a xmlns="http://www.w3.org/1999/xhtml" 
>href="http://jigsaw.w3.org/css-validator" id="valid-css-anchor" shape="rect">
>  <img alt="Valid CSS!" 
>src="http://jigsaw.w3.org/css-validator/images/vcss.gif" 
>id="valid-css-image"/>
>  </a>
></output>
>
>Which is precisely what I want! (never mind the encoding). Could someone please 
>hit me over the head and tell me what I'm doing wrong, or else explain to me 
>why this is happening? Same source doc, same stylesheet, different result 
>depending on whether I run Xalan standalone or from inside Cocoon... 
>puzzling, to say the least.
>  
>
I know this is being simplistic, but have you looked at the raw text of 
the file to see if there's anything funny going on in there? If you 
wish, send me the file off-list - both xhtml and stylesheet, and I'll 
see if I can reproduce the problem.
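
One way to check the raw file independently of both Cocoon and Xalan is to
list every element that carries an href or src attribute with a plain XML
parser. Below is a minimal sketch using Python's standard library. It
assumes the downloaded page is well-formed XML and uses no entities that
need the DTD; list_links is a name made up for the example.

```python
import xml.etree.ElementTree as ET


def list_links(xhtml_bytes):
    """Return (element name, URI) pairs for every href/src in the document.

    Elements are visited in document order, at any nesting depth.
    """
    root = ET.fromstring(xhtml_bytes)
    links = []
    for el in root.iter():
        # el.tag looks like "{http://www.w3.org/1999/xhtml}a"; keep "a".
        name = el.tag.split("}")[-1]
        for attr in ("href", "src"):
            value = el.get(attr)
            if value is not None:
                links.append((name, value))
    return links
```

If this reports all the links for the saved file, the file itself is
probably fine and the difference lies in what reaches the stylesheet
inside the pipeline.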

>| >| If we work together, I think we'll fix this.
>| >
>| >Well I hope this helps! :-)
>|
>| It certainly does. But I've given you a few more assignments above!
>
>I have hopefully fulfilled them to the extent that you expected. :-)
>  
>
You certainly have.

Send me the XHTML file off-list and I'll see if I can find anything with 
that. If not, the next stage is to see how the XLink code has changed 
since 2.0.4, which it certainly has. If you want, you could try copying 
the files from 2.0.4 into 2.1 and recompiling, to see how that goes. Just 
the files in org.apache.cocoon.xml.xlink.*.
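
Before copying anything, it may help to see which files in that package
actually differ between the two releases. Below is a minimal Python sketch.
It assumes you have both source trees unpacked and that the package
directory is flat; compare_dirs is a made-up name and the two directory
arguments are whatever paths apply on your machine.

```python
import filecmp
import os


def compare_dirs(old_dir, new_dir):
    """Compare the files of two flat directories by content.

    Returns a dict with the names that changed, exist only in old_dir,
    or exist only in new_dir.
    """
    old_names = set(os.listdir(old_dir))
    new_names = set(os.listdir(new_dir))
    common = sorted(old_names & new_names)
    # shallow=False forces a byte-by-byte comparison of each common file.
    _same, changed, _errors = filecmp.cmpfiles(
        old_dir, new_dir, common, shallow=False
    )
    return {
        "changed": changed,
        "only_old": sorted(old_names - new_names),
        "only_new": sorted(new_names - old_names),
    }
```

Call it with the 2.0.4 and 2.1 copies of the xlink package directory; the
"changed" list is where to start reading.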

Regards, Upayavira




