cocoon-users mailing list archives

From "Florian G. Haas" <f.g.h...@gmx.net>
Subject Re: 2.1: Neither LinkSerializer nor LinkGatherer producing a complete link list
Date Sun, 31 Aug 2003 08:11:13 GMT
Hi again.

On Saturday 30 August 2003 20:08, Upayavira wrote:
| This is in fact a problem initialising the Deli block. I don't know much
| about it, and I can't really explain why it should fail in the CLI but
| not in the servlet, but I'm pretty sure I've seen this before. It is a
| workaround, not a solution, but if you rebuild Cocoon excluding the Deli
| block, you'll get rid of this exception. Maybe I should add 'avoid deli'
| to the CLI docs :-(

OK. I'll ignore this error for now.

| >Setting the logkit level to DEBUG yields these interesting results in
| >sitemap.log:
| >
| >DEBUG   (2003-08-30) 14:03.21:692   [sitemap.generator.file] (Unknown-URI)
| >Unknown-thread/FileGenerator: processing file src/fgh.xtm
| >DEBUG   (2003-08-30) 14:03.21:692   [sitemap.generator.file] (Unknown-URI)
| >Unknown-thread/FileGenerator: file resolved to
| >file:/home/fgh/public_html/src/fgh.xtm
| >DEBUG   (2003-08-30) 14:03.22:713   [sitemap] (Unknown-URI)
| >Unknown-thread/ExtendedXLinkPipe: Transforming to XLink:
| >URI=http://www.w3.org/1999/xhtml NAME=link RAW=link ATT=href
| >NS=http://www.w3.org/1999/xhtml VALUE=../css/tm4web.css
| >DEBUG   (2003-08-30) 14:03.23:928   [sitemap] (Unknown-URI)
| >Unknown-thread/ExtendedXLinkPipe: Transforming to XLink:
| >URI=http://www.w3.org/1999/xhtml NAME=a RAW=a ATT=href
| >NS=http://www.w3.org/1999/xhtml VALUE=mailto:f.g.haas@gmx.net
| >DEBUG   (2003-08-30) 14:03.23:937   [sitemap] (Unknown-URI)
| >Unknown-thread/ExtendedXLinkPipe: Transforming to XLink:
| >URI=http://www.w3.org/1999/xhtml NAME=a RAW=a ATT=href
| >NS=http://www.w3.org/1999/xhtml VALUE=http://validator.w3.org/check/referer
| >DEBUG   (2003-08-30) 14:03.23:938   [sitemap] (Unknown-URI)
| >Unknown-thread/ExtendedXLinkPipe: Transforming to XLink:
| >URI=http://www.w3.org/1999/xhtml NAME=img RAW=img ATT=src
| >NS=http://www.w3.org/1999/xhtml VALUE=http://www.w3.org/Icons/valid-xhtml10
| >DEBUG   (2003-08-30) 14:03.23:939   [sitemap] (Unknown-URI)
| >Unknown-thread/ExtendedXLinkPipe: Transforming to XLink:
| >URI=http://www.w3.org/1999/xhtml NAME=a RAW=a ATT=href
| >NS=http://www.w3.org/1999/xhtml VALUE=http://jigsaw.w3.org/css-validator
| >DEBUG   (2003-08-30) 14:03.23:940   [sitemap] (Unknown-URI)
| >Unknown-thread/ExtendedXLinkPipe: Transforming to XLink:
| >URI=http://www.w3.org/1999/xhtml NAME=img RAW=img ATT=src
| >NS=http://www.w3.org/1999/xhtml
| >VALUE=http://jigsaw.w3.org/css-validator/images/vcss.gif
| >
| >As pointed out in my earlier reply to Jeff, the result document contains
| > 12 links. Why is ExtendedXLinkPipe apparently resolving only 6?
|
| Can you post some of the document that you're scanning? Some with links
| that are found and some with links that aren't?

Getting back to the output above for a minute, please check if I'm correct on 
the following points:
* Link #1 should be crawled as it's the CSS reference, which is available 
locally.
* Link #2 should not be crawled since it's a mailto: URI.
* Links #3,4,5,6 should not be crawled since they are remote links.
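To make the three cases above concrete, here is a minimal sketch of how I am classifying the links by URI scheme. It is only my reading of what the crawler ought to do, not Cocoon's actual logic; the function name classify and the category labels are made up for illustration.

```python
from urllib.parse import urlsplit


def classify(href):
    """Sort a link target into 'local', 'mailto', 'remote' or 'other'.

    Hypothetical helper: it mirrors the three bullet points above,
    not the behaviour of the Cocoon CLI.
    """
    scheme = urlsplit(href).scheme.lower()
    if scheme == "":
        # No scheme at all: a relative reference, served by the local site.
        return "local"
    if scheme == "mailto":
        return "mailto"
    if scheme in ("http", "https", "ftp"):
        return "remote"
    return "other"


# The six targets reported in sitemap.log, in order.
logged = [
    "../css/tm4web.css",
    "mailto:f.g.haas@gmx.net",
    "http://validator.w3.org/check/referer",
    "http://www.w3.org/Icons/valid-xhtml10",
    "http://jigsaw.w3.org/css-validator",
    "http://jigsaw.w3.org/css-validator/images/vcss.gif",
]

for target in logged:
    print(classify(target), target)
```

By this reading, only link #1 is a candidate for crawling, and a relative target such as person.html would be one as well.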

The question remains, why is it not crawling all the other links, and also not 
requesting the locally available CSS? BTW, is there a way to tell the CLI to 
retrieve remote *images* referenced via img src="http://...", even though 
remote *links* (<a href="http://...">) are omitted? 

Now, getting to your question, here is the XHTML output of the document I used 
as an example (this is found at ~fgh/en/index.html in my Cocoon setup). This 
is somewhat abbreviated, but I'm sure it gets the point across nonetheless:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" 
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta content="text/html;charset=ISO-8859-1" http-equiv="content-type" />
    <link href="../css/tm4web.css" type="text/css" rel="stylesheet" />
    <meta content="The dynamic TM4J web application" name="generator" />
    <title>Florian G. Haas</title>
  </head>
  <body>
    <div id="main">
      <div class="topicinfo">
        <div class="about">
          <h1 class="topictitle">Florian G. Haas</h1>
          <h3 class="topictype"><a title="person" 
href="person.html">person</a></h3>
        </div>
        <div>
          <div class="navigation">
            <div class="navbox">
              <h4 class="assoctitle"> is interested in, is of interest to</h4>
              <ul class="assocmembers">
                <li class="assocmember"><a title="Topic Maps" 
href="topicmaps.html">Topic Maps</a> <span class="memberrole"> (object of 
interest)</span></li>
                <li class="assocmember"><a title="Java" 
href="java.html">Java</a> <span class="memberrole"> (object of 
interest)</span></li>
              </ul>
              <h4 class="assoctitle">#subject</h4>
              <ul class="assocmembers">
                <li class="assocmember"><a title="Austria" 
href="austria.html">Austria</a> <span class="memberrole"> 
(origin)</span></li>
              </ul>
              <!-- ... -->
            </div>
          </div>
          <div class="topicoccurs">
            <!-- ... -->
            <ul class="resources">
              <li>email address: <a 
href="mailto:f.g.haas@gmx.net">f.g.haas@gmx.net</a></li>
              <li>PGP key: TODO.</li>
            </ul>
          </div>
        </div>
      </div>
    </div>
    <div id="validation-buttons">
      <a href="http://validator.w3.org/check/referer" id="valid-xhtml-anchor">
        <img alt="Valid XHTML 1.0!" 
src="http://www.w3.org/Icons/valid-xhtml10" id="valid-xhtml-image" />
      </a>
      <a href="http://jigsaw.w3.org/css-validator" id="valid-css-anchor">
        <img alt="Valid CSS!" 
src="http://jigsaw.w3.org/css-validator/images/vcss.gif" id="valid-css-image" 
/>
      </a>
    </div>
  </body>
</html>

The only difference between the links that leave their footprints in 
sitemap.log and the ones that don't is that the latter are more deeply nested 
inside a bunch of <div>s, but that surely shouldn't make a difference, should 
it?
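As a sanity check on the nesting question, here is a small sketch that walks every element of a cut-down version of the document and collects href, src and xlink:href values, independently of any XSLT processor. The SAMPLE string is a shortened stand-in for the real page, the function name link_targets is made up, and the XLink namespace is written with the trailing slash exactly as in my stylesheet.

```python
import xml.etree.ElementTree as ET

# Namespace URI as declared in the stylesheet (trailing slash included).
XLINK_NS = "http://www.w3.org/1999/xlink/"

# Shortened stand-in for ~fgh/en/index.html, not the full page.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <link href="../css/tm4web.css" type="text/css" rel="stylesheet" />
  </head>
  <body>
    <div id="main"><div class="topicinfo"><div class="about">
      <h3><a title="person" href="person.html">person</a></h3>
      <ul><li><a href="topicmaps.html">Topic Maps</a></li></ul>
    </div></div></div>
    <div id="validation-buttons">
      <a href="http://validator.w3.org/check/referer">
        <img alt="Valid XHTML 1.0!" src="http://www.w3.org/Icons/valid-xhtml10" />
      </a>
    </div>
  </body>
</html>"""


def link_targets(xml_text):
    """Return all href/src/xlink:href values in document order."""
    root = ET.fromstring(xml_text)
    wanted = ("href", "src", "{%s}href" % XLINK_NS)
    targets = []
    # iter() visits every element regardless of how deeply it is nested.
    for elem in root.iter():
        for attr in wanted:
            if attr in elem.attrib:
                targets.append(elem.attrib[attr])
    return targets


for target in link_targets(SAMPLE):
    print(target)
```

A plain tree walk like this picks up the deeply nested anchors just as readily as the shallow ones, which is what I would expect from any component that sees the complete event stream.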

| What you can do is improve your link view to use an XSL that simplifies
| your page down to just the links you want and then run that into the
| linkSerializer. A hack, but it might at least get you going.

Watch out. Now it's getting really bizarre.

I use this stylesheet in order to extract the links:

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xlink="http://www.w3.org/1999/xlink/"
                xmlns:xhtml="http://www.w3.org/1999/xhtml"
                version="1.0"
                exclude-result-prefixes="xhtml">
  <xsl:output indent="yes"/>

  <xsl:template match="/">
    <output>
      <xsl:apply-templates/>
    </output>
  </xsl:template>

  <xsl:template match="//*[@href|@src|@xlink:href]">
    <xsl:copy-of select="."/>
  </xsl:template>

  <xsl:template match="*">
    <xsl:apply-templates/>
  </xsl:template>

  <xsl:template match="text()">
    <!-- ignore -->
  </xsl:template>
</xsl:stylesheet>

Now, when I run this through Cocoon, whether in a separate pipeline or in a 
view, I get this output:

<?xml version="1.0" encoding="ISO-8859-1"?>
<output xmlns:xlink="http://www.w3.org/1999/xlink/">
  <link xmlns="http://www.w3.org/1999/xhtml" href="../css/tm4web.css" 
type="text/css" rel="stylesheet"/>
  <a xmlns="http://www.w3.org/1999/xhtml" 
href="mailto:f.g.haas@gmx.net">f.g.haas@gmx.net</a>
  <a xmlns="http://www.w3.org/1999/xhtml" 
href="http://validator.w3.org/check/referer" id="valid-xhtml-anchor">
  <img alt="Valid XHTML 1.0!" src="http://www.w3.org/Icons/valid-xhtml10" 
id="valid-xhtml-image"/>
  </a>
  <a xmlns="http://www.w3.org/1999/xhtml" 
href="http://jigsaw.w3.org/css-validator" id="valid-css-anchor">
  <img alt="Valid CSS!" 
src="http://jigsaw.w3.org/css-validator/images/vcss.gif" 
id="valid-css-image"/>
  </a>
</output>

This corresponds precisely to the sitemap.log entries cited earlier: some 
links get recognized, some don't.

Now, I open my document, ~fgh/en/index.html, in a browser (it still displays 
nicely). I download it to disk. Then, I run it through Xalan (I use 2.5.1), 
from the command line, using the same XSL stylesheet I reference in my Cocoon 
setup. Here's the result:

<?xml version="1.0" encoding="UTF-8"?>
  <output xmlns:xlink="http://www.w3.org/1999/xlink/">
  <link xmlns="http://www.w3.org/1999/xhtml" href="../css/tm4web.css" 
type="text/css" rel="stylesheet"/>
  <a xmlns="http://www.w3.org/1999/xhtml" href="person.html" 
shape="rect">person</a>
  <a xmlns="http://www.w3.org/1999/xhtml" href="topicmaps.html" 
shape="rect">Topic Maps</a>
  <a xmlns="http://www.w3.org/1999/xhtml" href="java.html" 
shape="rect">Java</a>
  <a xmlns="http://www.w3.org/1999/xhtml" href="austria.html" 
shape="rect">Austria</a>
  <a xmlns="http://www.w3.org/1999/xhtml" href="arno.html" shape="rect">Arno 
Sosna</a>
  <a xmlns="http://www.w3.org/1999/xhtml" href="tm4j.html" shape="rect">the 
TM4J project</a>
  <a xmlns="http://www.w3.org/1999/xhtml" href="fhib.html" shape="rect">FHIB, 
Fachhochschul-Studiengang Informationsberufe</a>
  <a xmlns="http://www.w3.org/1999/xhtml" href="hico.html" 
shape="rect">hico</a>
  <a xmlns="http://www.w3.org/1999/xhtml" href="this-website.html" 
shape="rect">this website</a>
  <a xmlns="http://www.w3.org/1999/xhtml" href="my-writings.html" 
shape="rect">my writings</a>
  <a xmlns="http://www.w3.org/1999/xhtml" href="mailto:f.g.haas@gmx.net" 
shape="rect">f.g.haas@gmx.net</a>
  <a xmlns="http://www.w3.org/1999/xhtml" 
href="http://validator.w3.org/check/referer" id="valid-xhtml-anchor" 
shape="rect">
  <img alt="Valid XHTML 1.0!" src="http://www.w3.org/Icons/valid-xhtml10" 
id="valid-xhtml-image"/>
  </a>
  <a xmlns="http://www.w3.org/1999/xhtml" 
href="http://jigsaw.w3.org/css-validator" id="valid-css-anchor" shape="rect">
  <img alt="Valid CSS!" 
src="http://jigsaw.w3.org/css-validator/images/vcss.gif" 
id="valid-css-image"/>
  </a>
</output>

Which is precisely what I want! (Never mind the encoding.) Could someone please 
hit me over the head and tell me what I'm doing wrong, or else explain to me 
why this is happening? Same source doc, same stylesheet, different result 
depending on whether I run Xalan standalone or from inside Cocoon... 
puzzling, to say the least.
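To pin down exactly which targets go missing, here is a rough sketch that compares the two results. The two lists are typed in by hand from the outputs quoted in this message; the variable names are made up for illustration.

```python
# Targets (href and src values) in the output produced inside Cocoon.
cocoon = [
    "../css/tm4web.css",
    "mailto:f.g.haas@gmx.net",
    "http://validator.w3.org/check/referer",
    "http://www.w3.org/Icons/valid-xhtml10",
    "http://jigsaw.w3.org/css-validator",
    "http://jigsaw.w3.org/css-validator/images/vcss.gif",
]

# Targets in the output produced by standalone Xalan.
xalan = [
    "../css/tm4web.css",
    "person.html",
    "topicmaps.html",
    "java.html",
    "austria.html",
    "arno.html",
    "tm4j.html",
    "fhib.html",
    "hico.html",
    "this-website.html",
    "my-writings.html",
    "mailto:f.g.haas@gmx.net",
    "http://validator.w3.org/check/referer",
    "http://www.w3.org/Icons/valid-xhtml10",
    "http://jigsaw.w3.org/css-validator",
    "http://jigsaw.w3.org/css-validator/images/vcss.gif",
]

# Everything Xalan found that the Cocoon run did not, in document order.
missing = [target for target in xalan if target not in cocoon]

for target in missing:
    print(target)
```

Every entry that shows up only in the Xalan list is a relative link to another .html page; nothing else is lost.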

| >| If we work together, I think we'll fix this.
| >
| >Well I hope this helps! :-)
|
| It certainly does. But I've given you a few more assignments above!

I have hopefully fulfilled them to the extent that you expected. :-)

Later,
Florian


