cocoon-users mailing list archives

From Nils Kaiser <>
Subject Re: Crawling over web pages with cocoon (Running a pipeline per page)
Date Tue, 05 Sep 2006 09:50:52 GMT
So you mean you would get a dump of the site and call Cocoon pipelines
for the conversion. I like the idea of doing this in two steps, as it
allows us to check everything and remove unneeded pages before
converting. Maybe a list of the URLs (crawled together by wget or
something else) would be enough, and I could fetch each page's content
from inside the pipeline (HTMLGenerator).
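That URL list could be built straight from wget's crawl log. A rough sketch (all hostnames and file names below are placeholders, and the log format varies between wget versions, so check what yours emits; the `--no-verbose` style shown is typical). The crawl itself would be something like `wget --spider --recursive --no-verbose --output-file=crawl.log http://example.org/toc.html`:

```shell
# Sample of the kind of log --no-verbose produces (placeholder URLs):
cat > crawl.log <<'EOF'
2006-09-05 10:00:00 URL:http://example.org/index.html [1024] -> "index.html" [1]
2006-09-05 10:00:01 URL:http://example.org/a.html [512] -> "a.html" [1]
2006-09-05 10:00:02 URL:http://example.org/a.html [512] -> "a.html" [1]
EOF

# One unique URL per line; pages that should not be converted can
# simply be deleted from urls.txt before the conversion run.
grep -o 'URL:[^ ]*' crawl.log | sed 's|^URL:||' | sort -u > urls.txt
cat urls.txt
```

The resulting urls.txt is exactly the reviewable intermediate step mentioned above: edit it by hand, then feed each remaining line to the pipeline.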

The question is, how would I call Cocoon then? Using the CLI or the 
Cocoon bean?
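Whichever driver ends up doing the calling, the per-page conversion itself can sit in an ordinary sitemap pipeline that fetches the live page with the HTMLGenerator. A minimal sketch, assuming the generator is declared as type "html" (as in the default Cocoon 2.1 sitemap); the matcher pattern, host, and stylesheet name are placeholders:

```xml
<map:pipeline>
  <!-- convert/some/page.html: fetch the live page and transform it -->
  <map:match pattern="convert/**">
    <map:generate type="html" src="http://example.org/{1}"/>
    <map:transform src="stylesheets/cleanup.xsl"/>
    <map:serialize type="xml"/>
  </map:match>
</map:pipeline>
```

With a pipeline like this, the driver only has to request one `convert/...` URI per line of the URL list.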

I saw there is some crawler functionality alongside the bean, but I 
could not find any information about how to use it.

An additional point would be the ability to generate some useful logs 
recording which page was converted and where it ended up... do you do 
something similar?
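Such a log could be as simple as one tab-separated line per page (timestamp, source URL, output file, status), appended as each page is processed. A sketch, where `log_page` is a helper made up for this example and the URLs and paths are placeholders:

```shell
# Append one tab-separated record per converted page to convert.log.
log_page() {
  printf '%s\t%s\t%s\t%s\n' \
    "$(date -u '+%Y-%m-%d %H:%M:%S')" "$1" "$2" "$3" >> convert.log
}

# Example entries (placeholder URLs and output paths):
log_page "http://example.org/a.html" "out/a.html" "OK"
log_page "http://example.org/b.html" "out/b.html" "FAILED"

# Quick summary: how many pages went through cleanly.
awk -F '\t' '$4 == "OK"' convert.log | wc -l
```

The tab-separated format keeps the log greppable, so failed pages can be pulled out and re-run after the main pass.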


> For a one-time job of converting a collection of webpages, I'd use an
> external crawler like wget, and create Cocoon pipelines to do the
> format conversion.
> You'll need a "table of contents" page which generates (at least
> indirect) links to all other pages, and use this page as an entry
> point for wget.
> You could of course do the whole thing in Cocoon, but it's probably
> faster to implement and test with this combination of tools.
> -Bertrand

To unsubscribe, e-mail:
For additional commands, e-mail:
