forrest-dev mailing list archives

From Stefano Mazzocchi <stef...@apache.org>
Subject Re: file: implemented (Re: cvs commit: ...)
Date Thu, 12 Dec 2002 19:10:29 GMT
Jeff Turner wrote:
> On Thu, Dec 12, 2002 at 12:13:06AM -0800, Stefano Mazzocchi wrote:
> 
>>Jeff Turner wrote:
>>
>>
>>><rant>
>>>The CLI is evil and should have been drowned at birth.  The Cocoon CLI
>>>can best be described as a crappy 'wget' implementation tacked onto the
>>>side of Cocoon.  It is slow as hell, full of bugs (eg css images) and
>>>practically unmaintained.  Rewriting wget in a corner of Cocoon was a
>>>blindingly stupid thing to do, and I am not about to waste my time fixing
>>>its bugs.  I would rather find a _real_ wget implementation in Java, that
>>>can handle CSS and doesn't do screwy things with filenames, and IF
>>>invoking Cocoon through the HTTP interface proves too slow (unlikely),
>>>then I'd wrap Cocoon in an Avalon block and feed it URLs passed over RMI.
>>></rant>
>>
>>Jeff, tell me, are you aware of how *exactly* the Cocoon CLI works?
> 
> 
> No.  <rant> should be <uninformed rant>.

When I talk about something I don't know, I tend to ask questions first,
then express my opinions. But that's me.

> Still, can you tell my why Cocoon + lightweight HTTP server + a threaded
> crawler like:
> 
> http://jakarta.apache.org/lucene/docs/lucene-sandbox/larm/overview.html
> 
> won't be a zillion times faster?  And have a healthier user community,
> because it is sufficiently general to interest multiple parties.

Extracted from o.a.c.Main.java

     /**
      * Processes the given URI and return all links. The algorithm is
      * the following:
      *
      * <ul>
      *  <li>file name for the URI is generated. URI MIME type is checked for
      *      consistency with the URI and, if the extension is inconsistent
      *      or absent, the file name is changed</li>
      *  <li>the link view of the given URI is called and the file names
      *      for linked resources are generated and stored.</li>
      *  <li>for each link, absolute file name is translated to relative
      *      path.</li>
      *  <li>after the complete list of links is translated, the
      *      link-translating view of the resource is called to obtain a
      *      link-translated version of the resource with the given link map</li>
      *  <li>list of absolute URI is returned, for every URI which is not yet
      *      present in list of all translated URIs</li>
      * </ul>
      * @param uri a <code>String</code> URI to process
      * @return a <code>Collection</code> containing all links found
      * @exception Exception if an error occurs
      */
public Collection processURI(String uri) throws Exception {

The Cocoon CLI extensively uses the cocoon-view to do two major things:

  1) obtaining links
  2) pushing back translated links

The Cocoon CLI does the link translation, but it's Cocoon *ITSELF* that
places the translated links in the right position, and this happens
*before* things get serialized.
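
As a rough sketch of that flow (this is NOT the actual o.a.c.Main code):
the Site interface, fileNameFor() and demoSite() below are hypothetical
stand-ins for the link view, the link-translating view and the file-naming
step, and the relative-path translation is reduced to flat file names to
keep it short.

```java
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Deque;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class CrawlSketch {

    /** Hypothetical stand-in for the two views the CLI relies on. */
    interface Site {
        /** "Link view": the URIs the given resource refers to. */
        Collection<String> links(String uri);

        /** "Link-translating view": the resource with its links rewritten per the map. */
        String render(String uri, Map<String, String> linkMap);
    }

    /** Hypothetical naming rule: strip the leading slash, default to index.html. */
    static String fileNameFor(String uri) {
        String name = uri.startsWith("/") ? uri.substring(1) : uri;
        if (name.isEmpty() || name.endsWith("/")) {
            name += "index.html";
        }
        return name;
    }

    /** Crawls from start and returns file name -> link-translated content. */
    static Map<String, String> crawl(Site site, String start) {
        Map<String, String> output = new LinkedHashMap<>();
        Deque<String> pending = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        pending.add(start);
        seen.add(start);
        while (!pending.isEmpty()) {
            String uri = pending.remove();
            // 1) obtain the links and decide where each target will live
            Map<String, String> linkMap = new LinkedHashMap<>();
            for (String link : site.links(uri)) {
                linkMap.put(link, fileNameFor(link));
                if (seen.add(link)) {
                    pending.add(link); // not processed yet, queue it
                }
            }
            // 2) push the translated links back and let the site place them
            output.put(fileNameFor(uri), site.render(uri, linkMap));
        }
        return output;
    }

    /** Tiny in-memory site used only for demonstration. */
    static Site demoSite() {
        Map<String, List<String>> graph = Map.of(
            "/", List.of("/a.html", "/b.html"),
            "/a.html", List.of("/"),
            "/b.html", List.<String>of());
        return new Site() {
            @Override
            public Collection<String> links(String uri) {
                return graph.getOrDefault(uri, List.<String>of());
            }

            @Override
            public String render(String uri, Map<String, String> linkMap) {
                return uri + " -> " + linkMap.values();
            }
        };
    }

    public static void main(String[] args) {
        crawl(demoSite(), "/").forEach((file, body) -> System.out.println(file + ": " + body));
    }
}
```

Note that the crawler never parses the rendered output: it only asks for
links and hands back a map.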

If you go the wget path you have to implement a link parser and
translator for *every* hypertext-capable binary file format our
serializers can come up with.

On the other hand, by implementing a Cocoon-aware CLI, we are gaining
insights from the actual semantic content of the data and we can
manipulate it when it's *still* semantically meaningful (thus easier to
process).
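
To illustrate why rewriting links before serialization is cheap, here is a
minimal sketch using the JDK's DOM and Transformer APIs. Cocoon works on
SAX events in its pipeline, so this is only an analogy: the class name, the
rewrite() helper and the "any element with an href attribute" rule are
assumptions made for the example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class LinkRewriteSketch {

    /** Rewrites href attributes per linkMap while the content is still a tree, then serializes. */
    static String rewrite(String xml, Map<String, String> linkMap) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

            // While the data is structured, a link is just an attribute value.
            NodeList all = doc.getElementsByTagName("*");
            for (int i = 0; i < all.getLength(); i++) {
                Element element = (Element) all.item(i);
                if (element.hasAttribute("href")) {
                    String target = linkMap.get(element.getAttribute("href"));
                    if (target != null) {
                        element.setAttribute("href", target);
                    }
                }
            }

            // Serialization happens last; nothing downstream needs to parse links.
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not rewrite links", e);
        }
    }

    public static void main(String[] args) {
        String page = "<p><a href=\"/a.html\">A</a> <a href=\"http://example.org/\">ext</a></p>";
        System.out.println(rewrite(page, Map.of("/a.html", "a.html")));
    }
}
```

The same rewrite applied after serialization would need a separate parser
for each output format.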

Don't know about others, but I think it's a much more elegant (and 
code-wise cheaper) solution than a semantically-unaware wget-like one.

But again, that's me.

-- 
Stefano Mazzocchi                               <stefano@apache.org>
--------------------------------------------------------------------


