forrest-dev mailing list archives

From "Marc Portier" <>
Subject [proposal] Design for build(s) and related.
Date Mon, 19 Aug 2002 12:20:16 GMT
Hi all,
I've been promising some actual work (build refactoring,
libre redesign, ...) and I recently haven't come any further
than submitting a documentation change of a single & into a
% character.
Although it is not visible to you, I have been spending quite some
time thinking about all of this... my brain keeps running in circles
over a number of topics,
some of them have been half-expressed in threads here and
there, I needed this round-up at least for my own sanity,
maybe it helps your mental health as well :-)
I would like to be as productive with your time as I
possibly can, yet I think some random thoughts need to be
expressed to make myself (totally un-) clear.
Bottom line is I'm not certain enough on these things to
call on votes, at this stage opinions and feedback would be
My observations and motivations are honest but my knowledge
is limited, so please help out in the sections where I'm
just plain wrong or incomplete.
Challenging is great, offering corrections and other
solutions is even better.
Welcome to yet another big fit-in exercise.


[opinion: on the end of meaning of filename-extensions]
What is XML doing to our filename extensions?  It is making
them damn useless, that is what!
Since everyone has become so fond of this XML thing the only
file-extension en vogue seems to be .xml, and no matter what
the file is about someone decided that it should open in ie6
when I double-click it. In fact that is not too bad, since
every now and then that _is_ quite satisfactory in terms of
getting a view on what is in there.  On the edit side of
things it's a bit different: I'm using a number of different
XML editors and all of them tend to be used for different
purposes:
- Pollo for coco-xmap, ant-build.xml,..
- Xmetal for the document writing ones:
ot-presentation-ware, xdocs, xhtml
- gvim for the short ones and/or when the change is a quick
hack (or I don't want to have my editor do whitespace
rewritings that mess-up cvs diffs)
- XMLSpy for the tabular-like DTDs
- Excelon Stylus for the non-doctyped ones that get defined
as I write them
- Intellij 3.0 ea for the ones that somewhat relate to
Java-code working on them (the convenience of having them in
the same environment)

In fact somebody should write some small app that gets
configured to startup when I double-click *.xml and based on
reading the doctype and some config file decides on
launching the correct app. Wouldn't that be neat?
(naah really neat would be the content-management filesystem
that does this out of the box)
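
A minimal sketch of such a launcher, assuming a hand-maintained table
from DOCTYPE name to editor. The table entries, the default editor and
the function names are hypothetical; only the idea of reading the
doctype and consulting some config comes from the text above.

```python
import re

# Hypothetical config: DOCTYPE name -> editor command (illustrative only).
EDITOR_BY_DOCTYPE = {
    "faqs": "xmetal",
    "document": "xmetal",
}
DEFAULT_EDITOR = "gvim"  # assumption: fall back to a plain text editor

# Captures the name that follows <!DOCTYPE, e.g. "faqs".
DOCTYPE_RE = re.compile(r"<!DOCTYPE\s+([^\s>\[]+)")


def doctype_of(xml_text):
    """Return the name declared in <!DOCTYPE ...>, or None if absent."""
    match = DOCTYPE_RE.search(xml_text)
    return match.group(1) if match else None


def choose_editor(xml_text):
    """Pick an editor from the doctype, falling back to the default."""
    return EDITOR_BY_DOCTYPE.get(doctype_of(xml_text), DEFAULT_EDITOR)
```

Files without a doctype (the "defined as I write them" kind) simply
end up in the default editor.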

Well my premise is not really correct; there still is some
filename extension diversification on my hard-disk:
[1] There is the whole lot of increasingly less interesting
ones I hope to see banned before the end of my lifetime:
data files that are (so called) 'managed' by a specific
application.  The truth often is that that app 'stole' my
data, and I politely have to ask that app if I can please,
see, edit, print, search... my own information.  *.xls,
*.doc, *.whachamacallit
Let us forget about these in terms of editable things,
instead (with POI) these could become one of:

[2] There is a fair amount of view-type file extensions:
*.jpeg, *.gif, *.html, *.pdf, *.the-media-types that are to
be considered as consummation-only types: we should not edit
them. The extension is a hint for choosing the decoder/viewer
to actually get some consumable format, either spots on
paper or pixels on screen, the last ones could be moving,
and of course other senses could be triggered (audio)

[3] Archiving file extensions: *.arj, *.tar.gz, *.zip which
feel like an odd type of directory/folder (which by some
quirk in history are most often seen extension-less) rather
than one of the previously mentioned view-types. (although:
self-extracting executables, or archives of e.g. a full
html-site with index.html starting point and in-lined images
could be seen as a view/consummation type of thing?  If only
all of these would have some manifest file in there
describing their meaning of existence... heck that _is_ the
same for directories)

[4] The xml-world by itself seems to keep a nicer
extension-based distinction between the formats it is
bringing to life itself: *.xsl, *.dtd, *.xsd, *.fo,
*.svg,...  looking at the catch-all *.xml, this practice
seems to be somewhat contradictory, no? (except for the
non-xml DTDs of course) To the uninitiated, *.svg really doesn't
look like it is written down in xml.

What I'd like to conclude at this stage is that the concept
we all know as the file-extension:
- Is overused for catching different aspects of really
what_its_for, what_its_about, how_its_encoded, ...
- Has lost meaning in the catchall *.xml case.
(this last remark equally holds for the MIME-TYPE text/xml:
is svg supposed to be communicated as such? why (not)?
I googled for this: there seems to be image/svg+xml,
image/svg-xml and text/svg+xml, does anyone actually know?)

[proposal 1]
Let us use hierarchical filename extensions for our source
They should be capable of holding the different aspects in
distinct and recognizable ways.
At this stage, I think we can pull it off with two steps in
the extension:
The pattern then becomes
*.document-type-part.content-consummation-part

here the document-type-part surely reflects the !DOCTYPE in
XML files
examples: *.faq.*, *.doc.*, *.howto.*, *.build.*

while the content-consummation-part should provide a hint
to the encoding and thus the viewer to use to visualize it
examples: *.*.xml (native structure format), *.*.html,
*.*.pdf, *.*.jpg
(this is pretty much the filesystem substitute for what the
HTTP header Content-Type is doing, just didn't want to call
it content-type-part, as it is possibly confusing towards

completed we get examples like:
what_is_forrest.faq.xml: src xml file (native structure
format) of a faq about what forrest is
forrest_contract.doc.pdf: pdf version of a document on the
contract between forrest and its users
cvs-ssh.howto.html: html version of a howto on using cvs
over ssh
MyFutureClass.xjava.html: html version (probably syntax
highlight) of an original Java source file (this suggests
that there was some other process that generated the marked
up xjava version out of
MyFutureClass.xjdoc.pdf: pdf version of some xdoclet
produced javadoc for the same class. (comparable suggestion
on xjdoc generation)

What about unstructured (historical) documents?
examples: mpo.jpeg, index.html
Dunno, I guess they are, so let them? Forrest should just
'read' those, not work on them anyway.

Could we drop the *.xml?
As in the *.xsl and *.xsd examples we could decide on
dropping the *.xml altogether.  Just using one extension
part would then assume the .xml suffix.  However we risk
losing the clear distinction with unstructured/historical
files.  Having only one extension part should mean: they
don't know about using two.
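
To make the two-part reading concrete, here is a small sketch of how a
filename could be split under this proposal. The function name and the
tuple layout are my own assumptions; the rule that a single extension
part means "unstructured/historical" follows the paragraph above.

```python
def split_name(filename):
    """Split a name into (base, document_type, consummation).

    'what_is_forrest.faq.xml' -> ('what_is_forrest', 'faq', 'xml').
    A name with only one extension part is treated as an unstructured
    (historical) file, so its document_type is None.
    """
    parts = filename.split(".")
    if len(parts) >= 3:
        # The last two dot-separated parts are the extension trail.
        return ".".join(parts[:-2]), parts[-2], parts[-1]
    if len(parts) == 2:
        return parts[0], None, parts[1]
    return filename, None, None
```

With this reading, index.html and mpo.jpeg carry no document type and
would only be 'read', not worked on.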

The thin line between document-type and content-consummation
could be hard to draw:
Are xhtml and svg (to mention only two) document-type
describing or rather content-consummation describing?
Maybe we don't need to decide that in general, but instead
let the use in practice just define how it was intended (by
author/publisher) in each particular case? Next to that I
don't see why there would be no room for something like

Looking at it from this angle the multiple parts of the new
extension become like a route or a trail describing how to
get from *.metric.xml to jpg via svg. (Supposing there could
be more than one route)

Just being practical: (since we generate static sites) would
anyone know if this line of thinking can be applied to
filenames on CD-ROMS?


[opinion: on the units of content management: the file and
the directory ]
The bags we call directories on the filesystem are under the
control of the content creators.  They decide upon using
them primarily for their management activities.  It is the
typical unit on which they:
- set rw permissions
- control ownership
- archive, backup, restore, move
- are able to have cvs subsection actions (be it commit,
update, diff...)
These concerns therefore can pragmatically take precedence over
the concern to group all files that address a common subject.
(we hope that both will not be in collision)
To group all documents of a given document-type is normally
only the last concern to get any attention in the play.
(and that is okay IMHO)

I'd like to observe that:
We are likely to find at all times different types of
documents mingled in one directory.
Even if some of them will always (but never say never) be
put into separate directories.
Reversed: a directory (name) can not be used to identify the
various types of documents the contained files are able to

[proposal 2]
Let us avoid using the leading part of path-identifiers (and
URIs) to have anything to do with 'type' of documents.
[proposal 2bis]
And as an on-the-side, minor proposal: let us again put
images that are only in-lined in one document, with that
document.  (Having a central image-bank is another thing,
should be considered, but is not what we are currently
seeing a need for)


[opinion: on separating all concerns up to the level that we
need to express them all inside ONE (too short) URI]
In the web-world according to Cocoon, incoming requests
drive everything.  This everything is far more than merely
where that resource is stored on the local hard drive.
Careful design of the URI request space needs to reserve
space in that URI string to express the different aspects of
how the content is to be retrieved, presented, and encoded.
The different aspects I see:
- still find the resource-file to start from
- be able to select the correct pipeline.
  that pipeline becomes the one thing that is capable of
	(1) producing the output-format as promised by the URI link
	    (find the pdf <a href="**.pdf"> here</a>)
	(2) starting from the document-type we are finding
- decide on the additions to the document (the things it
should be aggregated with)
  being the navigation view (a meaning for tab?) to apply to
it, and the chosen skin
- other generation customizations (run-time formatting
parameters) we haven't encountered just yet.

Some of these aspects (like the skin is today) can be fixed
for the whole site being generated.
Generating static site versions means we cannot express any
of these with ?param=value additions
The normal path-like remainder allows for naturally
organizing hierarchical aspects (aspects that have some
super-sub or containment relation to each other)
In this case however we will be forced to see it as just
position-sequential parts, which means we will need to just
count down and give the different aspects a position in the
The worst thing about this is that the resulting scheme can
possibly break when we need to consider new aspects in the
scheme in the future.  (see
So we need to think ahead now, or need some smart idea for
making it extensible.

Given the fact of the static generation (fact being that the
result needs to be stored and published as is on a dumb
webserver) the forrest URI request space needs to map an
actual file and directory layout (could still largely be
different than the src docs layout of course).  This poses
some extra constraints on which parts of the URI we can use.

[proposal 3]
The proposal for having the from-type-to-type-trail file
extensions can be reused inside URIs as well.  They will
help us select the pipeline to apply AND they will (more or
less) double-check that the input-type for the pipeline
indeed maps all the actions (transformations) we are going
to run over it.
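
A rough sketch of what selecting on the type trail could look like,
outside of any real sitemap. The pipeline table and the stylesheet
names in it are invented for illustration; only the idea of keying on
the trailing document-type and output-format parts is taken from the
proposal.

```python
# Hypothetical table: (document_type, output_format) -> pipeline description.
PIPELINES = {
    ("faq", "html"): "faq2document -> document2html",
    ("faq", "pdf"): "faq2document -> document2fo -> fo2pdf",
    ("howto", "html"): "howto2document -> document2html",
}


def select_pipeline(uri):
    """Select a pipeline from the trailing '<type>.<format>' of the URI."""
    last_segment = uri.rsplit("/", 1)[-1]
    pieces = last_segment.split(".")
    if len(pieces) < 3:
        return None  # no type trail in the name: nothing to select on
    # Unknown (type, format) pairs also yield None.
    return PIPELINES.get((pieces[-2], pieces[-1]))
```

The leading directories of the URI play no role in the choice, which
leaves them free for the other aspects.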

This allows the pipeline matchers to work on the
trailing part of the incoming URI,
leaving us with the leading part to grasp the rest.
That rest ....

??? is it more than finding the actual docs on the files?
(oh I hope it is not)
??? what are the tabs doing?
??? what decides in which tab you are living?


[opinion: on inversion of control]
Well, I was preparing this... and then: this guy said it all
a lot better:

The most used build-target for the forrest project (apart
from clean) must be docs.
My feeling is that it fails at really telling people
what forrest is about.  I know we might be struggling to pin
down what it _is_ about, but we could easily agree that it
is _not_ about having all our users publish the (or) equivalent on their websites.

The 'docs' target to me is performing the functional
equivalent of 'testing' what a 'build' of forrest can do.
That we do it on our own documentation stuff makes all the
sense in the world. We are as good a case as anyone else's
project to start from, but we should stop letting our build
file pretend it is the *only* case.

Our build system has no clear thing to build, and thus lacks
some visible production to be reused.

[proposal 4]
Solving this very issue has been living under the umbrella
'refactoring build.xml with the ideas of the forrestbot'
Actually doing it however reveals more things that are not
- with the bot (how it could be both extended and reused -->
further reading)
- with the existential thing forrest should become:

And that is: an ant task.
However that is a long way off (I know Nicola is working on
one, but I would like to call his vision the
cocoon-generator task since it is focusing on that point
In fact it could be questioned at this stage whether it is
feasible at all to cram all that complexity into a task.

The great thing however is that Nicola is offering us
There are a lot of great things to read about it over at (I'm still
catching on) but I would like to present it to you now as a way to
call cents.  Cents being
complex ant-like tasks, that (and this is great) can be
packaged as separate projects almost: with their own
resources in files and directories and an actual build.xml
(called the xbuild.xml)
(Of course would have been a far better
name :-))

The consequence of this is that we give control back to the
projects that use forrest.  So in good IoC-tradition
(Hollywood principle: don't call us, we'll call you) we end
up waiting in the ant chain of dependencies to get called.
This rather than BE the template build.xml and directory
structure that is forced upon your project.

Looking at the interface of this forrest-cent we will need a
complex set of arguments that parameterize the site-assembly
process.  It will be best to catch those in some
configuration file (I even think centipede proposes such a
thing: properties.xml?)  Where the generated site needs to
be put on hard disk will surely be another thing we need to
be informed about.

For those that do not like the centipede dependency (they
exist) we should foresee also a means to package up forrest
in different incarnations (more useful targets).  What comes
to mind is:
- a full independent bin distribution one needs to install,
then set FORREST_HOME and call|bat to have it do
its work --> this would be typical for forrestbot setups on
servers that unlike workstations would not even have ant in
place or such
- a maven plugin (hoping Jeff Turner stays around)
- a nicely handmade and standalone Java program? leading to:
- as said the ant task and thus the jar to add to the

[proposal 4bis]
On-the-side, the proposal on a more organizational level
would be to more and more actively promote forrest, and
support our users with fewer excuses
and more easily applicable and working solutions.
Oh yes, less docos like this as well probably :-)


[opinion: on files we don't know about.]
Currently forrest is supporting a handful of DTDs (faq, howto,
xdoc, status, changes, dtd...) and stands (paralyzed?) at
the dawn of thinking how to ever work on all the other stuff
(maven is doing) typically needed in project sites.
- junit test reports
- java source marked up for syntax highlighting
- javadoc pages
- mail archives
- ...
Some of these will be dynamically pulled through a
generator, some of them however will need to be prebuilt by
another process (ant task, cent or other)
(javadoc is a great example of this: you _need_ to produce
the whole set in one run)

Also, inside projects that want to use forrest as their
documentation and site publishing thingy people might have
been using different standards (oh no, here comes the next
docbook discussion thread)

The two-fold challenge/question is
1. Where to put/find these files
- with the purpose of picking them up in the site generation
- with the intent to cross reference them from other parts
of the documentation system (including the navigation, be it
via book or libre)
2. and how to augment the sitemap with possibly required new
pipelines (e.g. to use the new your-type2xdoc.xsl)

Forrest will need to work on files we don't know about.
We need to provide some mechanism to extend
- the number of document types for which we provide support.
(new pipelines?)
- the location of these babes in a way they can be
referenced (from documentation, or navigation)
While doing the exercise we should consider making it such
that our users can use it as well.
For most of them building a mount-sitemap will require a
learning-investment that is not justified by what they get
out of it. (They would rather ant-style along if you ask
me)

[proposal 5]
Restating the multipart file-extension solution is getting a
bit overly satisfied with my own thinking of course...  (btw
Steven was the first using it for the *.dtdx.*, I just
happened to like the elegance of how that was added to

The additional thing at this stage would be to let
forrest-using projects declare which other document-types
they want to throw in.  Some configuration file that allows
them to express which document types they use, how to
recognize them using the file-extension, which public
identifier they propose, where the DTD and the appropriate
*2xdoc.xsl is.

A typical snippet of configuration xml could be:

    <!-- for *.stuff.xml files -->
      name="great stuff"
      publicId="-//MY CORP//DTD For Great Stuff//EN"

And maybe even for non XML docos that requires/provide a
cocoon generator to start with:

    <!-- for *.java files -->
      name="java source code"

Out of this the actual sitemap (parts of it) could be
Equally the catalog file could be rendered.  Both placed at
predefined locations
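
As an illustration of rendering the catalog out of such a declaration,
here is a sketch that reads a doctype list and emits plain-text
catalog entries. The element name doctype and the attributes
extension, dtd and stylesheet are assumptions of mine (only name and
publicId appear above), and the dtd path is a made-up value.

```python
import xml.etree.ElementTree as ET

# Hypothetical declaration; element/attribute names are illustrative.
CONFIG = """
<doctypes>
  <!-- for *.stuff.xml files -->
  <doctype name="great stuff" extension="stuff"
           publicId="-//MY CORP//DTD For Great Stuff//EN"
           dtd="dtd/stuff.dtd" stylesheet="stuff2xdoc.xsl"/>
</doctypes>
"""


def catalog_lines(config_xml):
    """Return one 'PUBLIC "<publicId>" "<dtd>"' catalog entry per doctype."""
    root = ET.fromstring(config_xml.strip())
    return [
        'PUBLIC "%s" "%s"' % (doctype.get("publicId"), doctype.get("dtd"))
        for doctype in root.iter("doctype")
    ]
```

The sitemap fragments could be produced from the same list in the same
way, e.g. one matcher per declared extension.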

As stated before: forrest internally should use the same
mechanism, maybe that could double as the pre-packaged set
of core document-types so people don't need to do that work
again. (sure enough, the quality of our current set of DTDs
is what is drawing and keeping our current user base, so let
us leave room for that appreciation in the future)

As for the problem of where to put (and how to reference)
these thrown-in files... there should just be a list of
sections we could link to, in combination with some
aliasing scheme and a config that explains where these
files are inside your project (at the time you call the
forrest cent task)

Something along the lines of
    <part name="xdocs"
location="./src/documentation/content/xdocs"  ref="/" />
    <part name="xjavadoc"      location="./build/xjavadoc"
ref="/javadoc" />
    <part name="xjunit report" location="./build/junit"
could do the trick.

Combined with some generated ant copy-tasks (a bit of the
way forrestbot works) that move the described parts to the
cocoon context directory (into the position the @ref is
FYI: The cocoon context directory is what the off-line
generation process is using as a starting point to find
sitemap, content, stylesheets,....)
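
To show what the generated copy steps would have to compute, here is a
sketch that turns the part declarations into (source, destination)
pairs. The context directory name used in the usage note is
hypothetical, and only the two parts whose location and ref are both
given above are listed.

```python
import posixpath

# The two fully specified parts from the example above.
PARTS = [
    {"name": "xdocs",
     "location": "./src/documentation/content/xdocs", "ref": "/"},
    {"name": "xjavadoc", "location": "./build/xjavadoc", "ref": "/javadoc"},
]


def copy_plan(parts, context_dir):
    """Return (source, destination) pairs placing each part at its @ref."""
    plan = []
    for part in parts:
        relative = part["ref"].lstrip("/")  # "/" becomes "", "/javadoc" -> "javadoc"
        destination = posixpath.normpath(posixpath.join(context_dir, relative))
        plan.append((part["location"], destination))
    return plan
```

Calling copy_plan(PARTS, "build/context") places the xdocs at the root
of the context directory and the javadoc under its javadoc
subdirectory; each pair could then become one generated ant copy task.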

The next thing we need is an addressing scheme based on this
to allow cross references between e.g. handwritten xdocs and
generated stuff.  That could be done with simply using the
link href="/junit/..." of course.  This approach however
assumes that the same person writing the href in the content
can control the mentioned configuration fragment (and vice
versa).  In cases where the hrefs could even be generated by
another tool, that assumption only gets more optimistic
(less likely)
Next to this, to complete:
- Relative links don't start with a / and remain inside one
- The navigation building files (book.xml?) should use the
same scheme.

In every case augmenting to
    <part name="xjunit report"
      <link-alias name="/test-reports" />
      <link-alias name="/unit" />
would catch the idea of a content-part claiming more of the
reference-space to be pointing to themselves.

Again, generating matchers (i.e. parts of the sitemap) out
of this that redirect to the reference would make sure the
correct document is retrieved.
On the other hand using this information inside a smart
transformer that gets applied just before the skinning would
be even better, since that would reduce the number of
possibly duplicated copies in the generated site (specially
bad for generated pdfs) This beast should be on the look-out
of href attributes and replace found link-aliases with their
actual ref.
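
The core of that transformer could be as small as the following
sketch. It assumes the junit part lives at /junit (taken from the href
example earlier, the @ref itself is not spelled out), and it works on
a single href string rather than on a real SAX stream.

```python
# Alias -> actual ref; "/junit" as the actual ref is an assumption.
ALIASES = {"/test-reports": "/junit", "/unit": "/junit"}


def resolve_href(href, aliases):
    """Replace a leading link-alias with its actual ref.

    Relative links (not starting with '/') are returned untouched.
    """
    if not href.startswith("/"):
        return href
    # Try longer aliases first so overlapping aliases resolve predictably.
    for alias in sorted(aliases, key=len, reverse=True):
        # Match whole path segments only: '/unit' must not catch '/unittest'.
        if href == alias or href.startswith(alias + "/"):
            return aliases[alias] + href[len(alias):]
    return href
```

Because the rewrite happens before skinning, every alias ends up
pointing at the one generated copy instead of producing duplicates.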

[proposal 5bis]
On-the-side minor proposal added here since the recent xhtml
thread: just maybe there should be some soc-consideration
between xdocs and xhtml2.  While xdocs seems to be a nice
way to write down relatively simple documentation, use of (a
subset of) xhtml2 as the intermediate (just before skinning)
looks to me like not such a bad choice.  I understand that
it needs more investigation and the docuheads on the list
could help us out a great deal... it is just my feeling that
it would be helpful to other people if they could re-use
some ready my-type2xhtml rather than rewrite that towards
my-types2xdoc format. So the or-or question to me could get
an and-and answer: different purpose, different document


[opinion: on cross references and linkmaps.]
In fact just thinking about this brings back memories of something
like the first cocoon-dev message I've seen (that I remember
at least).

Rereading it now again (I was hoping for hints), I realize
that I've probably never understood this in the first place.
("Only two things are infinite, the universe and human
stupidity, and I'm not sure about the former." -Albert
Einstein )
So please indulge my ignorance...

The sitemap is a great gift.  But it is not solving world
famine.  It manages the distribution of incoming requests,
but (irreversibly) it fails to produce the map of all
available resources it is managing! Put in other words: the
URI that would fit the common (i.e. not the cocoon
connotation) web-description 'SiteMap' would never be based
on the sitemap.xmap.

(This pretty much in the same way that website-authorization
files do a good job in blocking requests that are not
allowed, but leave it to other systems to produce a
navigation and set of cross-refs that only contain links to
files you have access to.)

Navigation should not try to reverse the information in the
sitemap, because it will not succeed.
In the overall solution however, 'Navigation' (cross
references) must be able to (via the end user) close the
loop the sitemap has opened: Sitemap points from URI to
content.  Content wants to link back to URI.

[proposal 6]
I would like to propose the forrest SitePlan. (it is in this
early stage probably rather incomplete, but hopefully this
gets us going.)
This just grabs together what we already had going...


  <!-- section for the file types -->

    <!-- for *.stuff.xml files -->
      name="great stuff"
      publicId="-//MY CORP//DTD For Great Stuff//EN"

    <!-- for *.java files -->
      name="java source code"


  <!-- section for describing the content for the cocoon
context dir -->
    <part name="xdocs"
    <part name="xjavadoc"
    <part name="xjunit report"
      <link-alias name="/test-reports" />
      <link-alias name="/unit" />


At the heart of forrest there should be one of these as
well. Projects using forrest can slide in their own to
augment/override the settings in there.

This project specific siteplan should be joined with the one
from forrest-core that serves as a fallback and an example.
From this file the catalog file gets generated.
From this file the sitemap (xslt task) gets generated.
From this file a temporary ant build file is generated.
This file is input to the forrest-cent.
This file is picked up by the future forrestbot as well...
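
The join of the two siteplans could follow a simple rule: entries from
the project win over entries with the same name in forrest-core, and
everything else falls through. A sketch, assuming both plans have
already been read into nested dictionaries (section name, then entry
name); that in-memory shape is my assumption, not part of the
proposal.

```python
def merge_siteplans(core, project):
    """Join a project siteplan with the forrest-core fallback.

    Both arguments map a section name (e.g. 'doctypes', 'parts') to a
    dict of entries keyed by name. Project entries override core
    entries with the same name; all other core entries are kept.
    """
    merged = {}
    for section in sorted(set(core) | set(project)):
        entries = dict(core.get(section, {}))      # start from the fallback
        entries.update(project.get(section, {}))   # project settings win
        merged[section] = entries
    return merged
```

The catalog, the sitemap and the temporary build file would then all
be generated from the merged result.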


[opinion: on the public interface of build files]
To let the forrestbot step in just like that, we are however
missing some more information.

Basically this lack of intelligence (CIA meaning of the
word) is a side effect of the cent-approach: Since we
inverted control (see one of the previous topics) forrest is
no longer on top of things: we don't control all details of
the build-file any more! (because we want to hide those, and
thus not be put into the user's project build file)
However, the siteplan introduced the notion of different
content-parts that need to be moved into the cocoon-context
directory before the site generation takes place.  Some of
the files in those content-parts might have been generated by
other tasks (javadoc example).  Which one of those tasks we
depend upon is unclear to the outside forrest-bot since it
is the project ant-file that controls that.

Unless we parse the build-file? (that would be spying
around, uh) No not even. Parsing alone will not help you,
additionally you will need to understand the various
internals of the called tasks to really know where the stuff
you need is being put by the various tasks.  This is because
ant-build-files (when considered as being objects or
components) have a public interface consisting of only
voids.  To the outside world they list a number of targets
(methods) to call.  These targets are somewhat virtual names
relating to actual real-life 'productions' that are made by
the enclosed tasks.  (Setting a check-property is of course
an equally virtual 'production',  but you catch my drift,
right?)  With the 'being void'-statement I draw attention to
the fact that the generated 'productions' are but hidden
side-effects, only known to the implementer of the ant-build
file. (Like after you have called an ant target for some new
project, you always have to guess where in the ./build (or
was it ./dist?) you can start looking around for something
you recognize/expect/hope for?)

I lack the experience and insight in the bigger running of
things in the jakarta and the apache world, but I would love
to see this discussed in a bigger forum.
Taking this up on the forrest level seems to be not fitting
the bigger scope of it.
The very basic view I currently have is for ant files to be
able to list in their -projecthelp actually where which
productions are to be found.  This could be based on the
fact that the <target> would have some means of expressing
that.  (Note that there could be more than one production
per target, that productions could be root-like-directories,
and that one could still choose not to 'return' == 'make
known to public' all of them).  Additionally calling ant at
the commandline (or via the <ant> task of course) could maybe
have an option to specify where (some of the) productions
(by name?) should be placed (or copied to).
This would allow for
- bots like ours to be written once and for all: just know
which ant task to call on the project, and know where the
result is put. That reduces the workstages to the abstract
3: let the bot get the src, call that specific target,
deploy the result it generated. Done.  The forrest-bot
becomes an ant-bot since it will work for anything that is
- also gump like integration-meta-projects could be defined
in terms of a meta-build file (simple ant again) that
expresses which targets on which projects to call and be
source to whatever target in the other project...
- finally ant-based-installation (like the acorn patch from
Jeff Turner showed) could be based on a general, reusable
idiom here.
oh, well...

Ant targets are voids with noticeable side-effects (known to
the implementer)
The actual productions however are not 'returned'.  As a
result tasks on the outside of the build script may depend
on them, but cannot get a hold of them. (unless you open the
ant file and track down where your 'production' was put.)

For the outside forrest-bot this has the net effect that we
need to capture the knowledge of the dependent tasks inside
the project.

[proposal 7]
It is a bit avoiding the discussion at this stage, but
pulling some of it back into the forrest arena, the siteplan
could probably list the ant targets we depend on (since it
was already listing that required output-location)

    <part name="xjavadoc"
      <ant file="./build.xml" target="api-docs" />

It does create some bad vibes around duplicating the
knowledge inside the ant file (the @depends of the target
that calls the forrest-cent) and thus opens ambiguity for
users that would expect also the local forrest target to be
automatically calling the <producer> tasks?  So they would
omit the @depends expecting forrest to do that.
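
For the bot side, a sketch of how the listed targets could be turned
into ant invocations. It assumes the <ant> element sits directly
inside <part> (the exact nesting is not fixed above) and it only
builds the command lines, it does not run them.

```python
import xml.etree.ElementTree as ET

# Hypothetical siteplan fragment; the wrapping <parts> element and the
# direct <ant> child are assumptions made for this example.
SITEPLAN_FRAGMENT = """
<parts>
  <part name="xjavadoc" location="./build/xjavadoc" ref="/javadoc">
    <ant file="./build.xml" target="api-docs" />
  </part>
  <part name="xdocs" location="./src/documentation/content/xdocs" ref="/" />
</parts>
"""


def producer_commands(fragment):
    """List the ant command lines to run before the parts get copied."""
    root = ET.fromstring(fragment.strip())
    commands = []
    for part in root.iter("part"):
        for ant in part.findall("ant"):
            # 'ant -f <buildfile> <target>' runs one target of one build file.
            commands.append(["ant", "-f", ant.get("file"), ant.get("target")])
    return commands
```

Parts without an <ant> child (like the hand-written xdocs) contribute
no command and are simply copied as they are.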


We're in a pursuit for meaning, something that in any
possible way will lead to solutions of the 'Turtles all the
way down'-type:
In this case it will be more and more dots in filenames :-)

"Would you tell me, please, which way I ought to go from
here?"
"That depends a good deal on where you want to get to," said
the Cat.
"I don't much care where --" said Alice.
"Then it doesn't much matter which way you go," said the
Cat.
"--- so long as I get somewhere," Alice added as an
explanation.
"Oh, you're sure to do that," said the Cat, "if only you
walk long enough."
(Lewis Carroll, Alice In Wonderland)

Thanks for walking up to here.

Unresolved issues I see:
- what are tabs (and could they be requiring a prefix in our
- use cases for generating navigation-tree-structures for
the site.  (libre is to be seen as a way to generate those
based on rules)

Marc Portier
Outerthought - Open Source, Java & XML Competence Support
