Mailing-List: contact cocoon-users-help@xml.apache.org; run by ezmlm
Message-ID: <39A6554B.B133E45F@apache.org>
Date: Fri, 25 Aug 2000 13:15:23 +0200
From: Stefano Mazzocchi <stefano@apache.org>
Organization: Apache Software Foundation
MIME-Version: 1.0
To: cocoon-users@xml.apache.org
CC: Cocoon <cocoon-dev@xml.apache.org>,
 	Scott Boag <Scott_Boag/CAM/Lotus@lotus.com>
Subject: Re: comments !!
References: <F23jAKYVpMmOlbcrYP0000006aa@hotmail.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

Robin Green wrote:
> 
> >Any comments on this article:
> >
> >http://www.oasis-open.org/cover/lie-foch.html
> >
> >Manpreet Singh.

I've discussed the above issues with many of the XSL WG people on
several occasions and I came to the conclusion that there is some
truth in that: FO are "somewhat" harmful in the sense that the XSL WG
chose to fight with the CSS WG for some historical reasons.
 
> I agree that losing semantic information is bad, and that this cannot really
> be legislated against.

Please, try to define "semantic information". You can't. There is no
such thing as "global semantics": each semantic is associated with the
context it lives in. So, there is nothing _more_ semantic in

 <docbook:para>Hello World!</docbook:para>

than in
 
 <svg:text>Hello World!</svg:text>

than in 

 <xhtml:p>Hello World!</xhtml:p>

than in

 <fo:block>Hello World!</fo:block>

just a different context of interpretation.
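
A minimal sketch of this point, in Python with the standard library's
ElementTree. The XHTML, SVG and FO namespace URIs are the published
ones; the DocBook URI is an assumption for illustration, and the helper
name `describe` is hypothetical. The text content is identical in all
four cases; only the namespace, i.e. the context of interpretation,
changes.

```python
import xml.etree.ElementTree as ET

# Namespace URIs: the DocBook one is an assumption for illustration.
SNIPPETS = [
    '<para xmlns="http://docbook.org/ns/docbook">Hello World!</para>',
    '<text xmlns="http://www.w3.org/2000/svg">Hello World!</text>',
    '<p xmlns="http://www.w3.org/1999/xhtml">Hello World!</p>',
    '<block xmlns="http://www.w3.org/1999/XSL/Format">Hello World!</block>',
]

def describe(xml_text):
    """Return (namespace, local name, text) for a single-element document."""
    root = ET.fromstring(xml_text)
    # ElementTree spells qualified names as "{namespace}local".
    namespace, local = root.tag[1:].split("}")
    return namespace, local, root.text

described = [describe(s) for s in SNIPPETS]
for namespace, local, text in described:
    print(f"{local!r} in {namespace}: {text!r}")
```

Nothing in the markup itself says which of the four is "more semantic";
that judgement belongs to whatever program consumes the namespace.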

Look at XHTML: many use it as a "very simple semantic markup" if you
leave out the font="" align="" and blah blah attributes that define
semantics in the layout context (a.k.a. style). The Cocoon document DTD
enforces this, reusing the HTML tags where appropriate (no use in
creating new names for tags that work).

But FO is _somewhat_ different from the other markup and gives a "sense
of incoherence" with the other W3C schemas.... I spent 18 months trying
to find out "what" this incoherence is and where it comes from, and now
I think I've got it.

It's due to several things combined:

1) wrong name: XSL stands for eXtensible Stylesheet Language.... but
neither FO nor XSLT has anything to do with style. Style is the process
of adding semantics for the layout context; it doesn't have anything to
do with tree transformation or with defining elements that describe
those semantics.

CSS is the only "true" stylesheet language because it "adds information
orthogonally". This is what "considered harmful" means in the article:
XSLT is not orthogonal.

I proposed to the XSL WG that they change the names of the languages to

 XTL - eXtensible Transformation Language
 FO - Formatting Objects

but they still think XSL is the new DSSSL and this would mean throwing
away their past. Sure, they have the argument that XTL would become too
complex if turned into a "general" transformation language. Well, as an
argument, it's weak as anything: people are already planning to use XSLT
extensively in B2B to transform one schema into another, and styling
done in XSLT (without the use of final CSS) is already considered a bad
practice.

They simply don't want to admit they've been wrong since day one in
fighting CSS instead of adopting it.

This leads to the second part of the problem:

2) FO cannot be styled with CSS.

They made sure something like this is not possible, or, at least, not
recommended in the spec, unlike SVG (a much better effort in all senses)
which makes CSS the very core of its styling part, defining semantics
with the graphic elements and keeping the style at the CSS level.

Why can't FO do the same? Why are we _forced_ to use XSLT to "transform"
(note, not "style") something into FO?

This is the key problem: tree transformation can be used for styling,
but it's a bad practice. It should be avoided. So, instead of turning
this into a "style war", why don't we do the right thing

 unknown schema -> transformation -> known schema [+ style]

where "unknown" is in the context of the program that has to "consume"
the schema (browser, B2B consumer, or other), "known" means a schema
that is known in that context and style information is optional.

A few examples of this would be

 docbook -> xtl -> fo + css
 myDTD -> xtl -> xhtml + css
 biztalk -> xtl -> ebxml
 tableDTD -> xtl -> svg + css
 
This would finally fix the symmetry, it would bring peace to the "style
war" and finally "separate concerns" in between working groups, thus
maximizing throughput.
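
A rough sketch of that pipeline, under stated assumptions: Python's
standard library has no XSLT processor, so a hand-written tag mapping
stands in for the "xtl" step, and the source schema (`memo`, `subject`,
`body`), the mapping and the CSS rules are all made up for
illustration. The point is only that the transformation produces a
known schema without any layout information, and the style stays in a
separate, optional layer.

```python
import xml.etree.ElementTree as ET

# Hypothetical "unknown" schema: the element names are invented.
SOURCE = "<memo><subject>Style war</subject><body>Separate the concerns.</body></memo>"

# Transformation step: map unknown element names to known XHTML names.
TAG_MAP = {"memo": "div", "subject": "h1", "body": "p"}

# Style step: kept outside the tree and attached only through selectors.
CSS = "h1 { font-weight: bold } p { margin: 1em }"

def transform(element):
    """Rebuild the tree with known element names; adds no layout information."""
    out = ET.Element(TAG_MAP[element.tag])
    out.text = element.text
    for child in element:
        out.append(transform(child))
    return out

xhtml = transform(ET.fromstring(SOURCE))
result = ET.tostring(xhtml, encoding="unicode")
print(result)  # the known schema
print(CSS)     # the optional style, delivered alongside it
```

Dropping the CSS leaves a document the consumer still understands;
dropping the transformation does not.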

A single WG (XSL) has been responsible for

 - tree transformations (XSLT)
 - tree queries (XPath)
 - formatting objects (XSL)

sure, when it started it simply had one concern

 - apply DSSSL to XML

but it turned out to be something entirely different and they did a good
job in separating the specs.

Now they should finish the good job and finally separate the WG into
more focused groups, one for each of the concerns they have.... but,
hell, they recently rechartered to keep going exactly the same.

Sharon Adler already asked me: "why do you care about how we work?" I
don't, really, I'm only concerned about what you guys produce as a
byproduct of that work, and what I see, especially in the "general
vision", is not what I like (I love the technology, but I don't like the
ideas behind it... this indicates structural problems to me)

[Scott, I copied you on this because I'd like to hear your comments
(Scott is member of the XSL WG)]

> However, I think the author of the article is missing the wider point:
> Remember, it is very easy to write XML that is almost or completely
> meaningless, either because it is based on a proprietary format or because
> it is not designed to be easily parsed for useful meaning.

The author is caught in the "S"-trap: XSL and CSS both define
'stylesheets', there is clear overlap, which is better?

There is 'no' overlap whatsoever: the XSL WG should try to enforce this
instead of keeping on fighting the "style war", and remove that damn "s"
from their language names!!!!
 
> There is a Plain English Campaign to stop the use of unnecessary jargon by
> public officials, here in Britain. Perhaps, analogously, someone should
> start a Semantic XML Campaign, to campaign for semantically-rich uses of XML
> and semantic preservation over networks? It's not a very inspiring subject,
> sure, but it's an important one from a software engineering point of view.

Careful, this is something different: "unnecessary jargon" can be
translated into "keep the semantics in the appropriate context"... or,
more technically, don't send me a schema I can't understand, or that I
can't translate into something I understand.

A big vocabulary is a sort of transformation: what you call "jargon" is
a schema that is not frequently used in your thinking, or that you might
not know entirely.

Such a campaign is almost equal to the "this page is valid HTML 4.0"
campaign: it raises the visibility of using the appropriate schema for
the required context.
 
> The other thing is market demand driving greater semantics, of course - and
> I think in terms of searching at least, sites will find it very advantageous
> in terms of getting targeted hits, to use richer markup in promoting their
> sites electronically (I'm not thinking of spam, but search engine metatags
> etc.) - and then there's B2B, of course.

This is where, as we say in Italian, the "donkey falls" :) (no offense
intended)

If you think that having more "semantic" schema will ease searching, you
are not only wrong, you are missing a lot of the W3C effort.

The problem of XML (and SGML as well) is the 'babel syndrome': there
will be tons of schemas, sure, lots of semantic content, but how do you
search it if you don't know the schema?

It's turning a language into a babel of strongly typed dialects.

Today, search engines know HTML and try to "estimate" heuristically the
semantic meaning of a particular content, to rate it in a significant
way: their success is based on the "quality" of such heuristics.

People think: when the web is made of XML documents, searching will
be much more "semantic".

Wrong! Dead wrong!

Let us suppose we have such a web (which will take decades to be
created, if ever): you want to reserve your summer vacation on a trip to
the Java island and you want to find out if there is a travel agency
that is cheaper than the one down the street.

What do you search for? How do you know what markup has been used to
publish the page of that Java travel agency?

Ok, let's guess XHTML... then what? 

Hmmm, the search engine accepts XPath queries, but how do you know what
element has been used to mark up what you're looking for?

It's clearly a dead end, it won't pass my father's test, it would die
out.
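
To make the dead end concrete, here is a small sketch in Python. Both
pages and all their element names are hypothetical, and ElementTree
supports only a limited subset of XPath, which is enough here. The same
query finds the price on the page whose schema you happened to guess
and silently finds nothing on the other, although both carry the same
kind of information.

```python
import xml.etree.ElementTree as ET

# Two hypothetical agencies publishing the same kind of fact
# under different, equally "semantic" markup.
PAGE_A = "<agency><trip><destination>Java</destination><price>900</price></trip></agency>"
PAGE_B = "<catalog><offer><place>Java</place><cost>850</cost></offer></catalog>"

# A query written by someone who knows (or guessed) the schema of PAGE_A.
QUERY = ".//trip[destination='Java']/price"

def prices(xml_text, query):
    """Return the text of every element matched by the query."""
    return [el.text for el in ET.fromstring(xml_text).findall(query)]

hits_a = prices(PAGE_A, QUERY)
hits_b = prices(PAGE_B, QUERY)
print(hits_a)  # the guessed schema matches
print(hits_b)  # the other schema yields nothing, with no error
```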

So, let's search for the textual content first: "Java Travel Agency"
with the EN language (hoping the agency has xml:lang="en" text in their
pages).

The result is a list of schemas (not pages!) that contain that textual
reference.

 - programming language
 - geographical information
 - military operations
 - travelling

then you "refine" your visit thru it. (this has been used (and
patented?) recently by a company called xyzsearch or something like
that)

But what if the list is something like

 - XUIURL schema
 - eBUKfj schema
 - DDLT schema

now what? you iterate thru them to find out, big deal!
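
A toy version of that text-first search, as a sketch only: the corpus,
the `urn:example:` namespace URIs and the function name
`schemas_matching` are all invented, and a real engine would of course
not scan documents linearly. It returns schemas (namespaces) rather
than pages, and shows that the result is only as helpful as the schema
names are recognizable.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical corpus; namespace URIs and element names are invented.
DOCS = {
    "doc1": '<trip xmlns="urn:example:travelling">Java Travel Agency, Jakarta</trip>',
    "doc2": '<vendor xmlns="urn:example:programming">Java tools vendor</vendor>',
    "doc3": '<x xmlns="urn:example:XUIURL">Java Travel Agency</x>',
}

def schemas_matching(phrase):
    """Group documents whose text contains the phrase by root namespace."""
    hits = defaultdict(list)
    for name, xml_text in DOCS.items():
        root = ET.fromstring(xml_text)
        text = "".join(root.itertext())
        if phrase.lower() in text.lower():
            # ElementTree spells qualified names as "{namespace}local".
            namespace = root.tag[1:].split("}")[0]
            hits[namespace].append(name)
    return dict(hits)

found = schemas_matching("Java Travel Agency")
for namespace, names in found.items():
    print(namespace, names)
```

"urn:example:travelling" tells you something; "urn:example:XUIURL"
tells you nothing until you open the documents behind it.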

Sure, more semantic information means "potentially" better searches. But
don't "assume" we'll have them: the road is long and bumpy and very
few people seem to understand that (luckily the W3C director surely does)

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<stefano@apache.org>                             Friedrich Nietzsche
--------------------------------------------------------------------
 Missed us in Orlando? Make it up with ApacheCON Europe in London!
------------------------- http://ApacheCon.Com ---------------------