cocoon-dev mailing list archives

From "Robin Green" <gree...@hotmail.com>
Subject Re: comments !!
Date Sat, 26 Aug 2000 14:04:36 GMT
Stefano Mazzocchi <stefano@apache.org> wrote:
>Robin Green wrote:
> > I agree that losing semantic information is bad, and that this cannot
> > really be legislated against.
>
>Please try to define "semantic information". You can't. There is no
>such thing as "global semantics"; semantics are always associated with
>the context they live in. So, there is nothing _more_ semantic in
>
>  <docbook:para>Hello World!</docbook:para>
>
>than in
>
>  <svg:text>Hello World!</svg:text>
>
>than in
>
>  <xhtml:p>Hello World!</xhtml:p>
>
>than in
>
>  <fo:block>Hello World!</fo:block>
>
>just a different context of interpretation.

You're quite right, of course. The context-relativeness of information, and 
the difference between the technical Shannon sense of "information" and the 
more everyday sense of "useful information", are two important things that 
tend to be obscured by (for example) many strong-AI proponents in 
computing, philosophy and cognitive science, as Raymond Tallis explains 
excellently in his book "The Explicit Animal" - but that's getting far off 
the point! :-) At least it shows I'm not alone in thinking this.

Anyway, I didn't really mean some incoherent notion of "absolute loss of 
semantics"; I just meant loss of semantics relative to some desired 
use-case.

>Look at XHTML: many use it as a "very simple semantic markup" if you
>leave out the font="" align="" and blah blah attributes that define
>semantics in the layout context (a.k.a. style). The Cocoon document DTD
>enforces this, reusing the HTML tags where appropriate (no use in
>creating new names for tags that work).
>
>But FO is _somewhat_ different from the other markup and gives a "sense
>of incoherence" with the other W3C schemas.... I spent 18 months finding
>out "what" this incoherence is and where it comes from, and now I think
>I've got it.
>
>It's due to several things combined:
>
>1) wrong name: XSL stands for eXtensible Stylesheet Language.... but
>neither FO nor XSLT has anything to do with style. Style is the process
>of adding semantics for the layout context; it doesn't have anything to
>do with tree transformation or with defining elements that describe
>those semantics.

And because XSLT and FO were originally one unified spec, XSL, people are 
easily confused.
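
To make the split concrete (just a sketch, with made-up element content): 
XSLT is the transformation half and FO the formatting half of the old 
unified spec, yet they usually appear tangled together in one stylesheet:

  <!-- XSLT half: tree transformation -->
  <xsl:template match="para">
    <!-- FO half: layout semantics for the result tree -->
    <fo:block font-size="12pt" space-after="6pt">
      <xsl:apply-templates/>
    </fo:block>
  </xsl:template>

The xsl:* elements only rewrite the tree; the fo:* elements only describe 
layout. Two unrelated jobs under one "XSL" name.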

>CSS is the only "true" stylesheet language because it "adds information
>orthogonally"; this is what "considered harmful" means in the article:
>XSLT is not orthogonal.
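
(To illustrate the orthogonality point with a trivial sketch: given

  <xhtml:p class="greeting">Hello World!</xhtml:p>

CSS styles it from the outside, leaving the document untouched:

  p.greeting { color: red; font-weight: bold; }

whereas to get the same effect with XSLT you have to rewrite the tree 
itself, e.g. wrap the text in new presentational elements.)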
>
>I proposed to the XSL WG that they change the names of the languages to
>
>  XTL - eXtensible Transformation Language
>  FO - Formatting Objects
>
>but they still think XSL is the new DSSSL, and changing would mean
>throwing away their past. Sure, they have the argument that XTL would
>become too complex if turned into a "general" transformation language.
>Well, as an argument, it's as weak as anything: people are already
>planning to use XSLT extensively in B2B to transform one schema into
>another, and styling done in XSLT (without the use of final CSS) is
>already considered bad practice.
>
>They simply don't want to admit they've been wrong since day one in
>fighting CSS instead of adopting it.
>
>This leads to the second part of the problem:
>
>2) FO cannot be styled with CSS.
>
>They made sure that something like this is either not possible or, at
>least, not recommended in the spec - unlike SVG (a much better effort in
>all senses), which makes CSS the very core of its styling part, defining
>semantics with the graphic elements and keeping the style at the CSS
>level.
>
>Why can't FO do the same? Why are we _forced_ to use XSLT to "transform"
>(note, not "style") something into FO?
>
>This is the key problem: tree transformation can be used for styling,
>but it's a bad practice. It should be avoided.

Could you explain why a bit more? Is it just because it outputs "pure FO"?
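
My guess at what you mean (a hedged sketch, element names invented): 
styling via transformation bakes the presentation into the transform, so

  <xsl:template match="warning">
    <!-- the look is hard-wired into the tree rewriting -->
    <div style="color: red; border: 1px solid red">
      <xsl:apply-templates/>
    </div>
  </xsl:template>

means changing the style requires editing the transform - whereas with 
CSS you would just swap a stylesheet file.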

> > However, I think the author of the article is missing the wider point:
> > Remember, it is very easy to write XML that is almost or completely
> > meaningless, either because it is based on a proprietary format or
> > because it is not designed to be easily parsed for useful meaning.
>
>The author is caught in the "S"-trap: XSL vs. CSS both define
>'stylesheets', there is clear overlap, which is better?
>
>There is 'no' overlap whatsoever: the XSL WG should try to enforce this
>instead of keeping on fighting the "style war" - and remove that damn
>"s" from their language names!!!!
>
> > There is a Plain English Campaign to stop the use of unnecessary
> > jargon by public officials, here in Britain. Perhaps, analogously,
> > someone should start a Semantic XML Campaign, to campaign for
> > semantically-rich uses of XML and semantic preservation over
> > networks? It's not a very inspiring subject, sure, but it's an
> > important one from a software engineering point of view.
>
>Careful, this is something different:

True, I chose a bit of a grating analogy, sorry.

>"unnecessary jargon" can be
>translated into "keep the semantics in the appropriate context"... or,
>more technically, don't send me a schema I can't understand, or that I
>can't translate into something I understand.
>
>A big vocabulary is a sort of transformation: what you call "jargon" is
>a schema that is not frequently used in your thinking, or that you might
>not know entirely.

And sometimes jargon consists of words which do not really add anything 
useful to the communication, and are just used to create a false impression 
of superiority!

>Such a campaign is almost equal to the "this page is valid HTML 4.0"
>campaign: it maximizes the visibility of using the appropriate schema
>for the required context.
>
> > The other thing is market demand driving greater semantics, of
> > course - and I think in terms of searching at least, sites will find
> > it very advantageous in terms of getting targeted hits, to use richer
> > markup in promoting their sites electronically (I'm not thinking of
> > spam, but search engine metatags etc.) - and then there's B2B, of
> > course.
>
>This is where, as we say in Italian, the "donkey falls" :) (no offense
>intended)
>
>If you think that having more "semantic" schemas will ease searching,
>you are not only wrong, you are also missing a lot of the W3C effort.
>
>The problem of XML (and SGML as well) is the 'babel syndrome': there
>will be tons of schemas, sure, lots of semantic content, but how do you
>search it if you don't know the schema?
>
>It's turning a language into a babel of strongly-typed dialects.
>
>Today, search engines know HTML and try to "estimate" heuristically the
>semantic meaning of a particular content, to rate it in a significant
>way: their success is based on the "quality" of such heuristics.
>
>People think: when the web is made of XML documents, searching will be
>much more "semantic".
>
>Wrong! Dead wrong!
>
>Let us suppose we have such a web (which will take decades to be
>created, if ever): you want to book your summer vacation, a trip to the
>island of Java, and you want to find out if there is a travel agency
>that is cheaper than the one down the street.
>
>What do you search for? How do you know what markup has been used to
>publish the page of that Java travel agency?
>
>Ok, let's guess XHTML... then what?
>
>Hmmm, the search engine accepts XPath queries, but how do you know what
>element has been used to mark up what you're looking for?
>
>It's clearly a dead end: it won't pass my father's test, so it would die
>out.

True - XPath is just too technical. Something more visual is needed for the 
general public - like combo boxes.
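
(To illustrate how unrealistic the XPath route is: even a toy query like

  //agency[destination = 'Java' and price < 500]

assumes the user already knows the element names "agency", "destination" 
and "price" - which is exactly what they can't know without knowing the 
schema. The names here are invented, which is rather the point.)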

>So, let's search for the textual content first: "Java Travel Agency"
>with the EN language (hoping the agency has xml:lang="en" text in their
>pages).
>
>The result is a list of schemas (not pages!) that contain that textual
>reference.
>
>  - programming language
>  - geographical information
>  - military operations
>  - travelling
>
>then you "refine" your visit thru it. (this has been used (and
>patented?) recently by a company called xyzsearch or something like
>that)
>
>But what if the list is something like
>
>  - XUIURL schema
>  - eBUKfj schema
>  - DDLT schema
>
>Now what? You iterate through them to find out - big deal!
>
>Sure, more semantic information means "potentially" better searches. But
>don't "assume" we'll have them: the road is long and bumpy, and very few
>people seem to understand that (luckily the W3C director surely does).

Makes sense. It's much more complicated than I at first thought, I'll grant 
you that.

But it can be tackled incrementally. There will always be information so 
obscure that it will not fit into a predefined schema for searching. But 
for more common queries (or high-value queries, such as searching for car 
dealers or real estate agents), there's a great incentive (on both the 
consumer and supplier sides) to achieve more accurate results (accuracy 
isn't the whole incentive, but it is part of it). We currently have

1. Dumb search engines
2a. General directories (yahoo.com and dmoz.org)
2b. Specialist directories (about.com, also dmoz.org again, as it has 
thousands of specialist volunteer editors)

Yahoo.com just can't keep up. I volunteer at dmoz.org (ODP), and 
obviously, while it does better, even the ODP is orders of magnitude away 
from being able to catalogue "the entire web". And google.com is often 
surprisingly accurate, but not 100% of the time - it's no AI!

Google uses link counts to "judge" relevance. Dmoz (ODP) harnesses the 
expert knowledge of thousands of volunteers from all over the web and the 
world to build the most comprehensive directory of the web. However, 
comparatively few people want to become an ODP editor, and even fewer are 
admitted - but every webmaster probably wants to make his/her site appear 
in search results - so the obvious next step is to expand site 
"self-indexing", using not just free-form META tags but semantically-rich 
schemas.
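
For instance (a purely hypothetical fragment - no such schema exists, and 
the names are invented), a travel agency might self-index with something 
like

  <directory:entry>
    <directory:category>travel-agency</directory:category>
    <directory:location>London</directory:location>
    <directory:speciality>Indonesia</directory:speciality>
  </directory:entry>

so a search engine could index the structure, rather than guessing from 
free-form keywords.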

The fact is that not everyone uses the web to search for really obscure 
information about, say, cryptozoology, or Sorokin's theories of Cultural 
Dynamics. Thousands of queries clump together in _very_ common areas: 
weather, news, sex, shopping. Carve these out and define some schemas, get 
wide agreement to use them (the hard bit!!), and you start to make the web 
more powerful in certain well-defined areas. Not all areas, but it's a 
start. Like I say, it's an incremental process.

Once the chicken-and-egg problem with XML is seriously addressed by several 
high-profile sites (and I won't try and predict when that will happen), and 
once browser and search engine support is much better, the first successful 
specialist search schemas should lead to a cascade effect - partly because 
of "me too" syndrome, partly for sound reasons.

And - as Stefano noted - rather than unordered lists of schemas in search 
results, schemas can be loosely mapped to more user-friendly category names 
at hierarchical web directories like dmoz.org. Now clearly the number of 
schemas to catalogue will be many orders of magnitude fewer than the number 
of individual resources - and this is a much more feasible challenge. With a 
mixture of human-edited schema directories, and heuristic-enhanced XML-aware 
search engines, you'll be able to narrow your search down to a reasonable 
semantic context first (e.g. travel agent schema), and then get highly 
relevant results. The best of both worlds!
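
(Concretely, a schema directory entry might be as simple as a mapping 
like

  Travel / Agencies  ->  http://example.org/schemas/travel-agency

- the URI is invented - so the user picks "Travel agencies" from a combo 
box and the engine silently translates it into a schema-scoped query.)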

I think we're singing from the same hymn-sheet, but I'm just a little more 
optimistic than Stefano.



