cocoon-dev mailing list archives

From Stefano Mazzocchi <stef...@apache.org>
Subject Re: comments !!
Date Sun, 27 Aug 2000 00:00:41 GMT
Robin Green wrote:
> 
> Stefano Mazzocchi <stefano@apache.org> wrote:
> >Robin Green wrote:
> > > I agree that losing semantic information is bad, and that this cannot
> > > really be legislated against.
> >
> >Please, try to define "semantic information". You can't. There is no
> >such thing as "global semantics": every semantics is associated with
> >the context it lives in. So, there is nothing _more_ semantic in
> >
> >  <docbook:para>Hello World!</docbook:para>
> >
> >than in
> >
> >  <svg:text>Hello World!</svg:text>
> >
> >than in
> >
> >  <xhtml:p>Hello World!</xhtml:p>
> >
> >than in
> >
> >  <fo:block>Hello World!</fo:block>
> >
> >just a different context of interpretation.
> 
> You're quite right of course - the context-relativeness of information, and
> the differences between the technical Shannon sense of information and the
> more everyday sense of "useful information" are two important things that
> tend to be obscured by (for example) many strong AI proponents in
> computing, philosophy and cognitive science, as Raymond Tallis explains
> excellently in his book "The Explicit Animal" - but that's getting far off
> the point! :-) But at least it shows I'm not alone in that.

Shannon was very concerned about the difference between the transfer of
"information" and the transfer of "knowledge". So is the web.

The first generation of the web (the one we use today) was concerned
about "information", just like Shannon started with information theory
and never got into knowledge theory.

The second generation of the web should be concerned about "knowledge",
which means: transfer the right information for the right context.

But if you think that XML (alone) can solve that, well, you are missing
much of the picture.

Anyway, for those of you more interested in these topics, my speech for
ApacheCON 2000 Europe in London is entitled "Toward a semantic web: a
look at XML from outer space" and will try to dig deep into these
fields to show what's real, what's serious and what's total hype in the
XML world.

(well, please, forgive the self-promotion spam :)
 
> Anyway, I didn't really mean some incoherent notion of "absolute loss of
> semantics", I just meant, loss of semantics relative to some desired
> use-case.

I understood that, don't worry, I just wanted to be more precise for
others who might not be as confident in these subjects as you are.
 
> >Look at XHTML: many use it as a "very simple semantic markup" if you
> >leave out the font="" align="" and blah blah attributes that define
> >semantics in the layout context (a.k.a. style). The Cocoon document DTD
> >enforces this, reusing the HTML tags where appropriate (no use in
> >creating new names for tags that work).
> >
> >But FO is _somewhat_ different from the other markup and gives a "sense
> >of incoherence" with the other W3C schemas.... I spent 18 months to find
> >out "what" is this incoherence and where it comes from and now I think I
> >got it.
> >
> >It's due to several things combined:
> >
> >1) wrong name: XSL stands for eXtensible Stylesheet Language.... but
> >neither FO nor XSLT has anything to do with style. Style is the process
> >of adding semantics for the layout context; it doesn't have anything to
> >do with tree transformation or defining elements that describe those
> >semantics.
> 
> And because XSLT and FO were originally one unified spec, XSL, people are
> easily confused.
> 
> >CSS is the only "true" stylesheet language because it "adds information
> >orthogonally"; this is what "considered harmful" means in the article:
> >XSLT is not orthogonal.
> >
> >I proposed the XSL WG to change the name of the languages to
> >
> >  XTL - eXtensible Transformation Language
> >  FO - Formatting Objects
> >
> >but they still think XSL is the new DSSSL and this would mean throwing
> >away their past. Sure, they have the argument that XTL would become too
> >complex if turned into a "general" transformation language. Well, as
> >arguments go, it's as weak as they come: people are already planning to
> >use XSLT extensively in B2B to transform one schema into another, and
> >styling done in XSLT (without the use of final CSS) is already
> >considered a bad practice.
> >
> >They simply don't want to admit they've been wrong since day one in
> >fighting CSS instead of adopting it.
> >
> >This leads to the second part of the problem:
> >
> >2) FO cannot be styled with CSS.
> >
> >They made sure something like this is not possible, or, at least, not
> >recommended in the spec, unlike SVG (a much better effort in all senses)
> >which makes CSS the very core of its styling part, defining semantics
> >with the graphic elements and keeping the style at the CSS level.
> >
> >Why can't FO do the same? Why are we _forced_ to use XSLT to "transform"
> >(note, not "style") something into FO?
> >
> >This is the key problem: tree transformation can be used for styling,
> >but it's a bad practice. It should be avoided.
> 
> Could you explain why a bit more? Is it just because it outputs "pure FO"?

Ok, short example

 <page>
  <para>hello world</para>
 </page>

can be transformed into

 <html>
  <body>
   <p align="center">hello world</p>
  </body>
 </html>

or

 <html>
  <head>
   <link rel="stylesheet" type="text/css" href="style.css"/>
  </head>
  <body>
   <p>hello world</p>
  </body>
 </html>

with a CSS stylesheet

p { text-align: center }
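
For the curious, here is a minimal sketch of the kind of XSLT that
could produce the second version (the element names come from the
example above; everything else is illustrative):

 <xsl:stylesheet version="1.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- wrap the page in an html skeleton and link
       the external CSS stylesheet -->
  <xsl:template match="page">
   <html>
    <head>
     <link rel="stylesheet" type="text/css" href="style.css"/>
    </head>
    <body>
     <xsl:apply-templates/>
    </body>
   </html>
  </xsl:template>

  <!-- map the source paragraph onto its html equivalent,
       leaving all styling to the CSS -->
  <xsl:template match="para">
   <p><xsl:apply-templates/></p>
  </xsl:template>

 </xsl:stylesheet>

Note how the transformation maps structure only: the style lives
entirely in style.css and can be fine-tuned without touching the XSLT.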

If you transform this into FO, you _MUST_ include all style as element
attributes, you can't simply use the FO as "layout structure" and apply
fine-tuning CSS later on... you always have to regenerate the FO with
the tree transformation.
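
To make this concrete, here is roughly what the same paragraph comes
out as in FO (a hand-written sketch with illustrative values, not the
output of any particular stylesheet):

 <fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format">
  <fo:layout-master-set>
   <fo:simple-page-master master-name="page">
    <fo:region-body/>
   </fo:simple-page-master>
  </fo:layout-master-set>
  <fo:page-sequence master-reference="page">
   <fo:flow flow-name="xsl-region-body">
    <!-- the alignment is hardwired into the result:
         change the style, regenerate everything -->
    <fo:block text-align="center">hello world</fo:block>
   </fo:flow>
  </fo:page-sequence>
 </fo:root>

There is no place to hang a stylesheet link: every presentational
decision is frozen into the attributes.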

What's the problem with this, you might think? Concerns are still
separated.

Sure, but if SVG was designed around CSS integration, why couldn't FO
be CSS-friendly as well?

> > > However, I think the author of the article is missing the wider point:
> > > Remember, it is very easy to write XML that is almost or completely
> > > meaningless, either because it is based on a proprietary format or
> > > because it is not designed to be easily parsed for useful meaning.
> >
> >The author is caught in the "S"-trap: XSL vs. CSS both define
> >'stylesheets', there seems to be a clear overlap, so which is better?
> >
> >There is 'no' overlap whatsoever: the XSL WG should try to enforce this
> >instead of keeping up the "style war", and remove that damn "s" from
> >their language names!!!!
> >
> > > There is a Plain English Campaign to stop the use of unnecessary
> > > jargon by public officials, here in Britain. Perhaps, analogously,
> > > someone should start a Semantic XML Campaign, to campaign for
> > > semantically-rich uses of XML and semantic preservation over networks?
> > > It's not a very inspiring subject, sure, but it's an important one
> > > from a software engineering point of view.
> >
> >Careful, this is something different:
> 
> True, I chose a bit of a grating analogy, sorry.
> 
> >"unnecessary jargon" can be
> >translated into "keep the semantics in the appropriate context"... or,
> >more technically, don't send me a schema I can't understand, or that I
> >can't translate into something I understand.
> >
> >A big vocabulary is sort of a transformation: what you call "jargon" is
> >a schema that is not frequently used by your thinking, or that you
> >might not know entirely.
> 
> And sometimes jargon consists of words which do not really add anything
> useful to the communication, and are just used to create a false impression
> of superiority!
> 
> >Such a campaign would be almost equal to the "this page is valid HTML
> >4.0" campaign: it maximizes the visibility of using the appropriate
> >schema for the required context.
> >
> > > The other thing is market demand driving greater semantics, of course
> > > - and I think in terms of searching at least, sites will find it very
> > > advantageous in terms of getting targeted hits, to use richer markup
> > > in promoting their sites electronically (I'm not thinking of spam, but
> > > search engine metatags etc.) - and then there's B2B, of course.
> >
> >This is where, as we say in Italian, the "donkey falls" :) (no offense
> >intended)
> >
> >If you think that having more "semantic" schemas will ease searching,
> >you are not only wrong, you are missing a lot of the W3C effort.
> >
> >The problem of XML (and SGML as well) is the 'babel syndrome': there
> >will be tons of schemas, sure, lots of semantic content, but how do you
> >search it if you don't know the schema?
> >
> >It's turning a language into a babel of strongly-typed dialects.
> >
> >Today, search engines know HTML and try to "estimate" heuristically the
> >semantic meaning of a particular piece of content, to rate it in a
> >significant way: their success is based on the "quality" of such
> >heuristics.
> >
> >People think: once the web is made of XML documents, searching will be
> >much more "semantic".
> >
> >Wrong! Dead wrong!
> >
> >Let us suppose we have such a web (which will take decades to be
> >created, if ever): you want to book your summer vacation, a trip to the
> >island of Java, and you want to find out whether there is a travel
> >agency that is cheaper than the one down the street.
> >
> >What do you search for? How do you know what markup has been used to
> >publish the page of that Java travel agency?
> >
> >Ok, let's guess XHTML... then what?
> >
> >Hmmm, the search engine accepts XPath queries, but how do you know what
> >element has been used to mark up what you're looking for?
> >
> >It's clearly a dead end: it won't pass my father's test, it would die
> >out.
> 
> True - XPath is just too technical. Something more visual is needed for the
> general public - like combo boxes.

Sure, an improvement over "insert your XPath here", but still, you
should not have to think about the schema used to mark up the
information you are looking for, no matter how easily this markup
information is presented to you via the UI.
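
To see why, consider what such a UI would have to generate under the
hood: something like the XPath below, where the "ta" namespace and the
element names are entirely made up here. Which is exactly the problem:
somebody still has to know that schema exists.

 //ta:agency[ta:destination = 'Java']/ta:offer[ta:price < 1000]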

> >So, let's search for the textual content first: "Java Travel Agency"
> >with the EN language (hoping the agency has xml:lang="en" text in their
> >pages).
> >
> >The result is a list of schemas (not pages!) that contain that textual
> >reference.
> >
> >  - programming language
> >  - geographical information
> >  - military operations
> >  - travelling
> >
> >then you "refine" your search through it. (This has been used (and
> >patented?) recently by a company called xyzsearch or something like
> >that.)
> >
> >But what if the list is something like
> >
> >  - XUIURL schema
> >  - eBUKfj schema
> >  - DDLT schema
> >
> >Now what? You iterate through them to find out. Big deal!
> >
> >Sure, more semantic information means "potentially" better searches.
> >But don't "assume" we'll have them: the road is long and bumpy, and
> >very few people seem to understand that (luckily the W3C director
> >surely does).
> 
> Makes sense. It's much more complicated than I at first thought, I'll grant
> you that.

Good, no problem is impossible to solve if you fully understand both the
potential and the complexities involved.
 
> But it can be tackled incrementally. 

I cannot agree more: you can't think of a semantic web taking over the
good old web in a short time and in one big step. It's silly to
forecast something like that.

> There will always be information so
> obscure that it will not fit into a predefined schema for searching. But for
> more common queries (or high value queries such as searching for car dealers
> or real estate agents), there's a great incentive (on both the consumer and
> supplier sides) to achieve more accurate results (not just more accurate,
> but that is part of it). 

I agree.

> We currently have
> 
> 1. Dumb search engines
> 2a. General directories (yahoo.com and dmoz.org)
> 2b. Specialist directories (about.com, also dmoz.org again, as it has
> thousands of specialist volunteer editors)
> 
> Yahoo.com just can't keep up. I volunteer at dmoz.org (ODP), and obviously,
> while it does better, even the ODP is orders of magnitude away from being
> able to catalogue "the entire web". And google.com is often surprisingly
> accurate, but not 100% of the time - it's no AI!

Google rocks because they care. Their heuristic is complex and very
smart, and they perform incredibly efficient yet brilliant data mining,
unlike grep-like engines such as Altavista.
 
> Google uses link counts to "judge" relevance. Dmoz (ODP) harnesses the
> expert knowledge of thousands of volunteers from all over the web and the
> world to build the most comprehensive directory of the web. However,
> comparatively few people want to become an ODP editor, and even fewer are
> admitted - but every webmaster probably wants to make his/her site appear in
> search results - so the obvious next step is to expand site "self-indexing",
> using not just free-form META tags but semantically-rich schemas.
> 
> The fact is that not everyone uses the web to search for really obscure
> information about, say, cryptozoology, or Sorokin's theories of Cultural
> Dynamics. Thousands of queries clump together in _very_ common areas:
> weather, news, sex, shopping. Carve these out and define some schemas, get
> wide agreement to use them (the hard bit!!), and you start to make the web
> more powerful in certain well-defined areas. Not all areas, but it's a
> start. Like I say, it's an incremental process.

Yep... still, the babel problem remains if you can't define such a
de-facto schema, and transformation loses semantic information... but
this is such a young technology... we just need time.
 
> Once the chicken-and-egg problem with XML is seriously addressed by several
> high-profile sites (and I won't try and predict when that will happen), and
> once browser and search engine support is much better, the first successful
> specialist search schemas should lead to a cascade effect - partly because
> of "me too" syndrome, partly for sound reasons.

This is why I want Cocoon to help with this: Cocoon will provide better
ways to build your "first generation" web site, as your managers demand
today. But it will silently provide you with the tools (semantic views,
local semantic indexing capabilities, semantic network estimation) that
will allow you to switch to the "second generation" web when there is a
requirement for it.

I know it will happen, and when it does we'll be there with the right
technology.

Then nothing will stop us from placing a cocoon in every web server of
this planet :)
 
> And - as Stefano noted - rather than unordered lists of schemas in search
> results, schemas can be loosely mapped to more user-friendly category names
> at hierarchical web directories like dmoz.org. 

Yep.

> Now clearly the number of
> schemas to catalogue will be many orders of magnitude fewer than the number
> of individual resources - and this is a much more feasible challenge. With a
> mixture of human-edited schema directories, and heuristic-enhanced XML-aware
> search engines, you'll be able to narrow your search down to a reasonable
> semantic context first (e.g. travel agent schema), and then get highly
> relevant results. The best of both worlds!

Agreed.
 
> I think we're singing from the same hymn-sheet, but I'm just a little more
> optimistic than Stefano.

I'm skeptical because I see very few people 'getting' the real problems
involved, and I'm scared by the complexity of each one of the
'incremental' steps that must happen to bootstrap such a semantic web.

But I have some cool ideas about this that I'll share only when the time
is right :)

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<stefano@apache.org>                             Friedrich Nietzsche
--------------------------------------------------------------------
 Missed us in Orlando? Make it up with ApacheCON Europe in London!
------------------------- http://ApacheCon.Com ---------------------


