cocoon-dev mailing list archives

From Stefano Mazzocchi <stef...@apache.org>
Subject [RT] i18n in Cocoon and language independent semantic contexts
Date Sun, 11 Jun 2000 13:20:27 GMT
The problems i18n poses are big, which is why both Java and XML have
Unicode support right in their core (a big advantage over almost all
other programming languages).

Cocoon = Java + XML, so we need to place i18n support right into our
core, or we'll be doomed by design limitations for the rest of Cocoon's
lifetime (and forced to do a cocoon3 to fix the design problems).

Let's see those problems:

1) internal messages: errors, logs and comments should all be driven by
the JVM locale. Normally this is done with Java ResourceBundles.

Is this enough? Should we create an XML version of those resource
bundles? Or is this following the golden-hammer antipattern of "do it
all with XML"?
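
For reference, here is a minimal sketch of the plain ResourceBundle
approach from point 1. The class names, keys and message texts are
hypothetical and are not taken from Cocoon; class-based bundles are
used only so the example needs no .properties files.

```java
import java.util.ListResourceBundle;
import java.util.Locale;
import java.util.ResourceBundle;

final class LocalizedMessages {

    // Hypothetical base bundle (English), used when no closer match exists.
    public static class Log extends ListResourceBundle {
        @Override
        protected Object[][] getContents() {
            return new Object[][] {
                {"resource.notFound", "Resource not found"},
                {"pipeline.started", "Pipeline started"},
            };
        }
    }

    // Hypothetical Italian bundle; keys it omits are inherited from Log.
    public static class Log_it extends ListResourceBundle {
        @Override
        protected Object[][] getContents() {
            return new Object[][] {
                {"resource.notFound", "Risorsa non trovata"},
            };
        }
    }

    // Looks up a message for the given locale, e.g. Locale.getDefault().
    static String message(String key, Locale locale) {
        ResourceBundle bundle =
                ResourceBundle.getBundle(Log.class.getName(), locale);
        return bundle.getString(key);
    }

    public static void main(String[] args) {
        // Driven by the JVM locale, as suggested above.
        System.out.println(message("resource.notFound", Locale.getDefault()));
        System.out.println(message("resource.notFound", Locale.ITALIAN));
    }
}
```

The lookup falls back from the requested locale to the base bundle, so
a partially translated bundle still yields a message for every key.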

2) URI space: good URIs don't change and are human readable. The sitemap
allows you to enforce the first (if you don't use extensions to indicate
your resources), and your URI-space design should enforce the second.

Be careful: something like "/news/today" is a perfectly designed URI for
a website and can stand for ages without needing to change. But it's not
human readable for non-English speakers. For Italian speakers, the
equivalent would be "/notizie/oggi".

This leads to something that was already expressed on the list: can the
sitemap enforce different views of the same URI space based on i18n
issues? What's the most manageable way to do this? Where does separation
of concerns come in here? What's the best way to scale such a thing?

And, most important, is something like this worth the effort? (I've
never seen translated URI spaces, is there a web site that does this?)
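
To make the question concrete, one conceivable approach (an assumption
for illustration, not an existing sitemap feature) is to translate each
localized path segment back to a canonical one before matching. The
class name and the dictionary below are hypothetical.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

final class UriViews {

    // Hypothetical dictionaries: language -> (localized segment -> canonical segment).
    private static final Map<String, Map<String, String>> VIEWS = Map.of(
            "it", Map.of("notizie", "news", "oggi", "today"));

    // Rewrites a localized URI into the canonical URI space, segment by segment.
    // Segments without a dictionary entry are left unchanged.
    static String canonicalize(String uri, String lang) {
        Map<String, String> dictionary = VIEWS.getOrDefault(lang, Map.of());
        return Arrays.stream(uri.split("/", -1))
                .map(segment -> dictionary.getOrDefault(segment, segment))
                .collect(Collectors.joining("/"));
    }

    public static void main(String[] args) {
        System.out.println(canonicalize("/notizie/oggi", "it"));
    }
}
```

This keeps a single canonical URI space for matching and treats each
translated space as a view on top of it, but it leaves open who
maintains the dictionaries and how name clashes between languages
would be resolved.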

3) schemas: this is something I've been concerned about for quite some
time and maybe some of you who were in the SGML world before can give
us advice. A schema has one natural language embedded in it.

 <page xml:lang="en">
  <title>Hello World!</title>
  <paragraph>
   <bold>Hello World!</bold>
  </paragraph>
 </page>

can be translated into

 <page xml:lang="it">
  <title>Ciao a tutti!</title>
  <paragraph>
   <bold>Ciao a tutti!</bold>
  </paragraph>
 </page>

but this _requires_ authors to understand English in order to understand
the markup. The real translation is

 <pagina xml:lang="it">
  <titolo>Ciao a tutti!</titolo>
  <paragrafo>
   <grassetto>Ciao a tutti!</grassetto>
  </paragrafo>
 </pagina>

which could easily pass my "father's test" (he doesn't speak English),
while the previous one would not.

Are those pages different? No, they are different views of the same
information.

[Note: OK, we made a very strong hypothesis: each natural language has
the same expressive range. Many could argue this is far from being
true. For example, there is no Italian equivalent for the English word
"privacy" and there is no English equivalent for the word "pizza". Also,
everybody knows that many jokes lose their humor when translated
(Italians use policemen like Americans use blondes). Many Italian
dialects contain expressions that would require pages of Italian to
convey the same feeling to the listener (Italian dialects are mostly
oral-only languages). Japanese embeds several language constructs to
indicate differences in social position, and so on.]

But it can be reasonably assumed that schemas contain the same amount of
information and expose it through different views: natural languages
act as "knowledge representation styles" for abstract structured
relationships between different semantic areas.

So, let us suppose there exists one schema and the reference schema is
written in English.

It should be possible to introduce a view of this schema by allowing
semantic inheritance of the elements.

Let's take an example:

 <page:page xml:lang="en" xmlns:page="urn:page" xmlns:style="urn:style">
  <page:title>Hello World!</page:title>
  <page:paragraph>
   <style:bold>Hello World!</style:bold>
  </page:paragraph>
 </page:page>

and we want to translate this into HTML, so we need page->html and
style->html (supposing page doesn't contain the equivalent of "style"
semantic information).

Now we want this to be readable for Italians who don't know English, but
we want to keep the same stylesheets. How could we achieve that?

I have a solution that requires (unfortunately) patching both the
namespace and XMLSchema specifications:

 <pagina:pagina xml:lang="it" 
    xmlns:pagina="urn:page" xmlns:pagina:lang="it" 
    xmlns:stile="urn:style" xmlns:stile:lang="it">
  <pagina:titolo>Ciao a tutti!</pagina:titolo>
  <pagina:paragrafo>
   <stile:grassetto>Ciao a tutti!</stile:grassetto>
  </pagina:paragrafo>
 </pagina:pagina>

where the XMLSchema should indicate that

 <pagina> -(equals)-> <page>
 <titolo> -(equals)-> <title>
 <paragrafo> -(equals)-> <paragraph>

and all of these create different natural-language views of the same
namespace (urn:page), while

 <grassetto> -(equals)-> <bold>

for the namespace (urn:style).

Then it becomes possible for XML parsers to map all those elements into
"language-neutral semantic equivalence classes", where XPaths can access
them independently of their natural-language form.

For example, the XPath "/page/title" should return "Ciao a tutti!" if
applied to the Italian version of the page and "Hello World!" if applied
to the English version (version indicated with xml:lang), but should be
transparent to the language used to present the schema elements.
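
The behavior described above can be approximated today, without
touching any specification, by renaming localized elements to their
reference names before evaluating the XPath. The sketch below does
that with the standard JAXP DOM and XPath APIs. The equivalence table
and class name are hypothetical, and the example uses the
un-namespaced Italian page because the proposed xmlns:pagina:lang
syntax is not accepted by current namespace-aware parsers.

```java
import java.io.StringReader;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class NeutralXPath {

    // Hypothetical equivalence table: localized element name -> reference name.
    static final Map<String, String> ITALIAN_VIEW = Map.of(
            "pagina", "page",
            "titolo", "title",
            "paragrafo", "paragraph",
            "grassetto", "bold");

    // Renames every element that has an entry in the table, children first.
    static void toReferenceNames(Document doc, Node node, Map<String, String> view) {
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // saved first: renameNode may replace the node
            toReferenceNames(doc, child, view);
            child = next;
        }
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            String reference = view.get(node.getNodeName());
            if (reference != null) {
                doc.renameNode(node, null, reference);
            }
        }
    }

    // Parses the XML, maps it onto the reference names, then applies the XPath.
    static String evaluate(String xml, String xpath, Map<String, String> view) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            toReferenceNames(doc, doc, view);
            return XPathFactory.newInstance().newXPath().evaluate(xpath, doc);
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate " + xpath, e);
        }
    }

    public static void main(String[] args) {
        String italian = "<pagina xml:lang=\"it\"><titolo>Ciao a tutti!</titolo>"
                + "<paragrafo><grassetto>Ciao a tutti!</grassetto></paragrafo></pagina>";
        System.out.println(evaluate(italian, "/page/title", ITALIAN_VIEW));
    }
}
```

This is only a preprocessing workaround: the equivalence lives in
application code rather than in the schema, which is exactly the
limitation the proposal is trying to remove.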

This allows another level of separation of concerns, where the person
who creates the XSLT is an English designer and the person who writes
the XML document is an Italian journalist. (Yes, the eurofootball.com
web site triggered many of these thoughts.)

Today, XPath and XMLSchema create contracts on the "strings of unicode
chars" used to express semantic ideas. 

This is, IMO, a big limitation since what is "linked" is not the element
name but the semantic context it represents.

The proposal above would allow the creation of equivalence classes for
XML schemas, each one representing a different view of the same
language-independent semantic context they all share.

Where would something like this be useful in Cocoon?

For all schemas used to generate the resources (user level) and for
Cocoon's own schemas (mainly the sitemap and configurations).

For example, non-English speakers could install and maintain Cocoon's
sitemaps, or sitemaps with localized schemas could be given to people
with different language skills.

Being completely "orthogonal" to the schema (this is why it needs to
patch both namespace and schema capabilities), this would have a
positive impact on every use of XML.

                         ------------------ o ------------------

OK, but what can we do inside Cocoon without extending the XML
specifications in a proprietary way?

Also, how can we simplify the sitemap evolution without compromising the
rest of the system?

I think a possible solution is sitemap pluggability and compilation.

You could think of the sitemap as a big XSP taglib that is responsible
for directly driving the execution of the resource creation pipelines.

It would also increase performance, since matching could be optimized
and what not.

It would also allow different sitemap schemas to be developed. In
theory, you could create your own sitemap schema.
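
As a rough illustration of what "compilation" could buy for matching,
the sketch below turns wildcard URI patterns into regular expressions
once, up front, instead of interpreting them on every request. The
class, the pattern syntax and the pipeline names are all hypothetical;
this is not Cocoon's sitemap API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class CompiledSitemap {

    // One compiled rule: a precompiled matcher plus the pipeline it selects.
    private static final class Rule {
        final Pattern pattern;
        final String pipeline;

        Rule(Pattern pattern, String pipeline) {
            this.pattern = pattern;
            this.pipeline = pipeline;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    // "Compilation" step: each '*' becomes "any run of characters except '/'",
    // and every other character is matched literally.
    CompiledSitemap match(String wildcard, String pipeline) {
        String regex = Arrays.stream(wildcard.split("\\*", -1))
                .map(Pattern::quote)
                .collect(Collectors.joining("[^/]*"));
        rules.add(new Rule(Pattern.compile(regex), pipeline));
        return this;
    }

    // Request time: return the pipeline of the first rule that matches.
    Optional<String> resolve(String uri) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(uri).matches()) {
                return Optional.of(rule.pipeline);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        CompiledSitemap sitemap = new CompiledSitemap()
                .match("news/*", "news-pipeline")
                .match("*", "default-pipeline");
        System.out.println(sitemap.resolve("news/today"));
    }
}
```

Because the compiled form is independent of how the rules were
written, different sitemap schemas (including localized ones) could in
principle compile down to the same structure.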

Well, this collection of RTs is admittedly wild.

Digest with caution, but think about it extensively, since I know a lot
of FS hides between the lines.

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<stefano@apache.org>                             Friedrich Nietzsche
--------------------------------------------------------------------
 Missed us in Orlando? Make it up with ApacheCON Europe in London!
------------------------- http://ApacheCon.Com ---------------------

