cocoon-dev mailing list archives

From Mark Washeim <esa...@canuck.com>
Subject Re: [RT] i18n in Cocoon and language independent semantic contexts
Date Mon, 12 Jun 2000 11:06:14 GMT
on 11/6/00 3:20 pm, Stefano Mazzocchi at stefano@apache.org wrote:

> The problems i18n poses are big and it's the reason why both Java and
> XML have Unicode support right from their core (a big advantage over
> almost all other programming languages).
> 
> Cocoon = Java + XML, so this means we need to place i18n support right
> into our core, or we'll be doomed by design limitations for the rest of
> its lifetime (and force us to do a cocoon3 to fix design problems)
> 
> Let's see those problems:
> 
> 1) internal messages: errors, logs, comments all should be driven by the
> JVM locale. Normally this is performed with Java ResourceBundles.
> 
> Is this enough? Should we create an XML version of those resource
> bundles? Is this following the golden-hammer antipattern of "do it
> all with XML"?

It does appear that, as we build up a set of tools for managing documents,
any that are directed at a 'reader' may as well be in xml. Of course, we
have to have decent editors :)
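To make that concrete, here is a minimal sketch of what an xml message
catalogue with ResourceBundle-style fallback might look like. The catalogue
layout, the key name and the lookup() function are all assumptions of mine
for illustration, not anything Cocoon provides; the fallback order (full
locale, then bare language, then the untagged default) only imitates what
ResourceBundle does.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical catalogue format: one <message> per key, one <text> per locale.
# A <text> without xml:lang is the default used when nothing else matches.
CATALOG = ET.fromstring("""
<messages>
  <message key="file.not.found">
    <text>File not found: {0}</text>
    <text xml:lang="it">File non trovato: {0}</text>
  </message>
</messages>
""")

def lookup(catalog, key, locale, *args):
    """Return the message for `key`, trying locale, then language, then default."""
    candidates = [locale, locale.split("_")[0], None]
    for message in catalog.iter("message"):
        if message.get("key") != key:
            continue
        texts = {t.get(XML_LANG): t.text for t in message.findall("text")}
        for candidate in candidates:
            if candidate in texts:
                return texts[candidate].format(*args)
    raise KeyError(key)
```

With this, lookup(CATALOG, "file.not.found", "it_IT", "hr.xml") falls back
from it_IT to it, and any locale without a translation gets the default text.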

> 
> 2) uri space: good URIs don't change and are human readable. The sitemap
> allows you to enforce the first (if you don't use extensions to indicate
> your resources), and your URI-space design should enforce the second
> one.
> 
> Be careful, something like "/news/today" is a perfectly designed URI for
> a website and can stand ages without requiring to change. But it's  not
> human readable by non-english speakers. So it would be the italian
> equivalent "/notizie/oggi".
> 
> This leads to something that was already expressed on the list: can the
> sitemap allow to enforce different views of the same URI space based on
> i18n issues? What's the best manageable way to do this? Where does
> separation of concerns accounts here? What's the best way to scale such
> a thing?


We're working on 2 sites currently where this has been a fundamental issue...

currently, we have (for file based documents):

/root/hr/hr.xsd (schema) hr.xsl

/root/hr/se_SV/hr.xml
/root/hr/en_US/hr.xml

/root/pr/pr.xsd, pr.xsl
/root/pr/se_SV/pr.xml, instanceXXX.xml, instanceXXX.xml
/root/pr/en_US/pr.xml, instanceXXX.xml, instanceXXX.xml

in both cases, the xsd schema file is used to instantiate an editor... the
instantiated editor in turn reads the exemplar xml (for the sake of
instantiating with reasonable values for the document maintainer).

The reader's (browser's) view is mediated in much the same way as the site
map proposes. In this case, with the addition that index.html => index.xml
at the web server.

All index.xml requests (using the cocoon configs) are responded to by a
custom producer. The producer is file system based. It presents views of xml
documents in the file system. It may be passed a series of filters (for
instance, document must have a TITLE element in order to be displayed) and
generates an xml document representing that sub-set of files.

A request for /sverige/nyheter/(index.xml implicitly) is mapped (using a
site-map, of course :) ) only in so far as the file system producer creates
a view of documents that are available as per the map's base. Namely,
/root/pr/se_SV/ <=> /sverige/nyheter/
index.xml will contain the list of instanceXXX.xml available in
/root/pr/se_SV/ . . .
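
Roughly, the producer does something like the following. This is only a
sketch of the idea, not our actual code: URI_MAP, build_index() and the
<index>/<document> element names are made up for the example, and the
"/sweden/news/" entry is a hypothetical english counterpart. The one filter
shown is the "must have a TITLE element" case mentioned above.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Hypothetical map from public URI prefixes to directories below the root,
# mirroring "/root/pr/se_SV/ <=> /sverige/nyheter/".
URI_MAP = {
    "/sverige/nyheter/": "pr/se_SV",
    "/sweden/news/": "pr/en_US",  # assumed english equivalent
}

def has_element(path, tag):
    """Filter: True if the file is well-formed xml containing <tag>."""
    try:
        root = ET.parse(path).getroot()
    except ET.ParseError:
        return False
    return next(root.iter(tag), None) is not None

def build_index(root_dir, uri, required_tag="TITLE"):
    """Return an <index> element listing the documents visible at `uri`."""
    directory = Path(root_dir) / URI_MAP[uri]
    index = ET.Element("index", {"uri": uri})
    for path in sorted(directory.glob("*.xml")):
        if has_element(path, required_tag):
            ET.SubElement(index, "document", {"href": uri + path.name})
    return index
```

The point being that the public uri and the directory layout only meet in
the map, so either can change (or be translated) without touching the other.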

We're doing something like this in the main to keep abreast of the cocoon
architectural changes. It seems to me this is not such a big problem, and
that both:
1. human readability and
2. immutability hold.

I think. :) We're in rather more of a hurry and, in fact, have two different
forms of site map. Necessity being the mother of invention, we have a
plethora of inventions :) But, we're longing for the day when cocoon 2 is
there . . .
(our second map has the following:

<PAGE REQUEST="login.xml" XSL="ff.xsl" SERVLET="ffLogin">
    <CASE REDIRECT="select_game.xml"
          PARAMS="MEMB_XML=file:///path/en_US/ff/registered_panel.xml">
        <RULE PARAM="UID_NO" OP="NULL_EMPTY" VALUE="FALSE"/>
        <RULE PARAM="UID_NO" OP="GT" VALUE="0"/>
    </CASE>
    <CASE REDIRECT="login.xml"
          PARAMS="MEMB_XML=file:///path/en_US/ff/register_panel.xml">
        <GROUP TYPE="OR">
            <RULE PARAM="UID_NO" OP="NULL_EMPTY" VALUE="TRUE"/>
            <RULE PARAM="UID_NO" OP="LTEQ" VALUE="0"/>
        </GROUP>
    </CASE>
</PAGE>

) YIKES :)
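
For anyone puzzling over how such a map is read, a sketch of one plausible
evaluation follows. The semantics here are my assumptions for the example:
rules directly inside a CASE are ANDed, a GROUP TYPE="OR" is ORed, the first
matching CASE wins, and NULL_EMPTY compares "is the parameter missing or
empty" against VALUE. The PARAMS attributes are left out for brevity, and
resolve() and friends are illustrative names only.

```python
import xml.etree.ElementTree as ET

PAGE = ET.fromstring("""
<PAGE REQUEST="login.xml" XSL="ff.xsl" SERVLET="ffLogin">
  <CASE REDIRECT="select_game.xml">
    <RULE PARAM="UID_NO" OP="NULL_EMPTY" VALUE="FALSE"/>
    <RULE PARAM="UID_NO" OP="GT" VALUE="0"/>
  </CASE>
  <CASE REDIRECT="login.xml">
    <GROUP TYPE="OR">
      <RULE PARAM="UID_NO" OP="NULL_EMPTY" VALUE="TRUE"/>
      <RULE PARAM="UID_NO" OP="LTEQ" VALUE="0"/>
    </GROUP>
  </CASE>
</PAGE>
""")

def to_number(text):
    """Parse a request parameter as a number, or return None."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None

def eval_rule(rule, params):
    value = params.get(rule.get("PARAM"))
    op, expected = rule.get("OP"), rule.get("VALUE")
    if op == "NULL_EMPTY":
        is_empty = value is None or value == ""
        return is_empty == (expected == "TRUE")
    left, right = to_number(value), to_number(expected)
    if left is None or right is None:
        return False  # non-numeric input never satisfies a comparison
    if op == "GT":
        return left > right
    if op == "LTEQ":
        return left <= right
    raise ValueError("unsupported OP: %s" % op)

def eval_node(node, params):
    """RULE: evaluate it. GROUP TYPE="OR": any child. Anything else: all children."""
    if node.tag == "RULE":
        return eval_rule(node, params)
    results = (eval_node(child, params) for child in node)
    if node.tag == "GROUP" and node.get("TYPE") == "OR":
        return any(results)
    return all(results)

def resolve(page, params):
    """Return the REDIRECT of the first CASE whose rules hold, else None."""
    for case in page.findall("CASE"):
        if eval_node(case, params):
            return case.get("REDIRECT")
    return None
```

So resolve(PAGE, {"UID_NO": "42"}) sends a logged-in member on to
select_game.xml, while a missing or non-positive UID_NO lands back on
login.xml.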

 
> And, most important, is something like this worth the effort? (I've
> never seen translated URI spaces, is there a web site that does this?)

In the main, we don't have any choice: when working with dispersed marketing
departments (18 countries) for a global organisation, we have to accommodate
the uris (and domain names, for that matter; luckily, we control the
hosting . . .)
 


> 3) schemas: this is something I've been concerned about for quite some
> time and maybe some of you who were into the SGML world before can give
> us advices. Schema has one embedded natural language.
> 
> <page xml:lang="it">
> <title>Hello World!</title>
> <paragraph>
> <bold>Hello World!</bold>
> </paragraph>
> </page>
> 
> can be translated into
> 
> <page xml:lang="it">
> <title>Ciao a tutti!</title>
> <paragraph>
> <bold>Ciao a tutti!</bold>
> </paragraph>
> </page>
> 
> but this _requires_ authors to understand english to understand the
> markup. The real translation is

This assumes something I believe to be false. Namely, that document authors
edit plain text mark-up. They don't in most cases. They use forms interfaces
or wysiwyg interfaces of some kind, but, more below . . .


> <pagina xml:lang="it">
> <titolo>Ciao a tutti!</titolo>
> <paragrafo>
> <grassetto>Ciao a tutti!</grassetto>
> </paragrafo>
> </pagina>
> 
> which could easily pass my "father's test" (he doesn't speak english),
> while the previous one would not.
> 
> Are those pages different? No, they are different views of the same
> information.

But they impose, by virtue of translating structure unnecessarily, an undue
burden of maintenance. In fact, an insufferable one, as far as I'm
concerned.

Not that I don't empathise with the reader of a dtd who doesn't grok the
language. I know this problem well. I've been reading the sgml and derived
xml of two data providers (and working directly with rdbms from the same
companies). One of the companies is Swedish, the other Dutch. I speak
english, german and french. Sigh. Reading Entity declarations is an ODD way
to learn a language.

Needless to say, I had to obtain some help in getting the meaning of
elements right. Well, the DTDs in question were not expressive enough where
localization is concerned. XML schema, however, IS! More, below . . .
....

> [Note: Ok, we made a very strong hypothesis: each natural language has
> the same expressivity range. Many could argue this is far from being
> true. For example, there is no italian equivalent for the english word
> "privacy" and there is no english equivalent for the word "pizza". Also,
> everybody knows that many jokes lose their funny meaning if translated
> (italians use policemen like americans use blondes). Many italian
> dialects contain expressions that would require pages of italian to express
> the same feeling to the listener (italian dialects are mostly oral-only
> languages), Japanese embeds several language constructs to indicate
> difference of social position and so on.]
> 
> But it can be reasonably assumed that schemas contain the same amount of
> information and expose themselves with different views. Natural
> languages as "knowledge representation styles" of abstract structured
> relationship between different semantic areas.
> 
> So, let us suppose there exists one schema and the reference schema is
> written in english.

Your example below is NOT of the schema (namely the abstract which could as
well be expressed in any language) but of the instance of that schema. I
mean, with reference to the w3c's specification for
a. an xml document
b. an xml schema defining and constraining said document

> It should be possible to introduce a view of this schema by allowing
> semantic inheritance of the elements.
> 
> Let's make an example:
> 
> <page:page xml:lang="en" xmlns:page="urn:page" xmlns:style="urn:style">
> <page:title>Hello World!</page:title>
> <page:paragraph>
> <style:bold>Hello World!</style:bold>
> </page:paragraph>
> </page:page>
> 
> and we want to translate this into HTML so we need page->html and
> markup->html (supposing page doesn't contain the equivalent of "style"
> semantic information)
> 
> Now we want this to be readable for italians that don't know english, but
> want to keep the same stylesheets. How could we achieve that?
> 
> I have a solution that requires (unfortunately) patching both the
> namespace and XMLSchema specifications:
> 
> <pagina:pagina xml:lang="it"
> xmlns:pagina="urn:page" xmlns:pagina:lang="it"
> xmlns:stile="urn:style" xmlns:stile:lang="it">
> <pagina:titolo>Ciao a tutti!</pagina:titolo>
> <pagina:paragrafo>
> <stile:grassetto>Ciao a tutti!</stile:grassetto>
> </pagina:paragrafo>
> </pagina:pagina>
> 
> where the XMLSchema should indicate that
> 
> <pagina> -(equals)-> <page>
> <titolo> -(equals)-> <title>
> <paragrafo> -(equals)-> <paragraph>
> 
> and all create different natural languages views of the same namespace
> (urn:page) while
> 
> <grassetto> -(equals)-> <bold>
> 
> for the namespace (urn:style).

Now, the maintenance and administration of the document AND the document
type depend on as many as THREE people! The two document editors in their
respective languages and the person responsible for the schema (xml schema)
used to validate both types... I have a bad feeling about this . . .
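
For what it's worth, the equivalence can be approximated without patching
either spec, by normalising element names before the stylesheet sees them.
A minimal sketch, assuming a hand-maintained alias table (ALIASES,
normalise() and title_of() are names I've invented for the example); note
that the table is exactly the extra artifact that third person ends up
maintaining.

```python
import xml.etree.ElementTree as ET

# Hypothetical alias table: (namespace uri, localized name) -> reference name.
ALIASES = {
    ("urn:page", "pagina"): "page",
    ("urn:page", "titolo"): "title",
    ("urn:page", "paragrafo"): "paragraph",
    ("urn:style", "grassetto"): "bold",
}

NS = {"page": "urn:page", "style": "urn:style"}

ITALIAN = """
<pagina:pagina xml:lang="it" xmlns:pagina="urn:page" xmlns:stile="urn:style">
  <pagina:titolo>Ciao a tutti!</pagina:titolo>
  <pagina:paragrafo>
    <stile:grassetto>Ciao a tutti!</stile:grassetto>
  </pagina:paragrafo>
</pagina:pagina>
"""

ENGLISH = """
<page:page xml:lang="en" xmlns:page="urn:page" xmlns:style="urn:style">
  <page:title>Hello World!</page:title>
  <page:paragraph>
    <style:bold>Hello World!</style:bold>
  </page:paragraph>
</page:page>
"""

def normalise(root):
    """Rewrite localized element names to their reference names, in place."""
    for element in root.iter():
        if element.tag.startswith("{"):
            uri, local = element.tag[1:].split("}", 1)
            element.tag = "{%s}%s" % (uri, ALIASES.get((uri, local), local))
    return root

def title_of(xml_text):
    """Evaluate the same path against either language variant."""
    root = normalise(ET.fromstring(xml_text))
    return root.findtext("page:title", namespaces=NS)
```

After normalise(), one path expression serves both variants: title_of()
gives "Ciao a tutti!" for the italian page and "Hello World!" for the
english one.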


> Then, it can be possible for XML parsers to map all those elements in
> "language-neutral semantic equivalent classes" where XPaths can access
> them independently of their natural language form.
> 
> For example, the XPath "/page/title" should return "Ciao a Tutti!" if
> applied to the italian version of the page and "Hello World!" if applied
> to the english version (version indicated with xml:lang), but should be
> transparent on the language used to present the schema elements.
> 
> This allows another level of separation of concern where who creates the
> XSLT is a english designer and who writes the XML document is an italian
> journalist. (yes, the eurofootball.com web site triggered many of these
> thoughts)
> 
> Today, XPath and XMLSchema create contracts on the "strings of unicode
> chars" used to express semantic ideas.
> 
> This is, IMO, a big limitation since what is "linked" is not the element
> name but the semantic context it represents.
> 
> This would allow the creation of classes of equivalence for XML schemas,
> each one representing a different view of the same language independent
> semantic context they all share.

Ok, in principle, it's a nice vision. In practice, I doubt it's supportable.
The journalist will never edit xml directly, and if they did, would
constantly break the application. Hence, you create interfaces for them.
Hence, the semantic context is protected . . . where the interface itself is
concerned . . . below . . .


> Where would something like this be useful in Cocoon?
> 
> For all schemas used to generate the resources (user level) and for
> Cocoon's own schemas (mainly the sitemap and configurations).
> 
> For example, non-english-speakers could install and maintain Cocoon's
> sitemaps or, sitemaps with localized schemas can be given to people with
> different language skills.
> 
> Being completely "orthogonal" on the schema (this is why it needs to
> patch both namespaces and schema capabilities), this would positively
> impact on every XML usage.


Ok, I think you may be inventing where no invention is called for.

We're using schema annotations to provide the locale specific 'translation'
of the structure (both machine and human parts) to alleviate this problem.
That is, we maintain the semantic context, as you put it. Of course, we are
taking risks in using schema, but, what the hell . . .

The structure is usually (not always) in english, but is annotated. There's
no other way that doesn't produce more labour and confusion . . .

The point I'm making, below, is simple. XML schema is already expressive
enough to yield all that you require. The real problem is that people need
to be trained to use it. We're building applications that will use schema
to make the editor easy to use (<appinfo> for labels), so, that should keep
the ordinary editor in the clear. It's the person responsible for the schema
in the first place that may be a problem.... but, an example . . .

Part of a schema which is used to:
1. instantiate an editor
2. constrain the validity of the document....

<xsd:annotation name="JOBINFORMATIONTYPE">
  <documentation xml:lang="en_US">
    <name="Job Information"/>
  </documentation>
<SNIP reason="sake of brevity"/>
  <appinfo xml:lang="en_US">
  <label="Job Information"/>
  </appinfo>
</xsd:annotation>

<xsd:complexType name="JOBINFORMATIONTYPE" >
  <xsd:element name="JOBTITLE"    type="xsd:string"      />
  <xsd:element name="LOCATION"    type="xsd:string"      />
  <xsd:element name="DEPARTMENT"  type="DEPARTMENTTYPE"  />
  <xsd:element name="DESCRIPTION" type="DESCRIPTIONTYPE" />
  <xsd:element name="CONTACTLIST" type="CONTACTLISTTYPE" />
  <xsd:element name="REFNUMBER"   type="xsd:integer"     />
  <xsd:element name="HOWTOAPPLY"  type="xsd:string"      />
  <xsd:element name="CONTACT"     type="CONTACTTYPE"     />
  <xsd:element name="CLOSINGDATE" type="CLOSINGDATETYPE" />
</xsd:complexType>

and the much less happy making:

<xsd:annotation name="DEPARTMENTTYPE">
  <documentation xml:lang="en">
    <name="Department Type"/>
    <values>
       <value> Development</value>
       <value> Finance </value>
       <value> Marketing </value>
       <value> Procurement </value>
       <value> Production </value>
       <value> Other </value>
    </values>
  </documentation>
  <documentation xml:lang="de_DE">
    <name="Department Type"/>
    <values>
       <value> Entwicklung </value>
       <value> Finanzen </value>
       <value> Marketing </value>
       <value> Beschaffung </value>
       <value> Produktion </value>
       <value> Anderes </value>
    </values>
  </documentation>
<SNIP reason="sake of brevity"/>
  <appinfo xml:lang="en_US">
  <label="Department Type"/>
    <values>
       <value> Development</value>
       <value> Finance </value>
       <value> Marketing </value>
       <value> Procurement </value>
       <value> Production </value>
       <value> Other </value>
    </values>
  </appinfo>
  <appinfo xml:lang="en_GB">
  <label="Department"/>
    <values>
       <value> Development</value>
       <value> Finance </value>
       <value> Marketing </value>
       <value> Procurement </value>
       <value> Production </value>
       <value> Other </value>
    </values>
  </appinfo>
  <appinfo xml:lang="de_DE">
  <label="Abteilung"/>
    <values>
       <value> Entwicklung </value>
       <value> Finanzen </value>
       <value> Marketing </value>
       <value> Beschaffung </value>
       <value> Produktion </value>
       <value> Anderes </value>
    </values>
  </appinfo>
<SNIP reason="sake of brevity"/>
</xsd:annotation>

<xsd:simpleType name="DEPARTMENTTYPE" base="xsd:string" >
 <xsd:enumeration value="Development" />
 <xsd:enumeration value="Finance" />
 <xsd:enumeration value="Marketing" />
 <xsd:enumeration value="Procurement" />
 <xsd:enumeration value="Production" />
 <xsd:enumeration value="Other" />
</xsd:simpleType>
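
To show how an editor might consume this, here is a minimal sketch of
building a localized form field from the appinfo entries. It is not our
application code: I've assumed a simplified, well-formed variant of the
annotation (label as element text, no xsd prefix), a positional pairing
between the enumeration and the listed values, and an en_US fallback;
form_field() is an invented name.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Simplified, well-formed stand-in for the DEPARTMENTTYPE annotation.
ANNOTATION = ET.fromstring("""
<annotation name="DEPARTMENTTYPE">
  <appinfo xml:lang="en_US">
    <label>Department Type</label>
    <values>
      <value> Development</value> <value> Finance </value>
      <value> Marketing </value> <value> Procurement </value>
      <value> Production </value> <value> Other </value>
    </values>
  </appinfo>
  <appinfo xml:lang="de_DE">
    <label>Abteilung</label>
    <values>
      <value> Entwicklung </value> <value> Finanzen </value>
      <value> Marketing </value> <value> Beschaffung </value>
      <value> Produktion </value> <value> Anderes </value>
    </values>
  </appinfo>
</annotation>
""")

# The values the schema's enumeration actually allows in the document.
ENUMERATION = ["Development", "Finance", "Marketing",
               "Procurement", "Production", "Other"]

def form_field(annotation, enumeration, locale, fallback="en_US"):
    """Return the label and (stored value, displayed value) pairs for a locale."""
    by_locale = {a.get(XML_LANG): a for a in annotation.findall("appinfo")}
    info = by_locale.get(locale)
    if info is None:
        info = by_locale[fallback]
    shown = [value.text.strip() for value in info.iter("value")]
    return {
        "label": info.findtext("label"),
        "options": list(zip(enumeration, shown)),
    }
```

The german editor sees "Abteilung" and "Entwicklung", but what gets written
into the document is still the enumerated "Development", so validation and
the stylesheets are untouched.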


Ok. So we lost your father. We also lost most of the employees of the
company in question. Sigh. But, while the above schema is getting verbose,
one CAN decipher it much more easily than was the case with the DTDs I was
referring to earlier. Document editors need never decipher it at all in
our context, but I believe that's what applications are for...

I understand you're trying to work at the level of the element tag itself.
However, I don't think this is an issue. Namely, if the application being
developed is initiated in Italy, where the production facilities are staffed
by Italians, it's very likely that the schema and documents will be marked
up in Italian (as in the case of the sgml I've been reading in Swedish). As
long as they provide annotations, as need be, there really isn't a problem.
If I need to develop XSL, there will be a reference... If all I get is the
XML, of course, I concede your point. But, then, I also can't validate their
documents, nor is there any 'reasonable' way to create an editor for those
documents. So, they fall into the domain of the 'unregulated'. Or, if I'm
lucky, literature :) In the latter case, I'll haul out my dictionary :)

...

In my experience using columns from dbs, it's the same story. I just need a
map. I don't have a problem using the column names as they are, and don't
see a justification for translation that isn't outweighed by the maintenance
cost. While I'm not fond of working under pressure to develop apps that use
SQL statements in which I'm obliged to decipher Dutch, I'll live with it, if
only I get decent documentation.

...

When it comes to the vast majority of documents, it's arguable that their
VALUE is not so great as to justify translating their structure! Facilitating
the translation of their content, on the other hand, is our responsibility.

I really don't believe document editors are going to be plain text editors
for the ordinary users; rather, something akin to form editors . . . and
that brings me back to the combination of:
<appinfo> to facilitate, in our case, instantiating a localized interface
<documentation> to facilitate maintenance of the schema itself . . . by
whomever... (which is true of document management in both cases,
eurofootball.com and in several projects for saab automobile)


My immediate feedback . . . back to eurofootball . . . have to bring up the
full cocoon version, damn it!


As always, thanks for your thoughts.



> ------------------ o ------------------
> 
> Ok, but what can we do inside Cocoon without having to proprietarily
> extend the XML specifications?
> 
> Also, how can we simplify the sitemap evolution without compromising the
> rest of the system?
> 
> I think a possible solution is sitemap pluggability and compilation.
> 
> You could think at the sitemap like a big XSP taglib that is responsible
> to drive directly the execution of the resource creation pipelines.
> 
> It would also increase performance, since matching could be optimized
> and what not.
> 
> It would also allow different sitemap schemas to be developed. In
> theory, you could create your own sitemap schema.
> 
> Well, this collection of RT is admittedly wild.
> 
> Digest with caution but think about it extensively since I know many FS
> hides between the lines.

-- 
Mark (Poetaster) Washeim

'On the linen wrappings of certain mummified remains
found near the Etrurian coast are invaluable writings
that await translation.

Quem colorem habet sapientia?'

Evan S. Connell

 


