forrest-dev mailing list archives

From David Crossley <cross...@indexgeo.com.au>
Subject Re: Entities for characters references
Date Fri, 28 Jun 2002 08:40:20 GMT
J.Pietschmann wrote:
> Hi,
> I'd like to question the wisdom of adding the entity
> definitions for characters.

I too have often wondered how to deal with these
issues and have not yet heard a good solution. I am certainly
no expert on this and would be pleased to hear from others.

> Rationale:
> - After a XML2XML transformation, the entities are lost
>   anyway.

Surely not "lost", rather "transformed" into something.
Cocoon has clever tools to see what is happening.
Use "cocoon-view" at various stages of a pipeline.
I will try some different examples and report back.

> - Every sane person uses a somewhat XML-aware editor for
>   editing xdocs, which can usually be configured to provide
>   support for entering special characters.

We cannot rely on that. People will edit XML docs with
various different tools, including plain text editors such
as vi, and some docs come from poor XML export routines in
crappy databases. I see a mixture of ways to represent
character entities in the various XML docs that I encounter.

I often wonder if some character entities are used
unnecessarily. I mean, the XML document instance declares
a certain encoding. Surely a document would only need
character entities for chars that are outside that set.
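
As a rough illustration of that idea, here is a minimal Python
sketch (not anything Forrest does today). The helper name
minimal_refs is hypothetical. It relies on Python's standard
"xmlcharrefreplace" error handler, which emits a numeric character
reference only for characters that the target encoding cannot
represent.

```python
def minimal_refs(text: str, encoding: str) -> str:
    """Return text in which only the characters that `encoding`
    cannot represent are written as numeric character references."""
    # Encode with the stdlib "xmlcharrefreplace" handler, then decode
    # again so the result can be compared as an ordinary string.
    return text.encode(encoding, "xmlcharrefreplace").decode(encoding)


sample = "caf\u00e9 10 \u20ac"  # e-acute and the euro sign

print(minimal_refs(sample, "utf-8"))       # nothing needs a reference
print(minimal_refs(sample, "iso-8859-1"))  # only the euro sign does
print(minimal_refs(sample, "ascii"))       # both characters do
```

With a UTF-8 declaration nothing in the sample needs a reference,
while a US-ASCII declaration needs one for every non-ASCII char.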

Perhaps there could be some Ant task that prepares the
documents for processing by rationalising all those
"external" entities beforehand. If we also had a
pre-processing facility to validate the xdocs, then we
could rely on them for production-time.
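
Such a preparation step might look roughly like the following
Python sketch rather than a real Ant task. The function name
rationalise is hypothetical, and the sketch assumes that the HTML 4
entity table in Python's html.entities module is an acceptable
stand-in for the ISO entity sets (it covers only part of them). It
rewrites known named entities as numeric character references and
leaves the predefined XML entities and any unknown names alone. It
is a plain text scan, so it does not skip CDATA sections or
comments.

```python
import re
from html.entities import name2codepoint

# Entities that every XML parser knows without a DTD; leave them alone.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

_NAMED_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def rationalise(text: str) -> str:
    """Replace known named entities with numeric character references."""
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # keep predefined or unknown entities
        return "&#%d;" % name2codepoint[name]
    return _NAMED_REF.sub(replace, text)


print(rationalise("a&nbsp;b &amp; caf&eacute; &unknown;"))
```

Once the docs contain only numeric references, they no longer
depend on the entity definition files being read at all.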

There is a facility that has been talked about on the
xml-commons list called DoctypeChanger which can remove
the document type declaration. In that way the catalog
entity resolver would not be called into action during
production.

> - Usually, special characters are used rarely, and often
>   even the entity name has to be looked up. Can look up the
>   codepoint as well in this situation. Worse: my Unicode
>   lookup database doesn't list entity names, only codepoints.

I see special characters being used often. One source that
I need to deal with (an RDBMS export) represents every char
that is not strictly alphanumeric as a character entity, even
comma and full stop!

> - The entity definitions add 5 files and >20k to read.
>   This adds noticeably to processing time. Compare to the size
>   of most documents.

Yes, this is an issue. I see no solution yet.

> - Adding the entities virtually *requires* catalog support.
>   (what's wrong with
>     a: putting the entity definition files in the DTD directory

That would just create a nightmare for DTD and entity
management. At least with Forrest we have them centralised.
However, as you indicate, then we must incur the "overhead"
of the catalog entity resolver. So yes, this is a solution
but not a desirable one.

>     b: putting the definitions in one file instead of 5?)

These ISO*.pen entity sets are published elsewhere as
separate sets. It is easier to keep them that way in CVS
so that we can remain in sync with any changes. If you know
of any official source that has them in one file
then we can adopt that. I think that OASIS is looking
to publish an official set at http://www.oasis-open.org/
(though I cannot find the reference today).

> - Entities don't mix well with non-DTD validation.

Not sure what you mean here. I have yet to experiment
properly with Relax NG.

> Well, are the entities there just because "everybody does
> it" and "some years ago, two people actually complained"
> (probably about &nbsp; missing)?

They are there because processing and validation will
break without them.

> Has somebody checked the whole lot of the Apache how often
> the entities are actually used?

That would be an interesting exercise. Perhaps we could
try it out with the current set of Cocoon xdocs. Could
someone devise a stylesheet to detect and summarise them?
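
One caveat is that a stylesheet only sees the document after the
parser has expanded the entities, so a scan of the raw text may be
easier. Here is a minimal Python sketch of that alternative, not a
stylesheet. The function name summarise_entities is hypothetical,
and the regular expression is a simplification of the XML name
rules. It counts every named and numeric reference found in a
collection of document texts.

```python
import re
from collections import Counter

# Matches &name; as well as decimal (&#160;) and hex (&#xA0;) references.
_REF = re.compile(r"&(#x[0-9A-Fa-f]+|#[0-9]+|[A-Za-z_][\w.-]*);")


def summarise_entities(texts):
    """Count how often each entity or character reference is used."""
    counts = Counter()
    for text in texts:
        counts.update(_REF.findall(text))
    return counts


docs = ["a&nbsp;b&nbsp;c &#160; &amp;", "&nbsp;&#xA0;"]
for name, count in summarise_entities(docs).most_common():
    print(name, count)
```

Feeding it the contents of each xdoc file would give the usage
summary that J.Pietschmann asked about.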

> Don't get me wrong, supporting entities for special characters
> appears to require some not quite trivial expense, I'd like
> to know whether there are actually benefits justifying this.
> 
> J.Pietschmann

It is a complex beast. I too would like it simplified.
--David


