xerces-j-users mailing list archives

From Eddie Robertsson <eroberts...@allette.com.au>
Subject Re: Error reporting from XML Schema and from Schematron (long)
Date Fri, 21 Jun 2002 07:05:46 GMT
Hi Jan,

Rick Jelliffe asked me to forward the following to this list since he is 
a non-subscriber:


There has been talk of adding XPath to the locators used in
SAX.  That would be a great idea.  Line numbers are 
useful sometimes, and paths are useful at others, so IMHO we
need a SAX infrastructure that can provide either.
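
A minimal sketch of how a path could be tracked alongside SAX events,
assuming a hypothetical PathTrackingHandler (not part of SAX or Xerces)
that counts same-named siblings. It uses qualified names as they arrive
and does not resolve namespaces, and it puts an index on the root
element as well.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: maintains a positional path such as
// /root[1]/a[1]/b[23] for the element currently being parsed.
class PathTrackingHandler extends DefaultHandler {
    // One entry per open element, root first: the path step, e.g. "b[23]".
    private final Deque<String> steps = new ArrayDeque<>();
    // Stack of per-parent counters: how many children of each name were seen.
    private final Deque<Map<String, Integer>> childCounts = new ArrayDeque<>();

    PathTrackingHandler() {
        childCounts.push(new HashMap<>()); // counter for the document node
    }

    @Override
    public void startDocument() {
        steps.clear();
        childCounts.clear();
        childCounts.push(new HashMap<>());
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Position of this element among same-named siblings, 1-based.
        int position = childCounts.peek().merge(qName, 1, Integer::sum);
        steps.addLast(qName + "[" + position + "]");
        childCounts.push(new HashMap<>());
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        steps.removeLast();
        childCounts.pop();
    }

    // Path of the innermost open element, or "/" outside the root.
    String currentPath() {
        return steps.isEmpty() ? "/" : "/" + String.join("/", steps);
    }
}
```

An error handler holding a reference to such a handler could call
currentPath() when an error arrives and report it next to the line
number from the Locator.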

Personally, I think an error object should be able to provide
  - file/line/character number
  - XPath
  - severity indicator
  - sender ID
  - nickname or error-code
  - single line overview
  - multiline diagnostic, XML
  - icon for that error
  - URL for see also
  - unique ID for keying a repair method
  - unique ID for diagnostic generating function

This would support Schematron and XSD well.
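
As a rough illustration only, the list above could map onto a plain
Java type like the following. Every name here (Diagnostic, Severity,
the field names) is hypothetical; nothing like it exists in SAX or
Xerces.

```java
// Hypothetical names throughout; this is a sketch, not an existing API.
enum Severity { INFO, WARNING, ERROR, FATAL }

final class Diagnostic {
    // Physical location: file, line and character number.
    String systemId;
    int line = -1;
    int column = -1;
    // Logical location.
    String xpath;
    Severity severity = Severity.ERROR;
    String senderId;     // which validator raised it, e.g. "xsd" or "schematron"
    String code;         // nickname or error code
    String summary;      // single-line overview
    String detailXml;    // multiline diagnostic, as XML
    String iconUri;      // icon for that error
    String seeAlsoUrl;   // URL for "see also"
    String repairId;     // key for looking up a repair method
    String generatorId;  // key for the diagnostic-generating function

    // Prefers the XPath when present, otherwise file:line:column.
    String location() {
        if (xpath != null) {
            return xpath;
        }
        return systemId + ":" + line + ":" + column;
    }

    // Single-line rendering suitable for a message list.
    String oneLine() {
        return severity + " [" + senderId + "/" + code + "] " + summary + " at " + location();
    }
}
```

Keeping both kinds of location on one object lets the consumer decide
which to show, rather than the parser deciding for it.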

My company has also been using Xerces-J as well as
Schematron (and also RELAX NG and DTDs) in
an editor product now in beta testing.   

I had to rewrite almost all the Xerces error messages
because they were incomprehensible to end-users.
(I don't know if it is worthwhile contributing these,
because some of them are specific to our system
or leave out diagnostics of errors that cannot happen
for us.)

One improvement that I found useful was to first
classify all errors as either document errors
or schema errors. At the moment everything is
mixed together, and a layman has no way of 
knowing whether the document is bad or the 
schema is bad. So the first thing I did was
to prefix all schema errors with (Schema error).
Then I rewrote all the other errors for end-users,
in product-specific terms.
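
A minimal sketch of the prefixing step. The heuristic is an assumption
for illustration, not how our product or Xerces decides: an error is
treated as a schema error when its locator points at some resource
other than the instance document. ErrorOriginClassifier is a
hypothetical name, and external entities would be misclassified by
this rule.

```java
import org.xml.sax.SAXParseException;

// Hypothetical helper: labels errors located outside the instance document.
class ErrorOriginClassifier {
    private final String documentSystemId;

    ErrorOriginClassifier(String documentSystemId) {
        this.documentSystemId = documentSystemId;
    }

    // Assumption: anything reported against another system identifier
    // (for example the .xsd file) is a problem in the schema.
    boolean isSchemaError(SAXParseException e) {
        String where = e.getSystemId();
        return where != null && !where.equals(documentSystemId);
    }

    // Prefixes schema errors so an end-user can tell the two kinds apart.
    String label(SAXParseException e) {
        return (isSchemaError(e) ? "(Schema error) " : "") + e.getMessage();
    }
}
```

An org.xml.sax.ErrorHandler could pass each exception through label()
before showing the message to the user.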

Actually, I do tend to think that one should always
expect to rewrite error messages for a particular
system.  But for Xerces' case, it would be nice
if the messages were a little less programmer-oriented
in the first place.

The two worst offenders are:
 1) Error messages relating to the DOCTYPE 
declaration.  A missing system identifier in
the DOCTYPE declaration is diagnosed as
being caused by a missing space.  If there is
no entity, then IIRC the user gets a message
to the effect that there is an error in "null".  
Problems that occur before the perceived
start of the document are very off-putting.

 2) The XSD error messages.  These are
fairly poor: you have to learn to ignore
the reference to the XSD outcome code
and the parenthetic content models at
the end. 

Finally, on the issue of migrating from
XSD to Schematron: one thing that may
be helpful is Francis Norton's typeTagger.
This is an XSLT stylesheet that adds 
xsi:type attributes to a document, based
on an XSD schema.  So you can continue
to describe your basic structures and
datatypes using XSD, but give the
Schematron access to that typing:

  <rule context="*[@xsi:type='address']">
    <assert test="*[@xsi:type='street']"
    >A <name/> is an address, and so
    it needs some kind of street, for example &lt;strasse>.</assert>
  </rule>

Rick Jelliffe


Jan Dvorak wrote:

>Hello all,
>I'm facing the following problem, and I'd very much appreciate comments from 
>people on this list. I apologize for the longer post.
>We have constructed an XML Schema and a Schematron schema that both together 
>constrain the data we accept into an information system. Technically 
>speaking, it works great. We have an extensive suite of XML inputs and 
>corresponding expected error reports and we use these to test that the 
>checker does what it's supposed to do.
>Where this fails is in the error reports from the XML Schema validation. In 
>the Schematron reports we can speak in terms of the problem domain (data 
>about scientific projects, their participants and the financial support 
>thereof) - we write the messages ourselves. So we can e.g. report that a 
>project should specify the date when it started. That's understandable to all 
>our users and we can provide all useful diagnostics as to where the problem 
>is located. However, if we place this constraint in the XML Schema, all we 
>get is a cvc-something error report that says that the content of element 
>'lifecycle' doesn't match its model. This is accompanied with line and column 
>numbers. In this form, our users find it pretty much indigestible.
>The first idea I had was to run away from XML Schema, to place all 
>constraints in the Schematron schema. There might even be a way to 
>automatically generate the Schematron constraints from an XML Schema, 
>where we might be able to adjust the violation report texts. If we are sure 
>all constraints from the XML Schema are moved to Schematron, we could skip 
>the XML Schema validation step.
>However, moving all the constraints to Schematron would increase the 
>number of assertions from some 800 to some 6000 (est.) and that's a level of 
>complexity neither we, nor our customer can afford. We might also face 
>performance problems.
>The feasible way out of this seems that of gradually adding checks into the 
>Schematron schema to report violations there. We'll start with the most 
>frequent ones, and continue with those where the error reports are especially 
>cryptic. In the process, we would need to know at every moment that no error 
>remains unreported. We might report an error twice, but then a simple 
>correction - suppression of the report by XML Schema validator - should take 
>care of that. In the end, we might find that something like 30% of the 
>constraints are moved to Schematron.
>Now, we need to selectively suppress those XML Schema violations that will be 
>reported by Schematron. We can't move the XML Schema constraint types one by 
>one. It will always be a constraint type in a specific context (of an element 
>type, or of an XML Schema type).
>For that, we could use a common way of locating errors. I'm afraid that 
>getting the physical locations from Schematron is too difficult a task and 
>the result might not quite match the physical locations by Xerces. On the 
>other hand, Schematron can reliably produce 'logical' locations, something 
>like 'canonical XPath' to the node where the violation occurred. E.g. 
>'/root/a[1]/b[23]' meaning the 23rd 'b' child of the first 'a' child of 
>'root'. (Things are more difficult in the presence of namespaces, but still 
>How difficult would it be to extend Xerces to:
> (i) Produce 'logical' locations in terms of 'canonical' XPaths
>     as described above.
> (ii) Pass these locations to XMLErrorReporter.
>Then I could set up a filtering XMLErrorReporter that would let me gradually 
>move violation reports from XML Schema to Schematron. 
>Is there a better way to achieve our goal?
>Jan Dvorak
>MathAn Praha
>To unsubscribe, e-mail: xerces-j-user-unsubscribe@xml.apache.org
>For additional commands, e-mail: xerces-j-user-help@xml.apache.org
