uima-dev mailing list archives

From "Eddie Epstein (JIRA)" <uima-...@incubator.apache.org>
Subject [jira] Reopened: (UIMA-387) XMI Serializer can write invalid control characters
Date Sun, 10 Jun 2007 11:52:27 GMT

     [ https://issues.apache.org/jira/browse/UIMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eddie Epstein reopened UIMA-387:
--------------------------------


The XMI format CAS is the standard for data exchange between UIMA-compliant components. It
seems fundamentally wrong to allow an annotator to create an invalid String that cannot be
represented in the XMI format. For arbitrary data content, the Apache UIMA Java and C++
frameworks support 8-, 16-, 32- and 64-bit array types.

Eliminating bad characters only when serializing to XMI leads to different behavior when
analytics are deployed as services versus when they are all colocated. Wouldn't it be best to
eliminate bad characters as early as possible, when String features are created?

An exception should be thrown when bad characters are detected. For backwards compatibility,
annotators running in the IBM compatibility wrapper would not get the exception.
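
A minimal sketch of what such an early check could look like, assuming a hypothetical
helper class XmlCharValidator (not part of the UIMA API) that a String feature setter
might call. The character ranges are those of the XML 1.0 Char production; the class
name, method names and exception type are illustrative assumptions only.

```java
/** Hypothetical helper, not part of the UIMA API: rejects Strings that XML 1.0 cannot represent. */
final class XmlCharValidator {

    private XmlCharValidator() {}

    /** True if the code point matches the XML 1.0 Char production. */
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the UTF-16 index of the first invalid character, or -1 if the String is clean. */
    static int firstInvalidIndex(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!isValidXml10(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    /** Returns s unchanged, or throws if it contains a character XML 1.0 cannot carry. */
    static String requireValid(String s) {
        int i = firstInvalidIndex(s);
        if (i >= 0) {
            throw new IllegalArgumentException(String.format(
                    "Invalid XML 1.0 character U+%04X at index %d", s.codePointAt(i), i));
        }
        return s;
    }
}
```

Under this sketch, a compatibility wrapper could simply skip the requireValid call so that
existing annotators keep their current behavior.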

Comments?

> XMI Serializer can write invalid control characters
> ---------------------------------------------------
>
>                 Key: UIMA-387
>                 URL: https://issues.apache.org/jira/browse/UIMA-387
>             Project: UIMA
>          Issue Type: Bug
>          Components: Core Java Framework
>    Affects Versions: 2.1
>            Reporter: Adam Lally
>            Assignee: Thilo Goetz
>             Fix For: 2.2
>
>
> On 5/1/07, Leo Ferres <lferres@ccs.carleton.ca> wrote:
> > Hello,
> >
> > While trying to open an XMI file after processing in XML view, an
> > error pops up telling me that there is an invalid &#26 XML character.
> > The error comes from the SAX parser. Below is the stack trace. Thanks
> > very much for your help,
> >
> Most control characters are not allowed in XML 1.0, even if they are
> escaped with &#xxx.  If your input document contains such characters,
> the XMI CAS serializer is writing them to the output XMI document,
> making it unreadable.
>
> I checked that if you edit the XMI document and change the first line to:
> <?xml version="1.1" encoding="UTF-8"?>
> the problem goes away, because XML version 1.1 does allow escaped
> control characters.
>
> So one possibility for us to fix this in UIMA is to have the XMI CAS
> Serializer generate an XML version 1.1 declaration by default.  (I think we
> considered that before and decided not to for some reason, maybe we
> were worried that other applications might not be able to consume XML
> 1.1?  I can't remember. :)
>
> Another possibility would be to have the XMI serializer automatically
> replace these characters with spaces.  The XCAS (not XMI) serializer
> does that, but only for the document text, not for feature values.  We
> could also serialize the XMI using XML version 1.1, which allows
> escaped control characters (but still not the 0x00 character).
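
The "replace with spaces" option mentioned in the description could be sketched roughly
as follows. XmlCharSanitizer is a hypothetical class name, not the actual XCAS or XMI
serializer code; the valid ranges are taken from the XML 1.0 Char production.

```java
/** Hypothetical helper, not the actual UIMA serializer code. */
final class XmlCharSanitizer {

    private XmlCharSanitizer() {}

    /** True if the code point matches the XML 1.0 Char production. */
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Replaces every character that XML 1.0 cannot represent with a single space. */
    static String replaceInvalidWithSpace(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (isValidXml10(cp)) {
                sb.appendCodePoint(cp);
            } else {
                // Every invalid code point occupies exactly one UTF-16 unit,
                // so a one-for-one swap keeps the String length unchanged.
                sb.append(' ');
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }
}
```

Because each invalid character is swapped for exactly one space, the length of the
String is preserved, so annotation begin/end offsets into the document text would stay valid.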

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

