uima-user mailing list archives

From Marshall Schor <...@schor.com>
Subject Re: UIMA-AS: non-XML char in text raises SAXParseException
Date Fri, 21 Oct 2011 20:13:34 GMT
If you're using the various defaults, the serialization used is XMI, which
does indeed require that the text being serialized contain only valid XML
characters. From the backtrace, I see that this is what's being used.

If you need to use UIMA-AS with invalid chars, you can do one of several things:

1) Change the type of the data holding these from String to some form of byte
sequence.
2) Change the way serialization is done among UIMA-AS components. There's a
"binary" serialization which might avoid this issue (it's faster, too, but it
has the drawback that the "client" and the "service" must have exactly the same
type system).
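
The other route, the one raised in the original question, is to sanitize the
text before it goes into the CAS. Here is a minimal sketch of that, assuming
plain Java with no UIMA dependencies. The class and method names
(XmlCharSanitizer, isValidXml10, sanitize) are made up for illustration and are
not part of UIMA. It follows the Char production of the XML 1.0 specification
and replaces each invalid UTF-16 code unit with a single replacement char, so
that string offsets stay the same.

```java
// Hypothetical helper, not part of UIMA: replaces characters that XML 1.0
// does not allow, so that XMI serialization no longer rejects the text.
final class XmlCharSanitizer {

    private XmlCharSanitizer() {
    }

    // XML 1.0 "Char" production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isValidXml10(int codePoint) {
        return codePoint == 0x9
                || codePoint == 0xA
                || codePoint == 0xD
                || (codePoint >= 0x20 && codePoint <= 0xD7FF)
                || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
                || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    // Returns a copy of text in which every invalid code point is replaced.
    // One replacement char is written per UTF-16 code unit consumed, so the
    // result has the same length as the input and offsets are preserved.
    static String sanitize(String text, char replacement) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            int codePoint = text.codePointAt(i);
            int units = Character.charCount(codePoint);
            if (isValidXml10(codePoint)) {
                out.appendCodePoint(codePoint);
            } else {
                for (int k = 0; k < units; k++) {
                    out.append(replacement);
                }
            }
            i += units;
        }
        return out.toString();
    }
}
```

In a collection reader this would presumably be called just before the
document text is set, for example
jcas.setDocumentText(XmlCharSanitizer.sanitize(rawText, ' ')), where rawText
stands for whatever string the reader fetched. Keeping the length unchanged
matters if annotations are later mapped back to offsets in the original data.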

-Marshall

On 10/21/2011 1:58 PM, Charles Bearden wrote:
> I created a simple UIMA-AS pipeline comprising a collection reader and an
> aggregate AE, which I ran simply like so:
>
> runRemoteAsyncAE.sh tcp://localhost:61616 CollectionReader \
>   -d <deployment descriptor> \
>   -c <collection reader descriptor>
>
> Evidently, the content I wish to process has some non-XML characters in it,
> because a certain bit of data raises an exception, the heart of which appears
> to be:
>
>   Caused by: org.xml.sax.SAXParseException: Trying to serialize non-XML 1.0
> character: , 0x19
>
> The complete exception is here:
>   <http://pastebin.com/rMPyAhqP>
>
> The point in my code at which the exception enters the picture
> (NoteLinesFromDBReader.java:139) is the point in the .getNext() method where I
> get the next CAS:
>   jcas = aCAS.getJCas();
>
> I don't run into this problem when I use the old-fashioned CPE, so my thinking
> is that the CAS from the CR is being serialized before being put into the
> queue. Is the expectation in UIMA AS that I sanitize text artifacts of non-XML
> characters before the CR gets them? Or am I doing something else wrong perhaps?
>
> Thanks for your help,
> Chuck
