If you're using the defaults, the serialization used is XMI, which does indeed
require that the text being serialized contain only valid XML characters.
From the backtrace, I see that this is what's being used.
If you need to use UIMA-AS with invalid chars, you can do one of several things:
1) Change the type of the data holding these characters from String to some
form of byte sequence.
2) Change the way serialization is done among UIMA-AS components. There's a
"binary" serialization which might avoid this issue. It's faster, too, but it
has the drawback that the "client" and the "service" must have exactly the same
type system.
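
A further route, the one raised in the question below, is to sanitize the text
before it is set on the CAS. Here is a minimal sketch of that idea, assuming
plain Java with no UIMA dependencies. The class and method names are
hypothetical (they are not part of UIMA), and the character ranges are taken
from the Char production of the XML 1.0 specification.

```java
// Hypothetical helper, not part of UIMA: replaces characters that XML 1.0
// does not allow, so that XMI serialization of the text would not reject them.
final class XmlCharSanitizer {
    private XmlCharSanitizer() {}

    /** True if the code point is allowed by the XML 1.0 Char production. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns text with every disallowed character replaced by replacement. */
    static String sanitize(String text, char replacement) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            if (isXml10Char(cp)) {
                out.appendCodePoint(cp);
            } else {
                // Every disallowed code point (control characters, unpaired
                // surrogates, U+FFFE, U+FFFF) occupies exactly one char, so a
                // one-for-one replacement keeps the string length unchanged.
                out.append(replacement);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

In a collection reader this might be called as
XmlCharSanitizer.sanitize(rawText, ' ') just before the document text is set.
Replacing rather than deleting keeps character offsets the same as in the
source text, which matters if annotations are later mapped back to it.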
-Marshall
On 10/21/2011 1:58 PM, Charles Bearden wrote:
> I created a simple UIMA-AS pipeline comprising a collection reader and an
> aggregate AE, which I ran simply like so:
>
> runRemoteAsyncAE.sh tcp://localhost:61616 CollectionReader \
> -d <deployment descriptor> \
> -c <collection reader descriptor> \
>
> Evidently, the content I wish to process has some non-XML characters in it,
> because a certain bit of data raises an exception, the heart of which appears
> to be:
>
> Caused by: org.xml.sax.SAXParseException: Trying to serialize non-XML 1.0
> character: , 0x19
>
> The complete exception is here:
> <http://pastebin.com/rMPyAhqP>
>
> The point in my code at which the exception enters the picture
> (NoteLinesFromDBReader.java:139) is the point in the .getNext() method where I
> get the next CAS:
> jcas = aCAS.getJCas();
>
> I don't run into this problem when I use the old-fashioned CPE, so my thinking
> is that the CAS from the CR is being serialized before being put into the
> queue. Is the expectation in UIMA AS that I sanitize text artifacts of non-XML
> characters before the CR gets them? Or am I doing something else wrong perhaps?
>
> Thanks for your help,
> Chuck