uima-user mailing list archives

From Marshall Schor <...@schor.com>
Subject Re: Proccesing Bamun characters
Date Mon, 12 Dec 2016 16:35:52 GMT
Hi Nelson,

Looking into this... Can you please confirm that the UTF-8 encoding of the
troublesome characters, in hexadecimal, is:

F0 96 A6 80

F0 96 A6 90

EF BF BD

EF BF BD

If you have the string in Java, please try converting it to UTF-8 bytes using
something like:
  byte[] theBytes = myTestString.getBytes("UTF-8");

and then print out theBytes in hex; they should look like the above.  If not,
please let us know what the values are instead.
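
A minimal sketch of that check follows. The class name Utf8HexDump and the
helper toHex are hypothetical names chosen for illustration. The test string
is built from the code points U+16980, U+16990 and U+FFFD, which are what the
byte sequences listed above decode to; substitute the actual input string to
compare.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration; not part of UIMA or UIMA-AS.
final class Utf8HexDump {

    // Encodes the string as UTF-8 and returns the bytes as space-separated
    // two-digit uppercase hex values.
    static String toHex(String s) {
        byte[] theBytes = s.getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : theBytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so negative byte values are not sign-extended.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed test input: built from code points rather than pasted
        // literals, so the source file encoding cannot alter it.
        int[] codePoints = {0x16980, 0x16990, 0xFFFD, 0xFFFD};
        for (int cp : codePoints) {
            String one = new String(Character.toChars(cp));
            System.out.println(String.format("U+%04X", cp) + " -> " + toHex(one));
        }
    }
}
```

Note that each of the first two characters lies outside the Basic Multilingual
Plane, so in a Java String each one occupies two char values (a surrogate
pair) while encoding to four UTF-8 bytes.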


Thanks. -Marshall


On 12/9/2016 9:02 AM, nelson rivera wrote:
> Hi, I read your explanation and looked at the link, but in my case I
> don't read any XML file. I just copy the text, get a new input CAS
> from UimaAsynchronousEngine with getCAS(), set the text in the CAS, and
> send the request with sendCAS(). I use the uima-as API 2.9.0 on the client
> side. Apparently the characters are replaced by their corresponding
> entities when the CAS is serialized for sending, but I get the
> mentioned exception "org.xml.sax.SAXParseException; lineNumber: 1;
> columnNumber: 571; Character reference "&#"
> in the installed uima-as framework when it tries to deserialize the CAS
> with deserializeCasFromXmi() so that the service can process it.
>
> 2016-12-08 16:48 GMT-05:00, Marshall Schor <msa@schor.com>:
>> Hi Nelson,
>>
>> I can't see the characters (sorry).
>>
>> This might be an issue caused by a discrepancy between the encoding of the
>> file being read and the encoding indicated in the XML header.  Can you
>> check that those two things are the same?
>>
>> See
>> http://stackoverflow.com/questions/5165347/what-use-is-the-encoding-in-the-xml-header
>> for example.
>>
>> -Marshall
>>
>> On 12/8/2016 4:20 PM, nelson rivera wrote:
>>> I tried to process the following text in a service deployed in uima-as,
>>> because it is input to my application. This is the text: 𖦀  𖦐  �  �.
>>> These characters belong to the Bamun language, and apparently they are
>>> not invalid XML characters, because tools such as browsers interpret
>>> and display them. After getting a new input CAS for processing, setting
>>> the text, and sending the request, I get the exception shown below in
>>> uima-as. The uima-as framework keeps working and recovers correctly; it
>>> just does not process these characters.
>>> Could you tell me what happens with these characters? Is one of them
>>> an invalid character for the uima-as framework?
>>>
>>>
>>>
>>> 04:00:31.606 - 14: org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handleProcessRequestFromRemoteClient: WARNING:
>>> org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 571; Character reference "&#
>>>         at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1239)
>>>         at org.apache.uima.aae.UimaSerializer.deserializeCasFromXmi(UimaSerializer.java:187)
>>>         at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.deserializeCASandRegisterWithCache(ProcessRequestHandler_impl.java:222)
>>>         at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handleProcessRequestFromRemoteClient(ProcessRequestHandler_impl.java:552)
>>>         at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handle(ProcessRequestHandler_impl.java:1090)
>>>         at org.apache.uima.aae.handler.input.MetadataRequestHandler_impl.handle(MetadataRequestHandler_impl.java:78)
>>>         at org.apache.uima.adapter.jms.activemq.JmsInputChannel.onMessage(JmsInputChannel.java:731)
>>>
>>

