uima-user mailing list archives

From nelson rivera <nelsonriver...@gmail.com>
Subject Re: Proccesing Bamun characters
Date Mon, 12 Dec 2016 18:58:58 GMT
Yes, these are the values of the troublesome characters. Printing each
byte with Integer.toHexString() shows:

fffffff0 ffffff96 ffffffa6 ffffff80

fffffff0 ffffff96 ffffffa6 ffffff90

ffffffef ffffffbf ffffffbd

ffffffef ffffffbf ffffffbd
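
The leading "ffffff" in each value above comes from Integer.toHexString() sign-extending a negative Java byte to 32 bits; the low two hex digits match the expected F0 96 A6 80, F0 96 A6 90, and EF BF BD. The sequence EF BF BD is the UTF-8 encoding of U+FFFD, the Unicode replacement character. For anyone reproducing the check, here is a minimal sketch that masks each byte with 0xFF before formatting. The class name Utf8HexDump and the helper toHex are hypothetical names chosen for illustration, not part of UIMA.

```java
import java.nio.charset.StandardCharsets;

public class Utf8HexDump {
    // Formats each byte as two uppercase hex digits. Masking with 0xFF keeps
    // negative byte values from being sign-extended to eight hex digits.
    public static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+16980 is the code point whose UTF-8 form is F0 96 A6 80.
        String first = new String(Character.toChars(0x16980));
        System.out.println(toHex(first.getBytes(StandardCharsets.UTF_8)));
        // U+FFFD is the replacement character.
        System.out.println(toHex("\uFFFD".getBytes(StandardCharsets.UTF_8)));
    }
}
```

Run against the test string from this thread, the first two characters should each produce four bytes and the last two should each produce three.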

2016-12-12 11:35 GMT-05:00, Marshall Schor <msa@schor.com>:
> Hi Nelson,
>
> Looking into this... Can you please confirm that the UTF-8 encoding of the
> troublesome characters, in hexadecimal, is:
>
> F0 96 A6 80
>
> F0 96 A6 90
>
> EF BF BD
>
> EF BF BD
>
> If you have the string in Java, please try converting it to UTF-8 bytes
> using something like:
>   byte[] theBytes = myTestString.getBytes("UTF-8");
>
>   and then print out theBytes in hex; they should look like the above.  If
> not, please let us know what the values are instead.
>
>
> Thanks. -Marshall
>
>
> On 12/9/2016 9:02 AM, nelson rivera wrote:
>> Hi, I read your explanation and looked at the link, but in my case I
>> don't read any XML file. I just copy the text, get a new input CAS
>> from UimaAsynchronousEngine with getCAS(), set the text in the CAS, and
>> send the request with sendCAS(). I use the uima-as API 2.9.0 on the
>> client side. Apparently the characters are replaced by their
>> corresponding entities when the CAS is serialized for sending, but I get
>> the mentioned exception "org.xml.sax.SAXParseException; lineNumber: 1;
>> columnNumber: 571; Character reference "&#"
>> in the installed uima-as framework when it tries to deserialize the CAS
>> with deserializeCasFromXmi() so that the service can process it.
>>
>> 2016-12-08 16:48 GMT-05:00, Marshall Schor <msa@schor.com>:
>>> Hi Nelson,
>>>
>>> I can't see the characters (sorry).
>>>
>>> This might be an issue caused by a discrepancy between the encoding of
>>> the file being read and the encoding indicated in the XML header.  Can
>>> you check that those two things are the same?
>>>
>>> See
>>> http://stackoverflow.com/questions/5165347/what-use-is-the-encoding-in-the-xml-header
>>> for example.
>>>
>>> -Marshall
>>>
>>> On 12/8/2016 4:20 PM, nelson rivera wrote:
>>>> I tried to process the following text in a service deployed in uima-as,
>>>> because it is input to my application. This is the text: 𖦀  𖦐  �  �.
>>>> These characters belong to the Bamun language, and apparently they are
>>>> not invalid XML characters, because tools such as browsers interpret
>>>> and display them. After getting a new input CAS for processing, setting
>>>> the text, and sending the request, I get the exception shown below in
>>>> uima-as. The uima-as framework keeps working and recovers correctly; it
>>>> just does not process these characters.
>>>> Could you tell me what happens with these characters? Is one of them
>>>> an invalid character for the uima-as framework?
>>>>
>>>>
>>>>
>>>> 04:00:31.606 - 14:
>>>> org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handleProcessRequestFromRemoteClient:
>>>> WARNING:
>>>> org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 571;
>>>> Character reference "&#
>>>>         at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1239)
>>>>         at org.apache.uima.aae.UimaSerializer.deserializeCasFromXmi(UimaSerializer.java:187)
>>>>         at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.deserializeCASandRegisterWithCache(ProcessRequestHandler_impl.java:222)
>>>>         at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handleProcessRequestFromRemoteClient(ProcessRequestHandler_impl.java:552)
>>>>         at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handle(ProcessRequestHandler_impl.java:1090)
>>>>         at org.apache.uima.aae.handler.input.MetadataRequestHandler_impl.handle(MetadataRequestHandler_impl.java:78)
>>>>         at org.apache.uima.adapter.jms.activemq.JmsInputChannel.onMessage(JmsInputChannel.java:731)
>>>>
>>>
>
>
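
On the question in the original message of whether these are invalid XML characters: the XML 1.0 specification defines the legal character range in its Char production, and both U+16980 and U+FFFD fall inside it. The sketch below checks a string against that production, code point by code point. It also shows that half of a surrogate pair, taken alone, is outside the range. Whether the serializer in this case wrote the supplementary characters as separate surrogate character references is an assumption that nothing in this thread confirms. The class name XmlCharCheck and its methods are hypothetical and not part of UIMA.

```java
public class XmlCharCheck {
    // True if the code point matches the XML 1.0 "Char" production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    public static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // True if every code point in the string is a legal XML 1.0 character.
    // An unpaired surrogate is reported by codePoints() as its own value,
    // which lies in the excluded range D800-DFFF.
    public static boolean isXml10Text(String s) {
        return s.codePoints().allMatch(XmlCharCheck::isXml10Char);
    }

    public static void main(String[] args) {
        String text = new String(Character.toChars(0x16980))
                + new String(Character.toChars(0x16990))
                + "\uFFFD\uFFFD";
        // The four characters from the thread, as whole code points.
        System.out.println(isXml10Text(text));
        // The high surrogate of U+16980 on its own.
        System.out.println(isXml10Char(Character.highSurrogate(0x16980)));
    }
}
```

If the first check passes for the input text, the text itself is legal XML content, and any parse failure would have to come from how the characters were escaped during serialization rather than from the characters themselves.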
