uima-user mailing list archives

From nelson rivera <nelsonriver...@gmail.com>
Subject Re: Proccesing Bamun characters
Date Wed, 14 Dec 2016 14:38:28 GMT
We also ran that test with the UIMA framework and the RunAE tool, with
the characters in a file as you did, and indeed there is no problem. The
problem occurs with UIMA-AS: when I use sendCAS() with
UimaAsynchronousEngine and the remote UIMA-AS service tries to
deserialize the CAS in deserializeCasFromXmi(), I get the exception
mentioned earlier:
"org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 571;
Character reference "&#"
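
For reference, here is a minimal sketch of how the text could be checked against the XML 1.0 "Char" production before it is put in the CAS. The class and method names (XmlCharCheck, isXml10Char, firstInvalidIndex) are hypothetical and are not part of UIMA or UIMA-AS. Under this check the two Bamum code points and U+FFFD are all allowed; only something like an unpaired surrogate would be rejected. It does not by itself explain the exception, it only helps rule the input string in or out.

```java
class XmlCharCheck {
    /** True if the code point matches the XML 1.0 "Char" production. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /**
     * Returns the index (in UTF-16 units) of the first code point that is
     * not a valid XML 1.0 character, or -1 if every code point is valid.
     * An unpaired surrogate is reported as invalid, because codePointAt()
     * returns the surrogate value itself in that case.
     */
    static int firstInvalidIndex(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (!isXml10Char(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    public static void main(String[] args) {
        // U+16980 and U+16990 written as surrogate pairs, followed by U+FFFD.
        String sample = "\uD81A\uDD80\uD81A\uDD90\uFFFD";
        System.out.println(firstInvalidIndex(sample));      // -1: all valid
        // A high surrogate with no low surrogate after it is not valid XML.
        System.out.println(firstInvalidIndex("a\uD81A"));   // 1
    }
}
```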

In my case I do not read any file and do not use
FileSystemCollectionReader. The user enters the text, the text is stored
in a Java String (UTF-16) and set on the CAS that will be processed,
using setDocumentLanguage, and then I send the CAS.
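
Because the text lives in a Java String and not in a file, its contents can be checked directly. The sketch below is a hedged example of the hex dump requested earlier in the thread; the class and method names (Utf8HexDump, toUtf8Hex, toCodePoints) are hypothetical. It masks each byte with 0xFF, because a Java byte is signed and Integer.toHexString() sign-extends it, which is why the earlier output reads fffffff0 and not f0.

```java
import java.nio.charset.StandardCharsets;

class Utf8HexDump {
    /** Returns the UTF-8 bytes of the text as space-separated two-digit hex. */
    static String toUtf8Hex(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // b & 0xFF drops the sign extension, so 0xF0 prints as F0.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Lists each Unicode code point (not each UTF-16 char) as U+XXXX. */
    static String toCodePoints(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+16980 and U+16990 written as surrogate pairs, followed by U+FFFD.
        String sample = "\uD81A\uDD80\uD81A\uDD90\uFFFD";
        System.out.println(toUtf8Hex(sample));    // F0 96 A6 80 F0 96 A6 90 EF BF BD
        System.out.println(toCodePoints(sample)); // U+16980 U+16990 U+FFFD
    }
}
```

The byte sequence EF BF BD is U+FFFD, the Unicode replacement character, so a string that dumps this way already contains replacement characters before it is serialized.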

2016-12-13 15:10 GMT-05:00, Burn Lewis <burnlewis@gmail.com>:
> I put these 3 characters as UTF-8 in a file in examples/data and ran the
> MeetingDetector annotator as described in section 3.4 of the README, adding
> the option "-o out".  In that folder I found the returned results in xmi
> format with the characters in the sofaString element.  The relevant part of
> this file in hex is:
>
> 000002e0: 7472 696e 673d 22*f0 96a6 80f0 96a6 90ef*  tring=".........
> 000002f0: *bfbd* 2623 3130 3b22 2f3e 3c63 6173 3a56  ..&#10;"/><cas:V
>
> Note that the FileSystemCollectionReader by default uses the system
> encoding but you could add a ConfigurationParameterSetting of UTF-8 for the
> Encoding parameter in its descriptor.
>
> With the client & server on different (Linux) machines I see no problem
> with sending UTF-8 characters.
>
>
> On Mon, Dec 12, 2016 at 2:15 PM, Marshall Schor <msa@schor.com> wrote:
>
>> another question:  I assume there are perhaps 2 machines involved here
>> (it's a UIMA-AS setup).
>>
>> From the exception, it appears that the error happens when the client
>> sends the CAS to the remote.
>>
>> Can you print out the Linux (assuming that's the OS) default locale for
>> both machines?  (e.g. type "locale" into a command line and see what each
>> machine has as its default character encoding).
>>
>> Please let us know what these are.
>>
>> Thanks. -Marshall
>>
>>
>>
>> On 12/12/2016 1:58 PM, nelson rivera wrote:
>> > Yes these are the values of the troublesome characters, using
>> > Integer.toHexString() to print out each byte, shows
>> >
>> > fffffff0 ffffff96 ffffffa6 ffffff80
>> >
>> > fffffff0 ffffff96 ffffffa6 ffffff90
>> >
>> > ffffffef ffffffbf ffffffbd
>> >
>> > ffffffef ffffffbf ffffffbd
>> >
>> > 2016-12-12 11:35 GMT-05:00, Marshall Schor <msa@schor.com>:
>> >> Hi Nelson,
>> >>
>> >> Looking into this... Can you please confirm that the UTF-8 coding of
>> >> the
>> >> troublesome characters, in hexadecimal, is:
>> >>
>> >> F0 96 A6 80
>> >>
>> >> F0 96 A6 90
>> >>
>> >> EF BF BD
>> >>
>> >> EF BF BD
>> >>
>> >> If you have the string in Java, please try converting it to a UTF-8
>> >> string using something like:
>> >>   byte[] theBytes = myTestString.getBytes("UTF-8");
>> >>
>> >>   and then print out theBytes in hex; they should look like the above.
>> >> If not, please let us know what the values are instead.
>> >>
>> >>
>> >> Thanks. -Marshall
>> >>
>> >>
>> >> On 12/9/2016 9:02 AM, nelson rivera wrote:
>> >>> Hi, I read your explanation and looked at the link, but in my case I
>> >>> do not read any XML file. I just copy the text, get a new input CAS
>> >>> from UimaAsynchronousEngine with getCAS(), set the text in the CAS,
>> >>> and send the request with sendCAS(). I use the UIMA-AS API 2.9.0 on
>> >>> the client side. Apparently the characters are replaced by their
>> >>> corresponding entities when the CAS is serialized for sending, but I
>> >>> get the mentioned exception "org.xml.sax.SAXParseException; lineNumber: 1;
>> >>> columnNumber: 571; Character reference "&#"
>> >>> in the installed UIMA-AS framework when it tries to deserialize the
>> >>> CAS with deserializeCasFromXmi() so that the service can process it.
>> >>>
>> >>> 2016-12-08 16:48 GMT-05:00, Marshall Schor <msa@schor.com>:
>> >>>> Hi Nelson,
>> >>>>
>> >>>> I can't see the characters (sorry).
>> >>>>
>> >>>> This might be an issue caused by a discrepancy between the coding of
>> >>>> the file being read, and the coding indicated on the xml header.  Can
>> >>>> you check that those two things are the same?
>> >>>>
>> >>>> See
>> >>>> http://stackoverflow.com/questions/5165347/what-use-is-the-encoding-in-the-xml-header
>> >>>> for example.
>> >>>>
>> >>>> -Marshall
>> >>>>
>> >>>> On 12/8/2016 4:20 PM, nelson rivera wrote:
>> >>>>> I tried to process the following text in a service deployed in
>> >>>>> UIMA-AS, because it is input to my application. This is the text:
>> >>>>> 𖦀 𖦐 � �.
>> >>>>> These characters belong to the Bamun language, and apparently they
>> >>>>> are not invalid XML characters, because tools such as browsers
>> >>>>> interpret and display them. After getting a new input CAS for
>> >>>>> processing, setting the text, and sending the request, I get the
>> >>>>> exception shown below in UIMA-AS. The UIMA-AS framework keeps working
>> >>>>> and recovers correctly; it just does not process these characters.
>> >>>>> Could you tell me what happens with these characters? Is one of them
>> >>>>> an invalid character for the UIMA-AS framework?
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> 04:00:31.606 - 14:
>> >>>>> org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handleProcessRequestFromRemoteClient:
>> >>>>> WARNING:
>> >>>>> org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 571;
>> >>>>> Character reference "&#
>> >>>>>         at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1239)
>> >>>>>         at org.apache.uima.aae.UimaSerializer.deserializeCasFromXmi(UimaSerializer.java:187)
>> >>>>>         at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.deserializeCASandRegisterWithCache(ProcessRequestHandler_impl.java:222)
>> >>>>>         at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handleProcessRequestFromRemoteClient(ProcessRequestHandler_impl.java:552)
>> >>>>>         at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handle(ProcessRequestHandler_impl.java:1090)
>> >>>>>         at org.apache.uima.aae.handler.input.MetadataRequestHandler_impl.handle(MetadataRequestHandler_impl.java:78)
>> >>>>>         at org.apache.uima.adapter.jms.activemq.JmsInputChannel.onMessage(JmsInputChannel.java:731)
>> >>>>>
>> >>
>>
>>
>
