uima-user mailing list archives

From Burn Lewis <burnle...@gmail.com>
Subject Re: Proccesing Bamun characters
Date Tue, 13 Dec 2016 20:10:21 GMT
I put these 3 characters as UTF-8 in a file in examples/data and ran the
MeetingDetector annotator as described in section 3.4 of the README, adding
the option "-o out".  In that folder I found the returned results in xmi
format with the characters in the sofaString element.  The relevant part of
this file in hex is:

000002e0: 7472 696e 673d 22*f0 96a6 80f0 96a6 90ef*  tring=".........
000002f0: *bfbd* 2623 3130 3b22 2f3e 3c63 6173 3a56  ..&#10;"/><cas:V

Note that the FileSystemCollectionReader by default uses the system
encoding but you could add a ConfigurationParameterSetting of UTF-8 for the
Encoding parameter in its descriptor.
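
As a rough illustration of why the reader's encoding matters, the Java sketch below decodes the four bytes of the first character in the hex dump above, once as UTF-8 and once as ISO-8859-1. The class name EncodingMismatchDemo and its method are made up for this example and are not part of UIMA. ISO-8859-1 merely stands in for a system default that is not UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingMismatchDemo {
    // UTF-8 bytes of the first character in the hex dump (F0 96 A6 80).
    static final byte[] BYTES = {(byte) 0xF0, (byte) 0x96, (byte) 0xA6, (byte) 0x80};

    /** Decodes BYTES with the given charset, as a reader using that encoding would. */
    public static String decode(Charset charset) {
        return new String(BYTES, charset);
    }

    public static void main(String[] args) {
        String good = decode(StandardCharsets.UTF_8);
        String bad = decode(StandardCharsets.ISO_8859_1);
        // With UTF-8 the four bytes form one supplementary code point;
        // with ISO-8859-1 each byte becomes its own unrelated character.
        System.out.printf("UTF-8:      %d code point(s), first is U+%X%n",
                good.codePointCount(0, good.length()), good.codePointAt(0));
        System.out.printf("ISO-8859-1: %d code point(s), first is U+%X%n",
                bad.codePointCount(0, bad.length()), bad.codePointAt(0));
    }
}
```

If the second decoding resembles what a reader produces on a given machine, setting the Encoding parameter explicitly is probably worth trying.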

With the client & server on different (Linux) machines I see no problem
with sending UTF-8 characters.
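
For anyone repeating the byte check requested further down the thread, here is a minimal Java sketch that prints a string's UTF-8 bytes as two-digit hex values together with its code points. The class name Utf8HexDump and its helper methods are hypothetical. Masking each byte with 0xff avoids the sign-extended ffffffXX form that Integer.toHexString() gives for negative bytes.

```java
import java.nio.charset.StandardCharsets;

class Utf8HexDump {
    /** Returns the UTF-8 bytes of s as space-separated two-digit hex values. */
    public static String toUtf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask with 0xff so a negative byte is not sign-extended to ffffffXX.
            sb.append(String.format("%02X", b & 0xff));
        }
        return sb.toString();
    }

    /** Returns the code points of s as space-separated U+XXXX values. */
    public static String toCodePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // The three characters from the hex dump, built from their code points
        // so that this source file does not depend on its own encoding.
        String text = new String(Character.toChars(0x16980))
                + new String(Character.toChars(0x16990))
                + "\uFFFD";
        System.out.println(toUtf8Hex(text));
        System.out.println(toCodePoints(text));
    }
}
```

The first two characters lie outside the Basic Multilingual Plane, so Java stores each as a surrogate pair. The third, U+FFFD, is the Unicode replacement character.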


On Mon, Dec 12, 2016 at 2:15 PM, Marshall Schor <msa@schor.com> wrote:

> Another question: I assume there are perhaps 2 machines involved here
> (it's a UIMA-AS setup).
>
> From the exception, it appears that the error happens when the client
> sends the CAS to the remote.
>
> Can you print out the Linux (assuming that's the OS) default locale for
> both machines?  (e.g. type "locale" into a command line and see what each
> machine has as its default character encoding).
>
> Please let us know what these are.
>
> Thanks. -Marshall
>
>
>
> On 12/12/2016 1:58 PM, nelson rivera wrote:
> > Yes these are the values of the troublesome characters, using
> > Integer.toHexString() to print out each byte, shows
> >
> > fffffff0 ffffff96 ffffffa6 ffffff80
> >
> > fffffff0 ffffff96 ffffffa6 ffffff90
> >
> > ffffffef ffffffbf ffffffbd
> >
> > ffffffef ffffffbf ffffffbd
> >
> > 2016-12-12 11:35 GMT-05:00, Marshall Schor <msa@schor.com>:
> >> Hi Nelson,
> >>
> >> Looking into this... Can you please confirm that the UTF-8 coding of the
> >> troublesome characters, in hexadecimal, is:
> >>
> >> F0 96 A6 80
> >>
> >> F0 96 A6 90
> >>
> >> EF BF BD
> >>
> >> EF BF BD
> >>
> >> If you have the string in Java, please try converting it to a UTF-8
> >> string using something like:
> >>   byte[] theBytes = myTestString.getBytes("UTF-8");
> >>
> >>   and then print out theBytes in hex; they should look like the above.
> >> If not, please let us know what the values are instead.
> >>
> >>
> >> Thanks. -Marshall
> >>
> >>
> >> On 12/9/2016 9:02 AM, nelson rivera wrote:
> >>> Hi, I read your explanation and looked at the link, but in my case I
> >>> don't read any XML file. I just copy the text, get a new input CAS
> >>> from UimaAsynchronousEngine with getCAS(), set the text in the CAS, and
> >>> send the request with sendCAS(). I use the UIMA-AS API 2.9.0 on the
> >>> client side. Apparently the characters are replaced by their
> >>> corresponding entities when the CAS is serialized for sending, but I
> >>> get the mentioned exception "org.xml.sax.SAXParseException; lineNumber: 1;
> >>> columnNumber: 571; Character reference "&#"
> >>> in the installed UIMA-AS framework when it tries to deserialize the CAS
> >>> with deserializeCasFromXmi() so that the service can process it.
> >>>
> >>> 2016-12-08 16:48 GMT-05:00, Marshall Schor <msa@schor.com>:
> >>>> Hi Nelson,
> >>>>
> >>>> I can't see the characters (sorry).
> >>>>
> >>>> This might be an issue caused by a discrepancy between the coding of
> >>>> the file being read, and the coding indicated on the xml header.  Can
> >>>> you check that those two things are the same?
> >>>>
> >>>> See
> >>>> http://stackoverflow.com/questions/5165347/what-use-is-the-encoding-in-the-xml-header
> >>>> for example.
> >>>>
> >>>> -Marshall
> >>>>
> >>>> On 12/8/2016 4:20 PM, nelson rivera wrote:
> >>>>> I tried to process the following text in a service deployed in
> >>>>> UIMA-AS, because it is input to my application. This is the text:
> >>>>> 𖦀  𖦐  �  �.
> >>>>> These characters belong to the Bamun language, and apparently they
> >>>>> are not invalid XML characters, because tools such as browsers
> >>>>> interpret and display them. After getting a new input CAS for
> >>>>> processing, setting the text, and sending the request, I get the
> >>>>> exception shown below in UIMA-AS. The framework keeps working and
> >>>>> recovers correctly; it just does not process these characters.
> >>>>> Could you tell me what happens with these characters? Is one of them
> >>>>> an invalid character for the UIMA-AS framework?
> >>>>>
> >>>>>
> >>>>>
> >>>>> 04:00:31.606 - 14:
> >>>>> org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handleProcessRequestFromRemoteClient:
> >>>>> WARNING:
> >>>>> org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 571;
> >>>>> Character reference "&#
> >>>>>         at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1239)
> >>>>>         at org.apache.uima.aae.UimaSerializer.deserializeCasFromXmi(UimaSerializer.java:187)
> >>>>>         at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.deserializeCASandRegisterWithCache(ProcessRequestHandler_impl.java:222)
> >>>>>         at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handleProcessRequestFromRemoteClient(ProcessRequestHandler_impl.java:552)
> >>>>>         at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handle(ProcessRequestHandler_impl.java:1090)
> >>>>>         at org.apache.uima.aae.handler.input.MetadataRequestHandler_impl.handle(MetadataRequestHandler_impl.java:78)
> >>>>>         at org.apache.uima.adapter.jms.activemq.JmsInputChannel.onMessage(JmsInputChannel.java:731)
> >>>>>
> >>
>
>
