uima-user mailing list archives

From nelson rivera <nelsonriver...@gmail.com>
Subject Re: Proccesing Bamun characters
Date Fri, 16 Dec 2016 13:37:50 GMT
According to Wikipedia, the Bamum script
(https://en.wikipedia.org/wiki/Bamum_script) has another valid range,
U+16800–U+16A3F, and any of these characters generates the same log
trace. I will continue testing Marshall Schor's suggestion.
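
A minimal sketch of checking which of the two ranges mentioned in this thread a code point falls into. It assumes only the ranges quoted here (U+A6A0–U+A6F7 and U+16800–U+16A3F); the class and method names are made up for illustration.

```java
public class BamumRanges {
    // Range quoted from Wikipedia for the 88 Bamum characters.
    static boolean inBamum(int cp) {
        return cp >= 0xA6A0 && cp <= 0xA6F7;
    }

    // Second range quoted from Wikipedia (supplementary plane).
    static boolean inBamumSupplement(int cp) {
        return cp >= 0x16800 && cp <= 0x16A3F;
    }

    public static void main(String[] args) {
        // Code points discussed in this thread.
        int[] tested = {0x16980, 0x16990, 0xFFFD};
        for (int cp : tested) {
            System.out.printf("U+%04X bamum=%b supplement=%b%n",
                    cp, inBamum(cp), inBamumSupplement(cp));
        }
    }
}
```

With these ranges, U+16980 and U+16990 fall into the second range, while U+FFFD falls into neither.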

2016-12-14 18:07 GMT-05:00, Burn Lewis <burnlewis@gmail.com>:
> I think there's another problem ... the characters we have tested with are
> not in the Bamum unicode set.  The first 2 that Marshall listed in utf-8
> (F0 96 A6 80 & F0 96 A6 90) are in hex x16980 & x16990 and the 3rd (EF BF
> BD) is xFFFD.  This last one is the "replacement character" used when an
> illegal character is encountered.  According to Wikipedia the 88 Bamum
> characters are in the range xA6A0 - xA6F7.
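
The byte-to-code-point conversion described above can be reproduced with a short sketch like the following. It assumes Java 8 or later (for `String.codePoints()`); the class name is hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class Utf8ToCodePoints {
    // Decode a UTF-8 byte sequence and return the code points it contains.
    static int[] decode(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8).codePoints().toArray();
    }

    public static void main(String[] args) {
        // The three byte sequences listed in this thread.
        byte[] bytes = {
            (byte) 0xF0, (byte) 0x96, (byte) 0xA6, (byte) 0x80,
            (byte) 0xF0, (byte) 0x96, (byte) 0xA6, (byte) 0x90,
            (byte) 0xEF, (byte) 0xBF, (byte) 0xBD
        };
        for (int cp : decode(bytes)) {
            System.out.printf("U+%04X%n", cp); // U+16980, U+16990, U+FFFD
        }
    }
}
```
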
>
> In order to reproduce your problem we need to use the same codepoints.  Can
> you tell us what the hex values of the failing characters are, in UTF-8 or
> UTF-16?
>
> By the way, the test I ran was using UIMA-AS's runRemoteAsyncAE, not runAE,
> following the quick test described in the UIMA-AS README.
>
> On Wed, Dec 14, 2016 at 4:15 PM, Marshall Schor <msa@schor.com> wrote:
>
>> Maybe we've been on the wrong line of thinking.
>>
>> Perhaps the translation between UTF-8 (during transportation) and the
>> string characters is fine, but the XML parsing is restricting the
>> character set it uses.
>>
>> See https://en.wikipedia.org/wiki/Valid_characters_in_XML
>>
>> where it says valid xml characters exclude the "surrogates", which I
>> think your characters are.
>>
>> So, perhaps it's XML parsing which is complaining (and it appears this is
>> so, from the stack trace).
>>
>> We should point out that UIMA's character offsets (like begin and end)
>> were designed with Java String character offsets, and will perhaps not
>> work correctly when surrogates are being used.
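
A minimal sketch of the two points above: how Java String offsets count UTF-16 units rather than characters, and how the XML 1.0 `Char` production treats a whole supplementary code point versus a lone surrogate. It does not confirm the cause of the exception in this thread; the class and method names are hypothetical.

```java
public class SurrogateCheck {
    // XML 1.0 "Char" production, applied to a single code point value.
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    public static void main(String[] args) {
        // One supplementary character, stored as a surrogate pair in a Java String.
        String s = new String(Character.toChars(0x16980));

        System.out.println("UTF-16 units: " + s.length());                      // 2
        System.out.println("code points:  " + s.codePointCount(0, s.length())); // 1
        System.out.println("whole code point is an XML Char: "
                + isXml10Char(s.codePointAt(0)));                               // true
        System.out.println("high surrogate alone is an XML Char: "
                + isXml10Char(s.charAt(0)));                                    // false
    }
}
```

An annotation covering this one character would therefore have begin = 0 and end = 2 in Java String offsets.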
>>
>> A possible workaround for this particular issue may be to switch to
>> binary serialization, instead of xmi serialization. This has a
>> restriction in that the type systems must be identical (between the
>> client and server).
>>
>> We could possibly get more confirmation of this hypothesis if you could
>> say what the stack trace was, beyond the first bit which you stated in
>> your original note.  There should be more stack trace information,
>> further down, starting with "caused by ..." which may provide more
>> helpful information.
>>
>> -Marshall
>>
>>
>> On 12/14/2016 9:38 AM, nelson rivera wrote:
>> > We also ran that test with the UIMA framework and the RunAE tool, with
>> > the characters in a file as you did, and indeed there is no problem. The
>> > problem occurs with UIMA-AS: I call sendCAS() on the
>> > UimaAsynchronousEngine, and when the remote UIMA-AS service tries to
>> > deserialize the CAS in deserializeCasFromXmi(), I get the mentioned
>> > exception
>> > "org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 571;
>> > Character reference "&#"
>> >
>> > In my case I don't read any file and don't use
>> > FileSystemCollectionReader. The user enters the text, the text is stored
>> > in a Java String (UTF-16) and set on the CAS that will be processed,
>> > using setDocumentLanguage; then I send the CAS.
>> >
>> > 2016-12-13 15:10 GMT-05:00, Burn Lewis <burnlewis@gmail.com>:
>> >> I put these 3 characters as UTF-8 in a file in examples/data and ran
>> >> the MeetingDetector annotator as described in section 3.4 of the
>> >> README, adding the option "-o out".  In that folder I found the
>> >> returned results in xmi format with the characters in the sofaString
>> >> element.  The relevant part of this file in hex is:
>> >>
>> >> 000002e0: 7472 696e 673d 22*f0 96a6 80f0 96a6 90ef*  tring=".........
>> >> 000002f0: *bfbd* 2623 3130 3b22 2f3e 3c63 6173 3a56  ..&#10;"/><cas:V
>> >>
>> >> Note that the FileSystemCollectionReader by default uses the system
>> >> encoding but you could add a ConfigurationParameterSetting of UTF-8
>> >> for the Encoding parameter in its descriptor.
>> >>
>> >> With the client & server on different (Linux) machines I see no
>> >> problem
>> >> with sending UTF-8 characters.
>> >>
>> >>
>> >> On Mon, Dec 12, 2016 at 2:15 PM, Marshall Schor <msa@schor.com> wrote:
>> >>
>> >>> another question:  I assume there are perhaps 2 machines involved
>> >>> here (it's a UIMA-AS setup).
>> >>>
>> >>> From the exception, it appears that the error happens when the client
>> >>> sends the CAS to the remote.
>> >>>
>> >>> Can you print out the Linux (assuming that's the OS) default locale
>> >>> for both machines?  (e.g. type "locale" into a command line and see
>> >>> what each machine has as its default character encoding).
>> >>>
>> >>> Please let us know what these are.
>> >>>
>> >>> Thanks. -Marshall
>> >>>
>> >>>
>> >>>
>> >>> On 12/12/2016 1:58 PM, nelson rivera wrote:
>> >>>> Yes, these are the values of the troublesome characters. Using
>> >>>> Integer.toHexString() to print out each byte shows
>> >>>>
>> >>>> fffffff0 ffffff96 ffffffa6 ffffff80
>> >>>>
>> >>>> fffffff0 ffffff96 ffffffa6 ffffff90
>> >>>>
>> >>>> ffffffef ffffffbf ffffffbd
>> >>>>
>> >>>> ffffffef ffffffbf ffffffbd
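
The leading `ffffff` in the output above comes from sign extension: a Java `byte` is signed, so a value such as 0xF0 becomes a negative `int` before `Integer.toHexString()` formats it. A minimal sketch of printing the bytes without that prefix, masking each byte with `0xFF` (the class and method names are hypothetical):

```java
import java.nio.charset.StandardCharsets;

public class HexBytes {
    // Format each byte as two uppercase hex digits, separated by spaces.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Masking with 0xFF keeps only the low 8 bits of the widened int.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = new String(Character.toChars(0x16980));
        System.out.println(Integer.toHexString((byte) 0xF0));           // fffffff0
        System.out.println(toHex(s.getBytes(StandardCharsets.UTF_8))); // F0 96 A6 80
    }
}
```
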
>> >>>>
>> >>>> 2016-12-12 11:35 GMT-05:00, Marshall Schor <msa@schor.com>:
>> >>>>> Hi Nelson,
>> >>>>>
>> >>>>> Looking into this... Can you please confirm that the UTF-8 coding
>> >>>>> of the troublesome characters, in hexadecimal, is:
>> >>>>>
>> >>>>> F0 96 A6 80
>> >>>>>
>> >>>>> F0 96 A6 90
>> >>>>>
>> >>>>> EF BF BD
>> >>>>>
>> >>>>> EF BF BD
>> >>>>>
>> >>>>> If you have the string in Java, please try converting it to a UTF-8
>> >>>>> string using something like:
>> >>>>>   byte[] theBytes = myTestString.getBytes("UTF-8");
>> >>>>>
>> >>>>>   and then print out theBytes in hex; they should look like the
>> >>>>> above. If not, please let us know what the values are instead.
>> >>>>>
>> >>>>>
>> >>>>> Thanks. -Marshall
>> >>>>>
>> >>>>>
>> >>>>> On 12/9/2016 9:02 AM, nelson rivera wrote:
>> >>>>>> Hi, I read your explanation and looked at the link, but in my
>> >>>>>> case I don't read any XML file. I just copy the text, get a new
>> >>>>>> input CAS from UimaAsynchronousEngine with getCAS(), set the text
>> >>>>>> in the CAS, and send the request with sendCAS(). I use the UIMA-AS
>> >>>>>> API 2.9.0 on the client side. Apparently the characters are
>> >>>>>> replaced by their corresponding entities when the CAS is
>> >>>>>> serialized for sending, but I get the mentioned exception
>> >>>>>> "org.xml.sax.SAXParseException; lineNumber: 1;
>> >>>>>> columnNumber: 571; Character reference "&#"
>> >>>>>> in the installed UIMA-AS framework when it tries to deserialize
>> >>>>>> the CAS in deserializeCasFromXmi() so the service can process it.
>> >>>>>>
>> >>>>>> 2016-12-08 16:48 GMT-05:00, Marshall Schor <msa@schor.com>:
>> >>>>>>> Hi Nelson,
>> >>>>>>>
>> >>>>>>> I can't see the characters (sorry).
>> >>>>>>>
>> >>>>>>> This might be an issue caused by a discrepancy between the
>> >>>>>>> coding of the file being read, and the coding indicated on the
>> >>>>>>> xml header.  Can you check that those two things are the same?
>> >>>>>>>
>> >>>>>>> See
>> >>>>>>> http://stackoverflow.com/questions/5165347/what-use-is-the-encoding-in-the-xml-header
>> >>>>>>> for example.
>> >>>>>>>
>> >>>>>>> -Marshall
>> >>>>>>>
>> >>>>>>> On 12/8/2016 4:20 PM, nelson rivera wrote:
>> >>>>>>>> I tried to process the following text in a service deployed in
>> >>>>>>>> UIMA-AS, because it is input to my application. This is the
>> >>>>>>>> text: 𖦀  𖦐  � �.
>> >>>>>>>> These characters correspond to the Bamun language, and
>> >>>>>>>> apparently are not invalid XML characters, because tools such
>> >>>>>>>> as browsers interpret and display them. After getting a new
>> >>>>>>>> input CAS for processing, setting the text, and sending the
>> >>>>>>>> request, I get the exception shown below in UIMA-AS. The
>> >>>>>>>> UIMA-AS framework keeps working and recovers correctly; it just
>> >>>>>>>> does not process these characters.
>> >>>>>>>> Could you tell me what happens with these characters? Is one
>> >>>>>>>> of them an invalid character for the UIMA-AS framework?
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> 04:00:31.606 - 14:
>> >>>>>>>> org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handleProcessRequestFromRemoteClient:
>> >>>>>>>> WARNING:
>> >>>>>>>> org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 571;
>> >>>>>>>> Character reference "&#
>> >>>>>>>>         at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1239)
>> >>>>>>>>         at org.apache.uima.aae.UimaSerializer.deserializeCasFromXmi(UimaSerializer.java:187)
>> >>>>>>>>         at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.deserializeCASandRegisterWithCache(ProcessRequestHandler_impl.java:222)
>> >>>>>>>>         at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handleProcessRequestFromRemoteClient(ProcessRequestHandler_impl.java:552)
>> >>>>>>>>         at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handle(ProcessRequestHandler_impl.java:1090)
>> >>>>>>>>         at org.apache.uima.aae.handler.input.MetadataRequestHandler_impl.handle(MetadataRequestHandler_impl.java:78)
>> >>>>>>>>         at org.apache.uima.adapter.jms.activemq.JmsInputChannel.onMessage(JmsInputChannel.java:731)
>> >>>
>>
>>
>
