uima-user mailing list archives

From nelson rivera <nelsonriver...@gmail.com>
Subject Re: Proccesing Bamun characters
Date Mon, 19 Dec 2016 15:03:46 GMT
I understand, and yes, these characters should not appear in the
serialized cas, but they appear using
XmiCasSerializer.serialize(cas.getCas(), outStream):

...<cas:Sofa xmi:id="1" sofaNum="1" sofaID="_InitialView"
mimeType="text" sofaString="&#55322;&#56704;  &#55322;&#56720;  �
�"/>...

My application does not use FileSystemCollectionReader.
The user enters the text, which is stored in a Java String
(UTF-16) and set on the CAS to be processed, using
setDocumentLanguage; then I send the CAS.

2016-12-18 23:06 GMT-05:00, Burn Lewis <burnlewis@gmail.com>:
> Since these characters are above xFFFF, the limit of a single UTF-16 code
> unit, they are represented as 2 UTF-16 code units with high & low surrogate
> prefixes.  So 55322 + 56704 are xD81A + xDD80, and after removing the 6-bit
> surrogate prefixes (the top bits of D8 & DC) we have 2 10-bit numbers,
> x1A + x180, which combine as x6980; after adding 2^16 (since only
> characters at or above this need surrogate pairs) we have the expected
> x16980.
> So one mystery is: their appearance in the CAS with the &# notation.  When
> I dump the CAS in the FileSystemCollectionReader I see the UTF-8 character,
> e.g. in hex  f096 a680 f096 a690.
> What collection reader are you using?
>
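The surrogate arithmetic described above can be checked with a short Java sketch. The class and method names here are made up for illustration; only Character.toCodePoint is a JDK call, used as a cross-check.

```java
class SurrogateDemo {
    /** Combine a UTF-16 surrogate pair into one code point by hand. */
    public static int combine(char high, char low) {
        int hi10 = high - 0xD800;  // strip the 110110 prefix, leaving 10 bits
        int lo10 = low - 0xDC00;   // strip the 110111 prefix, leaving 10 bits
        return ((hi10 << 10) | lo10) + 0x10000;  // join the halves, then add 2^16
    }

    public static void main(String[] args) {
        char high = (char) 55322;  // xD81A, from the XMI above
        char low = (char) 56704;   // xDD80, from the XMI above
        int byHand = combine(high, low);
        System.out.println(Integer.toHexString(byHand));               // prints 16980
        System.out.println(byHand == Character.toCodePoint(high, low)); // prints true
    }
}
```

The second pair in the XMI, 55322 + 56720, works out the same way to x16990.
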
> On Fri, Dec 16, 2016 at 5:45 PM, nelson rivera <nelsonrivera12@gmail.com>
> wrote:
>
>> This is the CAS serialized to XMI before it is sent to the UIMA-AS service,
>> using XmiCasSerializer.serialize(cas.getCas(), outStream).
>> The representation of the characters in this serialization does not
>> match the representation of the problem characters. What gets
>> serialized are numeric escape sequences for the Bamum
>> characters, two per character.
>> Why would this happen? Any suggestions?
>>
>> <?xml version="1.0" encoding="UTF-8"?><xmi:XMI
>> xmlns:cas="http:///uima/cas.ecore" xmlns:xmi="http://www.omg.org/XMI"
>> xmlns:pln="http:///cu/datys/xinetica/uima/api/pln.ecore"
>> xmlns:tcas="http:///uima/tcas.ecore"
>> xmlns:api="http:///cu/datys/xinetica/uima/api.ecore"
>> xmi:version="2.0"><cas:NULL xmi:id="0"/><tcas:DocumentAnnotation
>> xmi:id="8" sofa="1" begin="0" end="12"
>> language="x-unspecified"/><cas:Sofa xmi:id="1" sofaNum="1"
>> sofaID="_InitialView" mimeType="text" sofaString="&#55322;&#56704;
>> &#55322;&#56720;  �  �"/><cas:View sofa="1" members="8"/></xmi:XMI>
>>
>>
>> 2016-12-16 14:06 GMT-05:00, Burn Lewis <burnlewis@gmail.com>:
>> > Sorry, I missed the supplement set.  So the tests I did with x16980 &
>> > x16990 are valid.  runRemoteAsyncAE uses the same
>> > FileSystemCollectionReader as runAE does ... did you use a different
>> > collection reader?  If a custom one, perhaps you could serialize the CAS
>> > to a file as XMI and verify that the XMI is legal.
>> >
>> > On Fri, Dec 16, 2016 at 8:37 AM, nelson rivera <nelsonrivera12@gmail.com>
>> > wrote:
>> >
>> >> According to Wikipedia, the Bamum script
>> >> (https://en.wikipedia.org/wiki/Bamum_script) has another
>> >> valid range, U+16800–U+16A3F; any of these characters generates the
>> >> same log trace. I will continue testing Marshall Schor's
>> >> suggestion.
>> >>
>> >> 2016-12-14 18:07 GMT-05:00, Burn Lewis <burnlewis@gmail.com>:
>> >> > I think there's another problem ... the characters we have tested with
>> >> > are not in the Bamum unicode set.  The first 2 that Marshall listed in
>> >> > utf-8 (F0 96 A6 80 & F0 96 A6 90) are in hex x16980 & x16990 and the
>> >> > 3rd (EF BF BD) is xFFFD.  This last one is the "replacement character"
>> >> > used when an illegal character is encountered.  According to Wikipedia
>> >> > the 88 Bamum characters are in the range xA6A0 - xA6F7.
>> >> >
>> >> > In order to reproduce your problem we need to use the same codepoints.
>> >> > Can you tell us what the hex values of the failing characters are, in
>> >> > UTF-8 or UTF-16?
>> >> >
>> >> > By the way, the test I ran was using UIMA-AS's runRemoteAsyncAE, not
>> >> > runAE, following the quick test described in the UIMA-AS README.
>> >> >
>> >> > On Wed, Dec 14, 2016 at 4:15 PM, Marshall Schor <msa@schor.com> wrote:
>> >> >
>> >> >> Maybe we've been on the wrong line of thinking.
>> >> >>
>> >> >> Perhaps the translation between UTF-8 (during transportation) and the
>> >> >> string characters is fine, but the XML parsing is restricting the
>> >> >> character set it uses.
>> >> >>
>> >> >> See https://en.wikipedia.org/wiki/Valid_characters_in_XML
>> >> >>
>> >> >> where it says valid XML characters exclude the "surrogates", which I
>> >> >> think your characters are.
>> >> >>
>> >> >> So, perhaps it's XML parsing which is complaining (and it appears this
>> >> >> is so, from the stack trace).
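The XML 1.0 rule behind this hypothesis can be sketched in Java. This is only an illustration of the Char production in the XML 1.0 specification, with made-up class and method names; it does not show how UIMA's serializer or the Xerces parser is implemented. A character reference is legal only if the number it names is a valid XML character, so a reference to a lone surrogate such as &#55322; is rejected, while a reference to the whole code point (&#92544; for x16980) is allowed.

```java
class XmlCharCheck {
    /** True if the code point matches the Char production of XML 1.0. */
    public static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)      // stops just below the surrogates
            || (cp >= 0xE000 && cp <= 0xFFFD)    // resumes just above them
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /**
     * Write non-ASCII text as decimal character references, one per code
     * point (not one per UTF-16 char). Markup characters such as '<' and '&'
     * are NOT escaped here; this sketch only covers the non-ASCII case.
     */
    public static String toCharRefs(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                sb.appendCodePoint(cp);
            } else {
                sb.append("&#").append(cp).append(';');
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(isXml10Char(55322));          // prints false (xD81A is a surrogate)
        System.out.println(isXml10Char(0x16980));        // prints true
        System.out.println(toCharRefs("\uD81A\uDD80"));  // prints &#92544;
    }
}
```
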
>> >> >>
>> >> >> We should point out that UIMA's character offsets (like begin and end)
>> >> >> were designed with Java String character offsets, and will perhaps not
>> >> >> work correctly when surrogates are being used.
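A small Java sketch of the offset point: Java String offsets count UTF-16 code units, so each supplementary character takes two positions. The class and helper names are hypothetical; only the String methods are JDK calls.

```java
class OffsetDemo {
    /** Java char offset at which the code point with the given index starts. */
    public static int charOffset(String s, int codePointIndex) {
        return s.offsetByCodePoints(0, codePointIndex);
    }

    public static void main(String[] args) {
        String text = "\uD81A\uDD80 \uD81A\uDD90";  // U+16980, a space, U+16990
        System.out.println(text.length());                          // prints 5 (UTF-16 code units)
        System.out.println(text.codePointCount(0, text.length()));  // prints 3 (code points)
        System.out.println(charOffset(text, 2));                    // prints 3 (start of U+16990)
    }
}
```

An annotation covering the second character would therefore span Java offsets 3 to 5, not 2 to 3.
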
>> >> >>
>> >> >> A possible workaround for this particular issue may be to switch to
>> >> >> binary serialization, instead of XMI serialization. This has a
>> >> >> restriction in that the type systems must be identical (between the
>> >> >> client and server).
>> >> >>
>> >> >> We could possibly get more confirmation of this hypothesis if you could
>> >> >> say what the stack trace was, beyond the first bit which you stated in
>> >> >> your original note.  There should be more stack trace information,
>> >> >> further down, starting with "caused by ..." which may provide more
>> >> >> helpful information.
>> >> >>
>> >> >> -Marshall
>> >> >>
>> >> >>
>> >> >> On 12/14/2016 9:38 AM, nelson rivera wrote:
>> >> >> > We also ran that test with the UIMA framework and the RunAE tool,
>> >> >> > with the characters in a file as you did, and indeed there is no
>> >> >> > problem. The problem occurs with UIMA-AS: I call sendCAS() on
>> >> >> > UimaAsynchronousEngine, and when the remote UIMA-AS service tries to
>> >> >> > deserialize the CAS in deserializeCasFromXmi(), I get the mentioned
>> >> >> > exception "org.xml.sax.SAXParseException; lineNumber: 1;
>> >> >> > columnNumber: 571; Character reference "&#"
>> >> >> >
>> >> >> > In my case I don't read any file and do not use
>> >> >> > FileSystemCollectionReader.
>> >> >> > The user enters the text, which is stored in a Java String
>> >> >> > (UTF-16) and set on the CAS to be processed, using
>> >> >> > setDocumentLanguage; then I send the CAS.
>> >> >> >
>> >> >> > 2016-12-13 15:10 GMT-05:00, Burn Lewis <burnlewis@gmail.com>:
>> >> >> >> I put these 3 characters as UTF-8 in a file in examples/data and ran
>> >> >> >> the MeetingDetector annotator as described in section 3.4 of the
>> >> >> >> README, adding the option "-o out".  In that folder I found the
>> >> >> >> returned results in xmi format with the characters in the sofaString
>> >> >> >> element.  The relevant part of this file in hex is:
>> >> >> >>
>> >> >> >> 000002e0: 7472 696e 673d 22*f0 96a6 80f0 96a6 90ef*  tring=".........
>> >> >> >> 000002f0: *bfbd* 2623 3130 3b22 2f3e 3c63 6173 3a56  ..&#10;"/><cas:V
>> >> >> >>
>> >> >> >> Note that the FileSystemCollectionReader by default uses the system
>> >> >> >> encoding but you could add a ConfigurationParameterSetting of UTF-8
>> >> >> >> for the Encoding parameter in its descriptor.
>> >> >> >>
>> >> >> >> With the client & server on different (Linux) machines I see no
>> >> >> >> problem with sending UTF-8 characters.
>> >> >> >>
>> >> >> >>
>> >> >> >> On Mon, Dec 12, 2016 at 2:15 PM, Marshall Schor <msa@schor.com> wrote:
>> >> >> >>
>> >> >> >>> another question:  I assume there are perhaps 2 machines involved
>> >> >> >>> here (it's a UIMA-AS setup).
>> >> >> >>>
>> >> >> >>> From the exception, it appears that the error happens when the client
>> >> >> >>> sends the CAS to the remote.
>> >> >> >>>
>> >> >> >>> Can you print out the Linux (assuming that's the OS) default locale
>> >> >> >>> for both machines?  (e.g. type "locale" into a command line and see
>> >> >> >>> what each machine has as its default character encoding).
>> >> >> >>>
>> >> >> >>> Please let us know what these are.
>> >> >> >>>
>> >> >> >>> Thanks. -Marshall
>> >> >> >>>
>> >> >> >>>
>> >> >> >>>
>> >> >> >>> On 12/12/2016 1:58 PM, nelson rivera wrote:
>> >> >> >>>> Yes, these are the values of the troublesome characters. Using
>> >> >> >>>> Integer.toHexString() to print out each byte shows:
>> >> >> >>>>
>> >> >> >>>> fffffff0 ffffff96 ffffffa6 ffffff80
>> >> >> >>>>
>> >> >> >>>> fffffff0 ffffff96 ffffffa6 ffffff90
>> >> >> >>>>
>> >> >> >>>> ffffffef ffffffbf ffffffbd
>> >> >> >>>>
>> >> >> >>>> ffffffef ffffffbf ffffffbd
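The ffffff prefixes in the output above come from sign extension: a Java byte is signed, so passing it to Integer.toHexString() widens a value such as xF0 to xFFFFFFF0. The low two hex digits are the actual UTF-8 bytes. A minimal sketch that masks each byte before printing (class and method names are made up for illustration):

```java
class HexBytes {
    /** UTF-8 bytes of s as two-digit lowercase hex values separated by spaces. */
    public static String toHex(String s) {
        byte[] bytes = s.getBytes(java.nio.charset.StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // b & 0xFF keeps only the low 8 bits, undoing the sign extension
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex("\uD81A\uDD80"));  // prints f0 96 a6 80
        System.out.println(toHex("\uFFFD"));        // prints ef bf bd
    }
}
```
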
>> >> >> >>>>
>> >> >> >>>> 2016-12-12 11:35 GMT-05:00, Marshall Schor <msa@schor.com>:
>> >> >> >>>>> Hi Nelson,
>> >> >> >>>>>
>> >> >> >>>>> Looking into this... Can you please confirm that the UTF-8 coding
>> >> >> >>>>> of the troublesome characters, in hexadecimal, is:
>> >> >> >>>>>
>> >> >> >>>>> F0 96 A6 80
>> >> >> >>>>>
>> >> >> >>>>> F0 96 A6 90
>> >> >> >>>>>
>> >> >> >>>>> EF BF BD
>> >> >> >>>>>
>> >> >> >>>>> EF BF BD
>> >> >> >>>>>
>> >> >> >>>>> If you have the string in Java, please try converting it to a UTF-8
>> >> >> >>>>> string using something like:
>> >> >> >>>>>   byte[] theBytes = myTestString.getBytes("UTF-8");
>> >> >> >>>>>
>> >> >> >>>>>   and then print out theBytes in hex; they should look like the
>> >> >> >>>>> above.  If not, please let us know what the values are instead.
>> >> >> >>>>>
>> >> >> >>>>>
>> >> >> >>>>> Thanks. -Marshall
>> >> >> >>>>>
>> >> >> >>>>>
>> >> >> >>>>> On 12/9/2016 9:02 AM, nelson rivera wrote:
>> >> >> >>>>>> Hi, I read your explanation and looked at the link, but in my case
>> >> >> >>>>>> I don't read any XML file. I just copy the text, get a new input
>> >> >> >>>>>> CAS from UimaAsynchronousEngine with getCAS(), set the text in the
>> >> >> >>>>>> CAS and send the request with sendCAS(). I use the UIMA-AS API
>> >> >> >>>>>> 2.9.0 on the client side. Apparently the characters are replaced
>> >> >> >>>>>> by their corresponding entities when the CAS is serialized for
>> >> >> >>>>>> sending, but I get the mentioned exception
>> >> >> >>>>>> "org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 571;
>> >> >> >>>>>> Character reference "&#"
>> >> >> >>>>>> in the installed UIMA-AS framework when it tries to deserialize
>> >> >> >>>>>> the CAS in deserializeCasFromXmi() so the service can process it.
>> >> >> >>>>>>
>> >> >> >>>>>> 2016-12-08 16:48 GMT-05:00, Marshall Schor <msa@schor.com>:
>> >> >> >>>>>>> Hi Nelson,
>> >> >> >>>>>>>
>> >> >> >>>>>>> I can't see the characters (sorry).
>> >> >> >>>>>>>
>> >> >> >>>>>>> This might be an issue caused by a discrepancy between the coding
>> >> >> >>>>>>> of the file being read, and the coding indicated on the xml
>> >> >> >>>>>>> header.  Can you check that those two things are the same?
>> >> >> >>>>>>>
>> >> >> >>>>>>> See
>> >> >> >>>>>>> http://stackoverflow.com/questions/5165347/what-use-is-the-encoding-in-the-xml-header
>> >> >> >>>>>>> for example.
>> >> >> >>>>>>>
>> >> >> >>>>>>> -Marshall
>> >> >> >>>>>>>
>> >> >> >>>>>>> On 12/8/2016 4:20 PM, nelson rivera wrote:
>> >> >> >>>>>>>> I tried to process the following text in a service deployed in
>> >> >> >>>>>>>> UIMA-AS, because it is input to my application. This is the
>> >> >> >>>>>>>> text: 𖦀 𖦐 � �.
>> >> >> >>>>>>>> These characters correspond to the Bamum language, and apparently
>> >> >> >>>>>>>> they are not invalid XML characters, because tools such as
>> >> >> >>>>>>>> browsers interpret and display them. After getting a new input
>> >> >> >>>>>>>> CAS for processing, setting the text and sending the request, I
>> >> >> >>>>>>>> get the exception shown below in UIMA-AS. The UIMA-AS framework
>> >> >> >>>>>>>> keeps working and recovers correctly; it just does not process
>> >> >> >>>>>>>> these characters.
>> >> >> >>>>>>>> Could you tell me what happens with these characters? Is one of
>> >> >> >>>>>>>> them an invalid character for the UIMA-AS framework?
>> >> >> >>>>>>>>
>> >> >> >>>>>>>>
>> >> >> >>>>>>>>
>> >> >> >>>>>>>> 04:00:31.606 - 14:
>> >> >> >>>>>>>> org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handleProcessRequestFromRemoteClient:
>> >> >> >>>>>>>> WARNING:
>> >> >> >>>>>>>> org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 571;
>> >> >> >>>>>>>> Character reference "&#
>> >> >> >>>>>>>>         at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1239)
>> >> >> >>>>>>>>         at org.apache.uima.aae.UimaSerializer.deserializeCasFromXmi(UimaSerializer.java:187)
>> >> >> >>>>>>>>         at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.deserializeCASandRegisterWithCache(ProcessRequestHandler_impl.java:222)
>> >> >> >>>>>>>>         at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handleProcessRequestFromRemoteClient(ProcessRequestHandler_impl.java:552)
>> >> >> >>>>>>>>         at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handle(ProcessRequestHandler_impl.java:1090)
>> >> >> >>>>>>>>         at org.apache.uima.aae.handler.input.MetadataRequestHandler_impl.handle(MetadataRequestHandler_impl.java:78)
>> >> >> >>>>>>>>         at org.apache.uima.adapter.jms.activemq.JmsInputChannel.onMessage(JmsInputChannel.java:731)
