uima-user mailing list archives

From Burn Lewis <burnle...@gmail.com>
Subject Re: Proccesing Bamun characters
Date Mon, 19 Dec 2016 04:06:17 GMT
Since these characters are above the 16-bit range of the Basic Multilingual
Plane, they are represented as 2 UTF-16 code units with high & low surrogate
prefixes.  So 55322 + 56704 are xD81A + xDD80, and after removing the 6-bit
surrogate prefixes (the top bits of xD800 & xDC00) we have 2 10-bit numbers,
x1A + x180, which combine as (x1A << 10) + x180 = x6980.  After adding
2^16 = x10000 (since only characters above this need surrogate pairs) we
have the expected x16980.
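
The arithmetic above can be checked with a few lines of Java.  This is only
a sketch: the names SurrogateMath and combine are made up for illustration,
while Character.toCodePoint is the standard JDK method that performs the same
computation.

```java
// Illustrative only: SurrogateMath and combine() are hypothetical names.
final class SurrogateMath {

    // Combine a high and a low surrogate into one code point by hand.
    static int combine(int high, int low) {
        int hi10 = high - 0xD800;   // strip the high-surrogate prefix: 0xD81A -> 0x1A
        int lo10 = low - 0xDC00;    // strip the low-surrogate prefix:  0xDD80 -> 0x180
        return ((hi10 << 10) | lo10) + 0x10000; // join the 10-bit halves, then add 2^16
    }

    public static void main(String[] args) {
        int byHand = combine(55322, 56704);
        int byJdk = Character.toCodePoint((char) 55322, (char) 56704);
        System.out.println(Integer.toHexString(byHand)); // prints 16980
        System.out.println(Integer.toHexString(byJdk));  // prints 16980
    }
}
```

The second pair in the XMI, 55322 + 56720, gives x16990 the same way.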
So one mystery is: their appearance in the CAS with the &# notation.  When
I dump the CAS in the FileSystemCollectionReader I see the UTF-8 character,
e.g. in hex  f096 a680 f096 a690.
What collection reader are you using?
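
For anyone trying to reproduce this, here is a small sketch that prints the
UTF-8 bytes of x16980 and checks how an XML parser treats the two notations.
The class and method names (CharRefCheck, utf8Hex, parses) are made up, and
the sketch assumes the default JAXP parser that ships with the JDK.  Under the
XML 1.0 character rules a numeric reference to a lone surrogate such as
&#55322; is not a legal character, whereas a single reference to the whole
code point, &#x16980;, is.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative only: CharRefCheck, utf8Hex() and parses() are hypothetical names.
final class CharRefCheck {

    // Hex dump of the UTF-8 encoding; "& 0xFF" avoids sign-extended output like ffffff96.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    // True if the default JAXP parser accepts the document, false on a parse error.
    static boolean parses(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler()); // rethrows fatal errors without printing them
            builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String bamum = new String(Character.toChars(0x16980));
        System.out.println(utf8Hex(bamum));                          // prints f096a680
        System.out.println(parses("<s v=\"&#x16980;\"/>"));          // one reference to the code point
        System.out.println(parses("<s v=\"&#55322;&#56704;\"/>"));   // two references to surrogates
    }
}
```

If the second parse fails while the first succeeds, that would be consistent
with the SAXParseException reported on the service side.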

On Fri, Dec 16, 2016 at 5:45 PM, nelson rivera <nelsonrivera12@gmail.com>
wrote:

> This is the CAS serialized to XMI before sending it to the uima-as service,
> serialized with XmiCasSerializer.serialize(cas.getCas(), outStream).
> The representation of the characters in this serialization does not
> match the representation of the characters with problems. What gets
> serialized are numeric escape sequences for the Bamum characters, two
> escape sequences for each character.
> Why can this happen? Any suggestions?
>
> <?xml version="1.0" encoding="UTF-8"?><xmi:XMI
> xmlns:cas="http:///uima/cas.ecore" xmlns:xmi="http://www.omg.org/XMI"
> xmlns:pln="http:///cu/datys/xinetica/uima/api/pln.ecore"
> xmlns:tcas="http:///uima/tcas.ecore"
> xmlns:api="http:///cu/datys/xinetica/uima/api.ecore"
> xmi:version="2.0"><cas:NULL xmi:id="0"/><tcas:DocumentAnnotation
> xmi:id="8" sofa="1" begin="0" end="12"
> language="x-unspecified"/><cas:Sofa xmi:id="1" sofaNum="1"
> sofaID="_InitialView" mimeType="text" sofaString="&#55322;&#56704;
> &#55322;&#56720;  �  �"/><cas:View sofa="1" members="8"/></xmi:XMI>
>
>
> 2016-12-16 14:06 GMT-05:00, Burn Lewis <burnlewis@gmail.com>:
> > Sorry, I missed the supplement set.  So the tests I did with x16980 &
> > x16990 are valid.  runRemoteAsyncAE uses the same
> > FileSystemCollectionReader as runAE does ... did you use a different
> > collection reader?  If it is a custom one, perhaps you could serialize the
> > cas to a file as XMI and verify that the XMI is legal.
> >
> > On Fri, Dec 16, 2016 at 8:37 AM, nelson rivera <nelsonrivera12@gmail.com>
> > wrote:
> >
> >> According to Wikipedia, the Bamum script
> >> (https://en.wikipedia.org/wiki/Bamum_script) has another valid
> >> range, U+16800–U+16A3F. Any of these characters generates the
> >> same log trace. I will continue to test Marshall Schor's
> >> suggestion.
> >>
> >> 2016-12-14 18:07 GMT-05:00, Burn Lewis <burnlewis@gmail.com>:
> >> > I think there's another problem ... the characters we have tested with
> >> > are not in the Bamum unicode set.  The first 2 that Marshall listed in
> >> > utf-8 (F0 96 A6 80 & F0 96 A6 90) are in hex x16980 & x16990 and the
> >> > 3rd (EF BF BD) is xFFFD.  This last one is the "replacement character"
> >> > used when an illegal character is encountered.  According to Wikipedia
> >> > the 88 Bamum characters are in the range xA6A0 - xA6F7.
> >> >
> >> > In order to reproduce your problem we need to use the same codepoints.
> >> > Can you tell us what the hex values of the failing characters are, in
> >> > UTF-8 or UTF-16?
> >> >
> >> > By the way, the test I ran was using UIMA-AS's runRemoteAsyncAE, not
> >> > runAE, following the quick test described in the UIMA-AS README.
> >> >
> >> > On Wed, Dec 14, 2016 at 4:15 PM, Marshall Schor <msa@schor.com> wrote:
> >> >
> >> >> Maybe we've been on the wrong line of thinking.
> >> >>
> >> >> Perhaps the translation between UTF-8 (during transportation) and the
> >> >> string characters is fine, but the XML parsing is restricting the
> >> >> character set it uses.
> >> >>
> >> >> See https://en.wikipedia.org/wiki/Valid_characters_in_XML
> >> >>
> >> >> where it says valid xml characters exclude the "surrogates", which I
> >> >> think your characters are.
> >> >>
> >> >> So, perhaps it's XML parsing which is complaining (and it appears this
> >> >> is so, from the stack trace).
> >> >>
> >> >> We should point out that UIMA's character offsets (like begin and end)
> >> >> were designed with Java String character offsets, and will perhaps not
> >> >> work correctly when surrogates are being used.
> >> >>
> >> >> A possible workaround for this particular issue may be to switch to
> >> >> binary serialization, instead of xmi serialization. This has a
> >> >> restriction in that the type systems must be identical (between the
> >> >> client and server).
> >> >>
> >> >> We could possibly get more confirmation of this hypothesis if you
> >> >> could say what the stack trace was, beyond the first bit which you
> >> >> stated in your original note.  There should be more stack trace
> >> >> information, further down, starting with "caused by ..." which may
> >> >> provide more helpful information.
> >> >>
> >> >> -Marshall
> >> >>
> >> >>
> >> >> On 12/14/2016 9:38 AM, nelson rivera wrote:
> >> >> > We also ran that test with the uima framework and the RunAE tool, with
> >> >> > the characters in a file as you did, and indeed there is no problem.
> >> >> > The problem occurs with uima-as: I use sendCAS() with
> >> >> > UimaAsynchronousEngine, and when the remote uima-as service tries to
> >> >> > deserialize the cas in deserializeCasFromXmi(), I get the mentioned
> >> >> > exception
> >> >> > "org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 571;
> >> >> > Character reference "&#"
> >> >> >
> >> >> > In my case I don't read any file and don't use
> >> >> > FileSystemCollectionReader.
> >> >> > The user enters the text, the text is stored in a Java string
> >> >> > (utf-16) and it is set in the cas that will be processed, using
> >> >> > setDocumentLanguage, then I send the cas.
> >> >> >
> >> >> > 2016-12-13 15:10 GMT-05:00, Burn Lewis <burnlewis@gmail.com>:
> >> >> >> I put these 3 characters as UTF-8 in a file in examples/data and ran
> >> >> >> the MeetingDetector annotator as described in section 3.4 of the
> >> >> >> README, adding the option "-o out".  In that folder I found the
> >> >> >> returned results in xmi format with the characters in the sofaString
> >> >> >> element.  The relevant part of this file in hex is:
> >> >> >>
> >> >> >> 000002e0: 7472 696e 673d 22*f0 96a6 80f0 96a6 90ef*  tring=".........
> >> >> >> 000002f0: *bfbd* 2623 3130 3b22 2f3e 3c63 6173 3a56  ..&#10;"/><cas:V
> >> >> >>
> >> >> >> Note that the FileSystemCollectionReader by default uses the system
> >> >> >> encoding but you could add a ConfigurationParameterSetting of UTF-8
> >> >> >> for the Encoding parameter in its descriptor.
> >> >> >>
> >> >> >> With the client & server on different (Linux) machines I see no
> >> >> >> problem with sending UTF-8 characters.
> >> >> >>
> >> >> >>
> >> >> >> On Mon, Dec 12, 2016 at 2:15 PM, Marshall Schor <msa@schor.com> wrote:
> >> >> >>
> >> >> >>> Another question:  I assume there are perhaps 2 machines involved
> >> >> >>> here (it's a UIMA-AS setup).
> >> >> >>>
> >> >> >>> From the exception, it appears that the error happens when the
> >> >> >>> client sends the CAS to the remote.
> >> >> >>>
> >> >> >>> Can you print out the Linux (assuming that's the OS) default locale
> >> >> >>> for both machines?  (e.g. type "locale" into a command line and see
> >> >> >>> what each machine has as its default character encoding).
> >> >> >>>
> >> >> >>> Please let us know what these are.
> >> >> >>>
> >> >> >>> Thanks. -Marshall
> >> >> >>>
> >> >> >>>
> >> >> >>>
> >> >> >>> On 12/12/2016 1:58 PM, nelson rivera wrote:
> >> >> >>>> Yes, these are the values of the troublesome characters. Using
> >> >> >>>> Integer.toHexString() to print out each byte shows
> >> >> >>>>
> >> >> >>>> fffffff0 ffffff96 ffffffa6 ffffff80
> >> >> >>>>
> >> >> >>>> fffffff0 ffffff96 ffffffa6 ffffff90
> >> >> >>>>
> >> >> >>>> ffffffef ffffffbf ffffffbd
> >> >> >>>>
> >> >> >>>> ffffffef ffffffbf ffffffbd
> >> >> >>>>
> >> >> >>>> 2016-12-12 11:35 GMT-05:00, Marshall Schor <msa@schor.com>:
> >> >> >>>>> Hi Nelson,
> >> >> >>>>>
> >> >> >>>>> Looking into this... Can you please confirm that the UTF-8 coding
> >> >> >>>>> of the troublesome characters, in hexadecimal, is:
> >> >> >>>>>
> >> >> >>>>> F0 96 A6 80
> >> >> >>>>>
> >> >> >>>>> F0 96 A6 90
> >> >> >>>>>
> >> >> >>>>> EF BF BD
> >> >> >>>>>
> >> >> >>>>> EF BF BD
> >> >> >>>>>
> >> >> >>>>> If you have the string in Java, please try converting it to
> >> >> >>>>> UTF-8 bytes using something like:
> >> >> >>>>>   byte[] theBytes = myTestString.getBytes("UTF-8");
> >> >> >>>>>
> >> >> >>>>>   and then print out theBytes in hex; they should look like the
> >> >> >>>>> above.  If not, please let us know what the values are instead.
> >> >> >>>>>
> >> >> >>>>>
> >> >> >>>>> Thanks. -Marshall
> >> >> >>>>>
> >> >> >>>>>
> >> >> >>>>> On 12/9/2016 9:02 AM, nelson rivera wrote:
> >> >> >>>>>> Hi, I read your explanation and looked at the link, but in my
> >> >> >>>>>> case I don't read any xml file. I just copy the text, get a new
> >> >> >>>>>> input cas from UimaAsynchronousEngine with getCAS(), set the text
> >> >> >>>>>> in the cas and send the request with sendCAS(). I use uima-as API
> >> >> >>>>>> 2.9.0 on the client side. Apparently the characters are replaced
> >> >> >>>>>> by their corresponding entities when the cas is serialized for
> >> >> >>>>>> sending, but I get the mentioned exception
> >> >> >>>>>> "org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 571;
> >> >> >>>>>> Character reference "&#"
> >> >> >>>>>> in the installed uima-as framework when it tries to deserialize
> >> >> >>>>>> the cas with deserializeCasFromXmi() so that the service can
> >> >> >>>>>> process it.
> >> >> >>>>>>
> >> >> >>>>>> 2016-12-08 16:48 GMT-05:00, Marshall Schor <msa@schor.com>:
> >> >> >>>>>>> Hi Nelson,
> >> >> >>>>>>>
> >> >> >>>>>>> I can't see the characters (sorry).
> >> >> >>>>>>>
> >> >> >>>>>>> This might be an issue caused by a discrepancy between the coding
> >> >> >>>>>>> of the file being read, and the coding indicated on the xml
> >> >> >>>>>>> header.  Can you check that those two things are the same?
> >> >> >>>>>>>
> >> >> >>>>>>> See
> >> >> >>>>>>> http://stackoverflow.com/questions/5165347/what-use-is-the-encoding-in-the-xml-header
> >> >> >>>>>>> for example.
> >> >> >>>>>>>
> >> >> >>>>>>> -Marshall
> >> >> >>>>>>>
> >> >> >>>>>>> On 12/8/2016 4:20 PM, nelson rivera wrote:
> >> >> >>>>>>>> I tried to process the following text in a service deployed in
> >> >> >>>>>>>> uima-as, because it is input to my application. This is the
> >> >> >>>>>>>> text: 𖦀 𖦐 � �.
> >> >> >>>>>>>> These characters belong to the Bamum language, and apparently
> >> >> >>>>>>>> they are not invalid xml characters, because tools such as
> >> >> >>>>>>>> browsers interpret and display them. After getting a new input
> >> >> >>>>>>>> cas for processing, setting the text and sending the request, I
> >> >> >>>>>>>> get the exception shown below in uima-as. The uima-as framework
> >> >> >>>>>>>> keeps working and recovers correctly, it just does not process
> >> >> >>>>>>>> these characters.
> >> >> >>>>>>>> Could you tell me what happens with these characters? Is one of
> >> >> >>>>>>>> them an invalid character for the uima-as framework?
> >> >> >>>>>>>>
> >> >> >>>>>>>>
> >> >> >>>>>>>>
> >> >> >>>>>>>> 04:00:31.606 - 14: org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handleProcessRequestFromRemoteClient: WARNING:
> >> >> >>>>>>>> org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 571; Character reference "&#
> >> >> >>>>>>>>         at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1239)
> >> >> >>>>>>>>         at org.apache.uima.aae.UimaSerializer.deserializeCasFromXmi(UimaSerializer.java:187)
> >> >> >>>>>>>>         at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.deserializeCASandRegisterWithCache(ProcessRequestHandler_impl.java:222)
> >> >> >>>>>>>>         at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handleProcessRequestFromRemoteClient(ProcessRequestHandler_impl.java:552)
> >> >> >>>>>>>>         at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handle(ProcessRequestHandler_impl.java:1090)
> >> >> >>>>>>>>         at org.apache.uima.aae.handler.input.MetadataRequestHandler_impl.handle(MetadataRequestHandler_impl.java:78)
> >> >> >>>>>>>>         at org.apache.uima.adapter.jms.activemq.JmsInputChannel.onMessage(JmsInputChannel.java:731)
> >> >> >>>
> >> >>
> >> >>
> >> >
> >>
> >
>
