uima-user mailing list archives

From nelson rivera <nelsonriver...@gmail.com>
Subject Re: Proccesing Bamun characters
Date Tue, 27 Dec 2016 18:39:33 GMT
After some investigation, I found the root cause. I had the Xalan
library on my classpath, and because of the ordered lookup procedure of
javax.xml.transform.TransformerFactory.newInstance(), the Xalan
implementation of TransformerFactory was being used. XmiCasSerializer
is affected too, because it uses SAXTransformerFactory, which extends
TransformerFactory.
Switching to the system-default implementation, by setting the
"javax.xml.transform.TransformerFactory" system property, gives the
expected result: when the input CAS is serialized, each Unicode
supplementary character is written as one complete entity instead of
as the entities of the two surrogate code units that represent it, so
deserialization in the uima-as service no longer fails.
In the end it seems to be a bug in the XML transform engine that was
being used.
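
For anyone who wants to try the same change, here is a minimal sketch.
It assumes an Oracle/OpenJDK runtime, where the built-in factory class
is com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl
(an internal JDK name that is not stated anywhere in this thread), and
the property has to be set before anything creates a
TransformerFactory. The class name TransformerFactoryCheck is only for
illustration.

```java
import javax.xml.transform.TransformerFactory;

public class TransformerFactoryCheck {
    // Assumption: class name of the JDK's built-in XSLTC factory (Oracle/OpenJDK).
    static final String JDK_FACTORY =
        "com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl";

    // Points the JAXP lookup at the JDK factory and returns the class actually chosen.
    static String selectJdkFactory() {
        System.setProperty("javax.xml.transform.TransformerFactory", JDK_FACTORY);
        return TransformerFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        // With Xalan on the classpath this should still print the JDK class.
        System.out.println("TransformerFactory in use: " + selectJdkFactory());
    }
}
```

The same property can also be passed on the command line with
-Djavax.xml.transform.TransformerFactory=<class name>, which avoids
touching the application code.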


<cas:NULL xmi:id="0"/><tcas:DocumentAnnotation xmi:id="8" sofa="1"
begin="0" end="12" language="x-unspecified"/><cas:Sofa xmi:id="1"
sofaNum="1" sofaID="_InitialView" mimeType="text" sofaString="&#92544;
 &#92560;  �  �"/>
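
To relate these entities to the surrogate values seen earlier in the
thread (55322/56704 and 55322/56720), the following small sketch
recombines a surrogate pair into its code point. The class and method
names are made up for the example; only Character.toCodePoint is a
real JDK call.

```java
public class SurrogateDemo {
    // Combines a high/low surrogate pair into the supplementary code point it encodes.
    static int combine(int high, int low) {
        // Strip the surrogate bases (0xD800, 0xDC00), join the two 10-bit halves, add 0x10000.
        return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
    }

    // Formats a code point as a decimal XML numeric character reference.
    static String entity(int codePoint) {
        return "&#" + codePoint + ";";
    }

    public static void main(String[] args) {
        int cp = combine(55322, 56704);
        System.out.println(Integer.toHexString(cp)); // 16980
        System.out.println(entity(cp));              // &#92544;
        // The JDK's own helper agrees with the manual arithmetic.
        System.out.println(cp == Character.toCodePoint((char) 55322, (char) 56704)); // true
    }
}
```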

2016-12-19 10:03 GMT-05:00, nelson rivera <nelsonrivera12@gmail.com>:
> I understand, and yes, these characters should not appear in the
> serialized cas, but they appear using
> XmiCasSerializer.serialize(cas.getCas(), outStream):
>
> ...<cas:Sofa xmi:id="1" sofaNum="1" sofaID="_InitialView"
> mimeType="text" sofaString="&#55322;&#56704;  &#55322;&#56720;  �
> �"/>...
>
> My application does not use FileSystemCollectionReader.
> The user enters the text, which is stored in a Java String
> (UTF-16) and set on the CAS that will be processed, using
> setDocumentLanguage; then I send the CAS.
>
> 2016-12-18 23:06 GMT-05:00, Burn Lewis <burnlewis@gmail.com>:
>> Since these characters are above the basic UTF-16 limit they are
>> represented as 2 UTF-16 code units with high & low surrogate prefixes.
>> So 55322 + 56704 are xD81A + xDD80, and after removing the 6-bit
>> surrogate prefixes of D8 & DC we have 2 10-bit numbers, x1A + x180,
>> which combine as x6980; after adding 2^16 = x10000 (since only
>> characters above this need surrogate pairs) we have the expected x16980.
>> So one mystery is their appearance in the CAS with the &# notation.
>> When I dump the CAS in the FileSystemCollectionReader I see the UTF-8
>> character, e.g. in hex  f096 a680 f096 a690.
>> What collection reader are you using?
>>
>> On Fri, Dec 16, 2016 at 5:45 PM, nelson rivera <nelsonrivera12@gmail.com>
>> wrote:
>>
>>> This is the CAS serialized to XMI before it is sent to the uima-as
>>> service, serialized with XmiCasSerializer.serialize(cas.getCas(),
>>> outStream). The representation of the characters in this
>>> serialization does not match the representation of the problem
>>> characters. What gets serialized is the numeric escape sequences for
>>> the Bamum characters, two per character (one for each surrogate).
>>> Why can this happen? Any suggestions?
>>>
>>> <?xml version="1.0" encoding="UTF-8"?><xmi:XMI
>>> xmlns:cas="http:///uima/cas.ecore" xmlns:xmi="http://www.omg.org/XMI"
>>> xmlns:pln="http:///cu/datys/xinetica/uima/api/pln.ecore"
>>> xmlns:tcas="http:///uima/tcas.ecore"
>>> xmlns:api="http:///cu/datys/xinetica/uima/api.ecore"
>>> xmi:version="2.0"><cas:NULL xmi:id="0"/><tcas:DocumentAnnotation
>>> xmi:id="8" sofa="1" begin="0" end="12"
>>> language="x-unspecified"/><cas:Sofa xmi:id="1" sofaNum="1"
>>> sofaID="_InitialView" mimeType="text" sofaString="&#55322;&#56704;
>>> &#55322;&#56720;  �  �"/><cas:View sofa="1" members="8"/></xmi:XMI>
>>>
>>>
>>> 2016-12-16 14:06 GMT-05:00, Burn Lewis <burnlewis@gmail.com>:
>>> > Sorry, I missed the supplement set.  So the tests I did with x16980 &
>>> > x16990 are valid.  runRemoteAsyncAE uses the same
>>> > FileSystemCollectionReader as runAE does ... did you use a different
>>> > collection reader?  If a custom one, perhaps you could serialize the
>>> > CAS to a file as XMI and verify that the XMI is legal.
>>> >
>>> > On Fri, Dec 16, 2016 at 8:37 AM, nelson rivera
>>> > <nelsonrivera12@gmail.com> wrote:
>>> >
>>> >> According to Wikipedia, the Bamum script
>>> >> (https://en.wikipedia.org/wiki/Bamum_script) has another valid
>>> >> range, U+16800–U+16A3F; any of these characters generates the
>>> >> same log trace. I will continue to test Marshall Schor's
>>> >> suggestion.
>>> >>
>>> >> 2016-12-14 18:07 GMT-05:00, Burn Lewis <burnlewis@gmail.com>:
>>> >> > I think there's another problem ... the characters we have tested
>>> >> > with are not in the Bamum unicode set.  The first 2 that Marshall
>>> >> > listed in utf-8 (F0 96 A6 80 & F0 96 A6 90) are in hex x16980 &
>>> >> > x16990 and the 3rd (EF BF BD) is xFFFD.  This last one is the
>>> >> > "replacement character" used when an illegal character is
>>> >> > encountered.  According to Wikipedia the 88 Bamum characters are
>>> >> > in the range xA6A0 - xA6F7.
>>> >> >
>>> >> > In order to reproduce your problem we need to use the same
>>> >> > codepoints.  Can you tell us what the hex values of the failing
>>> >> > characters are, in UTF-8 or UTF-16?
>>> >> >
>>> >> > By the way, the test I ran was using UIMA-AS's runRemoteAsyncAE,
>>> >> > not runAE, following the quick test described in the UIMA-AS README.
>>> >> >
>>> >> > On Wed, Dec 14, 2016 at 4:15 PM, Marshall Schor <msa@schor.com>
>>> >> > wrote:
>>> >> >
>>> >> >> Maybe we've been on the wrong line of thinking.
>>> >> >>
>>> >> >> Perhaps the translation between UTF-8 (during transportation) and
>>> >> >> the string characters is fine, but the XML parsing is restricting
>>> >> >> the character set it uses.
>>> >> >>
>>> >> >> See https://en.wikipedia.org/wiki/Valid_characters_in_XML
>>> >> >>
>>> >> >> where it says valid xml characters exclude the "surrogates", which
>>> >> >> your characters I think are.
>>> >> >>
>>> >> >> So, perhaps it's XML parsing which is complaining (and it appears
>>> >> >> this is so, from the stack trace).
>>> >> >>
>>> >> >> We should point out that UIMA's character offsets (like begin and
>>> >> >> end) were designed with Java String character offsets, and will
>>> >> >> perhaps not work correctly when surrogates are being used.
>>> >> >>
>>> >> >> A possible workaround for this particular issue may be to switch
>>> >> >> to binary serialization, instead of xmi serialization. This has a
>>> >> >> restriction in that the type systems must be identical (between
>>> >> >> the client and server).
>>> >> >>
>>> >> >> We could possibly get more confirmation of this hypothesis if you
>>> >> >> could say what the stack trace was, beyond the first bit which you
>>> >> >> stated in your original note.  There should be more stack trace
>>> >> >> information, further down, starting with "caused by ..." which may
>>> >> >> provide more helpful information.
>>> >> >>
>>> >> >> -Marshall
>>> >> >>
>>> >> >>
>>> >> >> On 12/14/2016 9:38 AM, nelson rivera wrote:
>>> >> >> > We also ran that test with the uima framework and the RunAE tool,
>>> >> >> > with the characters in a file as you did, and indeed there is no
>>> >> >> > problem. The problem appears with uima-as: I call sendCAS() on
>>> >> >> > UimaAsynchronousEngine, and when the remote uima-as service
>>> >> >> > tries to deserialize the CAS in deserializeCasFromXmi(), I get
>>> >> >> > the mentioned exception
>>> >> >> > "org.xml.sax.SAXParseException; lineNumber: 1; columnNumber:
>>> >> >> > 571; Character reference "&#"
>>> >> >> >
>>> >> >> > In my case I don't read any file and don't use
>>> >> >> > FileSystemCollectionReader.
>>> >> >> > The user enters the text, which is stored in a Java String
>>> >> >> > (UTF-16) and set on the CAS that will be processed, using
>>> >> >> > setDocumentLanguage; then I send the CAS.
>>> >> >> >
>>> >> >> > 2016-12-13 15:10 GMT-05:00, Burn Lewis <burnlewis@gmail.com>:
>>> >> >> >> I put these 3 characters as UTF-8 in a file in examples/data
>>> >> >> >> and ran the MeetingDetector annotator as described in section
>>> >> >> >> 3.4 of the README, adding the option "-o out".  In that folder
>>> >> >> >> I found the returned results in xmi format with the characters
>>> >> >> >> in the sofaString element.  The relevant part of this file in
>>> >> >> >> hex is:
>>> >> >> >>
>>> >> >> >> 000002e0: 7472 696e 673d 22*f0 96a6 80f0 96a6 90ef*
>>> >> >> >> tring=".........
>>> >> >> >> 000002f0: *bfbd* 2623 3130 3b22 2f3e 3c63 6173 3a56
>>> >> >> >> ..&#10;"/><cas:V
>>> >> >> >>
>>> >> >> >> Note that the FileSystemCollectionReader by default uses the
>>> >> >> >> system encoding but you could add a
>>> >> >> >> ConfigurationParameterSetting of UTF-8 for the Encoding
>>> >> >> >> parameter in its descriptor.
>>> >> >> >>
>>> >> >> >> With the client & server on different (Linux) machines I see no
>>> >> >> >> problem with sending UTF-8 characters.
>>> >> >> >>
>>> >> >> >>
>>> >> >> >> On Mon, Dec 12, 2016 at 2:15 PM, Marshall Schor <msa@schor.com>
>>> >> >> >> wrote:
>>> >> >> >>
>>> >> >> >>> another question:  I assume there are perhaps 2 machines
>>> >> >> >>> involved, here (it's a UIMA-AS setup).
>>> >> >> >>>
>>> >> >> >>> From the exception, it appears that the error happens when the
>>> >> >> >>> client sends the CAS to the remote.
>>> >> >> >>>
>>> >> >> >>> Can you print out the Linux (assuming that's the OS) default
>>> >> >> >>> locale for both machines?  (e.g. type into a command line
>>> >> >> >>> "locale" and see what each machine has as its default
>>> >> >> >>> character encoding).
>>> >> >> >>>
>>> >> >> >>> Please let us know what these are.
>>> >> >> >>>
>>> >> >> >>> Thanks. -Marshall
>>> >> >> >>>
>>> >> >> >>>
>>> >> >> >>>
>>> >> >> >>> On 12/12/2016 1:58 PM, nelson rivera wrote:
>>> >> >> >>>> Yes, these are the values of the troublesome characters; using
>>> >> >> >>>> Integer.toHexString() to print out each byte shows
>>> >> >> >>>>
>>> >> >> >>>> fffffff0 ffffff96 ffffffa6 ffffff80
>>> >> >> >>>>
>>> >> >> >>>> fffffff0 ffffff96 ffffffa6 ffffff90
>>> >> >> >>>>
>>> >> >> >>>> ffffffef ffffffbf ffffffbd
>>> >> >> >>>>
>>> >> >> >>>> ffffffef ffffffbf ffffffbd
>>> >> >> >>>>
>>> >> >> >>>> 2016-12-12 11:35 GMT-05:00, Marshall Schor <msa@schor.com>:
>>> >> >> >>>>> Hi Nelson,
>>> >> >> >>>>>
>>> >> >> >>>>> Looking into this... Can you please confirm that the UTF-8
>>> >> >> >>>>> coding of the troublesome characters, in hexadecimal, is:
>>> >> >> >>>>>
>>> >> >> >>>>> F0 96 A6 80
>>> >> >> >>>>>
>>> >> >> >>>>> F0 96 A6 90
>>> >> >> >>>>>
>>> >> >> >>>>> EF BF BD
>>> >> >> >>>>>
>>> >> >> >>>>> EF BF BD
>>> >> >> >>>>>
>>> >> >> >>>>> If you have the string in Java, please try converting it to
>>> >> >> >>>>> a UTF-8 string using something like:
>>> >> >> >>>>>   byte[] theBytes = myTestString.getBytes("UTF-8");
>>> >> >> >>>>>
>>> >> >> >>>>>   and then print out theBytes in hex; they should look like
>>> >> >> >>>>> the above.  If not, please let us know what the values are
>>> >> >> >>>>> instead.
>>> >> >> >>>>>
>>> >> >> >>>>>
>>> >> >> >>>>> Thanks. -Marshall
>>> >> >> >>>>>
>>> >> >> >>>>>
>>> >> >> >>>>> On 12/9/2016 9:02 AM, nelson rivera wrote:
>>> >> >> >>>>>> Hi, I read your explanation and looked at the link, but in
>>> >> >> >>>>>> my case I don't read any xml file. I just copy the text,
>>> >> >> >>>>>> get a new input CAS from UimaAsynchronousEngine with
>>> >> >> >>>>>> getCAS(), set the text in the CAS and send the request with
>>> >> >> >>>>>> sendCAS(). I use the uima-as API 2.9.0 on the client side.
>>> >> >> >>>>>> Apparently the characters are replaced by their
>>> >> >> >>>>>> corresponding entities when the CAS is serialized for
>>> >> >> >>>>>> sending, but I get the mentioned exception
>>> >> >> >>>>>> "org.xml.sax.SAXParseException; lineNumber: 1;
>>> >> >> >>>>>> columnNumber: 571; Character reference "&#"
>>> >> >> >>>>>> in the installed uima-as framework when it tries to
>>> >> >> >>>>>> deserialize the CAS in deserializeCasFromXmi(), to be
>>> >> >> >>>>>> processed by the service.
>>> >> >> >>>>>>
>>> >> >> >>>>>> 2016-12-08 16:48 GMT-05:00, Marshall Schor <msa@schor.com>:
>>> >> >> >>>>>>> Hi Nelson,
>>> >> >> >>>>>>>
>>> >> >> >>>>>>> I can't see the characters (sorry).
>>> >> >> >>>>>>>
>>> >> >> >>>>>>> This might be an issue caused by a discrepancy between the
>>> >> >> >>>>>>> coding of the file being read, and the coding indicated on
>>> >> >> >>>>>>> the xml header.  Can you check that those two things are
>>> >> >> >>>>>>> the same?
>>> >> >> >>>>>>>
>>> >> >> >>>>>>> See
>>> >> >> >>>>>>> http://stackoverflow.com/questions/5165347/what-use-is-the-encoding-in-the-xml-header
>>> >> >> >>>>>>> for example.
>>> >> >> >>>>>>>
>>> >> >> >>>>>>> -Marshall
>>> >> >> >>>>>>>
>>> >> >> >>>>>>> On 12/8/2016 4:20 PM, nelson rivera wrote:
>>> >> >> >>>>>>>> I tried to process the following text in a service
>>> >> >> >>>>>>>> deployed in uima-as, because it is input to my
>>> >> >> >>>>>>>> application. This is the text: 𖦀 𖦐 � �.
>>> >> >> >>>>>>>> These characters belong to the Bamun language, and
>>> >> >> >>>>>>>> apparently are not invalid xml characters, because tools
>>> >> >> >>>>>>>> such as browsers interpret and display them. After
>>> >> >> >>>>>>>> getting a new input CAS for processing, setting the text
>>> >> >> >>>>>>>> and sending the request, I get the exception shown below
>>> >> >> >>>>>>>> in uima-as. The uima-as framework keeps working and
>>> >> >> >>>>>>>> recovers correctly; it just does not process these
>>> >> >> >>>>>>>> characters.
>>> >> >> >>>>>>>> Could you tell me what happens with these characters? Is
>>> >> >> >>>>>>>> one of them an invalid character for the uima-as
>>> >> >> >>>>>>>> framework?
>>> >> >> >>>>>>>>
>>> >> >> >>>>>>>>
>>> >> >> >>>>>>>>
>>> >> >> >>>>>>>> 04:00:31.606 - 14: org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handleProcessRequestFromRemoteClient: WARNING:
>>> >> >> >>>>>>>> org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 571; Character reference "&#
>>> >> >> >>>>>>>>         at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1239)
>>> >> >> >>>>>>>>         at org.apache.uima.aae.UimaSerializer.deserializeCasFromXmi(UimaSerializer.java:187)
>>> >> >> >>>>>>>>         at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.deserializeCASandRegisterWithCache(ProcessRequestHandler_impl.java:222)
>>> >> >> >>>>>>>>         at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handleProcessRequestFromRemoteClient(ProcessRequestHandler_impl.java:552)
>>> >> >> >>>>>>>>         at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handle(ProcessRequestHandler_impl.java:1090)
>>> >> >> >>>>>>>>         at org.apache.uima.aae.handler.input.MetadataRequestHandler_impl.handle(MetadataRequestHandler_impl.java:78)
>>> >> >> >>>>>>>>         at org.apache.uima.adapter.jms.activemq.JmsInputChannel.onMessage(JmsInputChannel.java:731)
>>> >> >> >>>
>>> >> >>
>>> >> >>
>>> >> >
>>> >>
>>> >
>>>
>>
>
