uima-user mailing list archives

From Burn Lewis <burnle...@gmail.com>
Subject Re: Proccesing Bamun characters
Date Fri, 16 Dec 2016 19:06:56 GMT
Sorry, I missed the Bamum Supplement block.  So the tests I did with
x16980 & x16990 are valid.  runRemoteAsyncAE uses the same
FileSystemCollectionReader as runAE does.  Did you use a different
collection reader?  If it is a custom one, perhaps you could serialize the
CAS to a file as XMI and verify that the XMI is legal.
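
For anyone who wants to double-check the codepoints discussed in this
thread, here is a minimal Java sketch using only the JDK (no UIMA calls).
The class and method names (SupplementaryCheck, decodeUtf8, isLegalXmlChar,
describe) are made up for illustration.  It decodes the UTF-8 bytes quoted
below, prints each codepoint, and tests it against the XML 1.0 "Char"
production.  It only shows that the codepoints themselves are legal XML
characters; it does not show what the UIMA-AS serializer actually writes.

```java
import java.nio.charset.StandardCharsets;

public class SupplementaryCheck {

    // XML 1.0 "Char" production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isLegalXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Hypothetical helper: build a String from UTF-8 byte values given as ints.
    static String decodeUtf8(int... bytes) {
        byte[] raw = new byte[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            raw[i] = (byte) bytes[i];
        }
        return new String(raw, StandardCharsets.UTF_8);
    }

    // One line per codepoint: value, XML legality, number of UTF-16 chars used.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format(
                "U+%04X xmlLegal=%b utf16Units=%d%n",
                cp, isLegalXmlChar(cp), Character.charCount(cp))));
        return sb.toString();
    }

    public static void main(String[] args) {
        // The three byte sequences quoted in this thread.
        String text = decodeUtf8(
                0xF0, 0x96, 0xA6, 0x80,
                0xF0, 0x96, 0xA6, 0x90,
                0xEF, 0xBF, 0xBD);
        System.out.print(describe(text));
        // Java Strings are UTF-16, so each supplementary codepoint takes
        // two chars (a surrogate pair).
        System.out.println("String.length() = " + text.length()
                + ", codepoints = " + text.codePointCount(0, text.length()));
    }
}
```

Run as is, this reports U+16980, U+16990 and U+FFFD, all legal in XML 1.0,
with a String length of 5 for 3 codepoints.  A single surrogate half taken
on its own (for example 0xD81A, the high surrogate of U+16980) is not a
legal XML character, which may be worth keeping in mind when inspecting the
XMI.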

On Fri, Dec 16, 2016 at 8:37 AM, nelson rivera <nelsonrivera12@gmail.com>
wrote:

> According to Wikipedia, the Bamum script
> (https://en.wikipedia.org/wiki/Bamum_script) has another valid range,
> U+16800–U+16A3F, and any of these characters generates the same log
> trace. I will continue to test Marshall Schor's suggestion.
>
> 2016-12-14 18:07 GMT-05:00, Burn Lewis <burnlewis@gmail.com>:
> > I think there's another problem ... the characters we have tested with
> > are not in the Bamum Unicode set.  The first 2 that Marshall listed in
> > UTF-8 (F0 96 A6 80 & F0 96 A6 90) are in hex x16980 & x16990 and the 3rd
> > (EF BF BD) is xFFFD.  This last one is the "replacement character" used
> > when an illegal character is encountered.  According to Wikipedia the 88
> > Bamum characters are in the range xA6A0 - xA6F7.
> >
> > In order to reproduce your problem we need to use the same codepoints.
> > Can you tell us what the hex values of the failing characters are, in
> > UTF-8 or UTF-16?
> >
> > By the way, the test I ran was using UIMA-AS's runRemoteAsyncAE, not
> > runAE, following the quick test described in the UIMA-AS README.
> >
> > On Wed, Dec 14, 2016 at 4:15 PM, Marshall Schor <msa@schor.com> wrote:
> >
> >> Maybe we've been on the wrong line of thinking.
> >>
> >> Perhaps the translation between UTF-8 (during transportation) and the
> >> string characters is fine, but the XML parsing is restricting the
> >> character set it uses.
> >>
> >> See https://en.wikipedia.org/wiki/Valid_characters_in_XML
> >>
> >> where it says valid xml characters exclude the "surrogates", which your
> >> characters I think are.
> >>
> >> So, perhaps it's XML parsing which is complaining (and it appears this
> >> is so, from the stack trace).
> >>
> >> We should point out that UIMA's character offsets (like begin and end)
> >> were designed with Java String character offsets, and will perhaps not
> >> work correctly when surrogates are being used.
> >>
> >> A possible workaround for this particular issue may be to switch to
> >> binary serialization, instead of XMI serialization. This has a
> >> restriction in that the type systems must be identical (between the
> >> client and server).
> >>
> >> We could possibly get more confirmation of this hypothesis if you could
> >> say what the stack trace was, beyond the first bit which you stated in
> >> your original note.  There should be more stack trace information,
> >> further down, starting with "caused by ..." which may provide more
> >> helpful information.
> >>
> >> -Marshall
> >>
> >>
> >> On 12/14/2016 9:38 AM, nelson rivera wrote:
> >> > We also ran that test with the UIMA framework and the runAE tool, with
> >> > the characters in a file as you did, and indeed there is no problem.
> >> > The problem occurs with UIMA-AS: I call sendCAS() on the
> >> > UimaAsynchronousEngine, and when the remote UIMA-AS service tries to
> >> > deserialize the CAS in deserializeCasFromXmi(), I get the mentioned
> >> > exception "org.xml.sax.SAXParseException; lineNumber: 1;
> >> > columnNumber: 571; Character reference "&#"
> >> >
> >> > In my case I don't read any file and don't use
> >> > FileSystemCollectionReader. The user enters the text, the text is
> >> > stored in a Java String (UTF-16) and is set on the CAS that will be
> >> > processed, using setDocumentLanguage; then I send the CAS.
> >> >
> >> > 2016-12-13 15:10 GMT-05:00, Burn Lewis <burnlewis@gmail.com>:
> >> >> I put these 3 characters as UTF-8 in a file in examples/data and ran
> >> >> the MeetingDetector annotator as described in section 3.4 of the
> >> >> README, adding the option "-o out".  In that folder I found the
> >> >> returned results in XMI format with the characters in the sofaString
> >> >> element.  The relevant part of this file in hex is:
> >> >>
> >> >> 000002e0: 7472 696e 673d 22*f0 96a6 80f0 96a6 90ef*  tring=".........
> >> >> 000002f0: *bfbd* 2623 3130 3b22 2f3e 3c63 6173 3a56  ..&#10;"/><cas:V
> >> >>
> >> >> Note that the FileSystemCollectionReader by default uses the system
> >> >> encoding but you could add a ConfigurationParameterSetting of UTF-8
> >> >> for the Encoding parameter in its descriptor.
> >> >>
> >> >> With the client & server on different (Linux) machines I see no
> >> >> problem with sending UTF-8 characters.
> >> >>
> >> >>
> >> >> On Mon, Dec 12, 2016 at 2:15 PM, Marshall Schor <msa@schor.com> wrote:
> >> >>
> >> >>> Another question:  I assume there are perhaps 2 machines involved
> >> >>> here (it's a UIMA-AS setup).
> >> >>>
> >> >>> From the exception, it appears that the error happens when the
> >> >>> client sends the CAS to the remote.
> >> >>>
> >> >>> Can you print out the Linux (assuming that's the OS) default locale
> >> >>> for both machines?  (e.g. type "locale" into a command line and see
> >> >>> what each machine has as its default character encoding).
> >> >>>
> >> >>> Please let us know what these are.
> >> >>>
> >> >>> Thanks. -Marshall
> >> >>>
> >> >>>
> >> >>>
> >> >>> On 12/12/2016 1:58 PM, nelson rivera wrote:
> >> >>>> Yes, these are the values of the troublesome characters. Using
> >> >>>> Integer.toHexString() to print out each byte shows:
> >> >>>>
> >> >>>> fffffff0 ffffff96 ffffffa6 ffffff80
> >> >>>>
> >> >>>> fffffff0 ffffff96 ffffffa6 ffffff90
> >> >>>>
> >> >>>> ffffffef ffffffbf ffffffbd
> >> >>>>
> >> >>>> ffffffef ffffffbf ffffffbd
> >> >>>>
> >> >>>> 2016-12-12 11:35 GMT-05:00, Marshall Schor <msa@schor.com>:
> >> >>>>> Hi Nelson,
> >> >>>>>
> >> >>>>> Looking into this... Can you please confirm that the UTF-8 coding
> >> >>>>> of the troublesome characters, in hexadecimal, is:
> >> >>>>>
> >> >>>>> F0 96 A6 80
> >> >>>>>
> >> >>>>> F0 96 A6 90
> >> >>>>>
> >> >>>>> EF BF BD
> >> >>>>>
> >> >>>>> EF BF BD
> >> >>>>>
> >> >>>>> If you have the string in Java, please try converting it to a
> >> >>>>> UTF-8 byte array using something like:
> >> >>>>>   byte[] theBytes = myTestString.getBytes("UTF-8");
> >> >>>>>
> >> >>>>>   and then print out theBytes in hex; they should look like the
> >> >>>>> above.  If not, please let us know what the values are instead.
> >> >>>>>
> >> >>>>>
> >> >>>>> Thanks. -Marshall
> >> >>>>>
> >> >>>>>
> >> >>>>> On 12/9/2016 9:02 AM, nelson rivera wrote:
> >> >>>>>> Hi, I read your explanation and saw the link, but in my case I
> >> >>>>>> don't read any XML file. I just copy the text, get a new input CAS
> >> >>>>>> from UimaAsynchronousEngine with getCAS(), set the text in the CAS
> >> >>>>>> and send the request with sendCAS(). I use the UIMA-AS API 2.9.0 on
> >> >>>>>> the client side. Apparently the characters are replaced by their
> >> >>>>>> corresponding entities when the CAS is serialized for sending, but
> >> >>>>>> I get the mentioned exception "org.xml.sax.SAXParseException;
> >> >>>>>> lineNumber: 1; columnNumber: 571; Character reference "&#"
> >> >>>>>> in the installed UIMA-AS framework when it tries to deserialize the
> >> >>>>>> CAS with deserializeCasFromXmi() so that the service can process it.
> >> >>>>>>
> >> >>>>>> 2016-12-08 16:48 GMT-05:00, Marshall Schor <msa@schor.com>:
> >> >>>>>>> Hi Nelson,
> >> >>>>>>>
> >> >>>>>>> I can't see the characters (sorry).
> >> >>>>>>>
> >> >>>>>>> This might be an issue caused by a discrepancy between the coding
> >> >>>>>>> of the file being read, and the coding indicated on the xml
> >> >>>>>>> header.  Can you check that those two things are the same?
> >> >>>>>>>
> >> >>>>>>> See
> >> >>>>>>> http://stackoverflow.com/questions/5165347/what-use-is-the-encoding-in-the-xml-header
> >> >>>>>>> for example.
> >> >>>>>>>
> >> >>>>>>> -Marshall
> >> >>>>>>>
> >> >>>>>>> On 12/8/2016 4:20 PM, nelson rivera wrote:
> >> >>>>>>>> I tried to process the following text in a service deployed in
> >> >>>>>>>> UIMA-AS, because it is input to my application. This is the text:
> >> >>>>>>>> 𖦀  𖦐 � �.
> >> >>>>>>>> These characters belong to the Bamum language, and apparently
> >> >>>>>>>> they are not invalid XML characters, because tools such as
> >> >>>>>>>> browsers interpret and display them. After I get a new input CAS
> >> >>>>>>>> for processing, set the text and send the request, I get the
> >> >>>>>>>> exception shown below in UIMA-AS. The UIMA-AS framework keeps
> >> >>>>>>>> working and recovers correctly; it just does not process these
> >> >>>>>>>> characters.
> >> >>>>>>>> Could you tell me what happens with these characters? Is one of
> >> >>>>>>>> them an invalid character for the UIMA-AS framework?
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> 04:00:31.606 - 14:
> >> >>>>>>>> org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handleProcessRequestFromRemoteClient:
> >> >>>>>>>> WARNING:
> >> >>>>>>>> org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 571; Character reference "&#
> >> >>>>>>>>         at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1239)
> >> >>>>>>>>         at org.apache.uima.aae.UimaSerializer.deserializeCasFromXmi(UimaSerializer.java:187)
> >> >>>>>>>>         at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.deserializeCASandRegisterWithCache(ProcessRequestHandler_impl.java:222)
> >> >>>>>>>>         at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handleProcessRequestFromRemoteClient(ProcessRequestHandler_impl.java:552)
> >> >>>>>>>>         at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handle(ProcessRequestHandler_impl.java:1090)
> >> >>>>>>>>         at org.apache.uima.aae.handler.input.MetadataRequestHandler_impl.handle(MetadataRequestHandler_impl.java:78)
> >> >>>>>>>>         at org.apache.uima.adapter.jms.activemq.JmsInputChannel.onMessage(JmsInputChannel.java:731)
> >> >>>
> >>
> >>
> >
>
