From: Burn Lewis
Date: Fri, 16 Dec 2016 14:06:56 -0500
Subject: Re: Proccesing Bamun characters
To: user@uima.apache.org

Sorry, I missed the supplement set. So the tests I did with x16980 &
x16990 are valid. runRemoteAsyncAE uses the same
FileSystemCollectionReader as runAE does ... did you use a different
collection reader? If it is a custom one, perhaps you could serialize the
CAS to a file as XMI and verify that the XMI is legal.

On Fri, Dec 16, 2016 at 8:37 AM, nelson rivera wrote:

> In Wikipedia the Bamum script
> (https://en.wikipedia.org/wiki/Bamum_script) has another valid range,
> U+16800–U+16A3F. Any of these characters generates the same log trace.
> I will continue to test Marshall Schor's suggestion.
>
> 2016-12-14 18:07 GMT-05:00, Burn Lewis:
> > I think there's another problem ... the characters we have tested
> > with are not in the Bamum unicode set. The first 2 that Marshall
> > listed in utf-8 (F0 96 A6 80 & F0 96 A6 90) are in hex x16980 &
> > x16990 and the 3rd (EF BF BD) is xFFFD. This last one is the
> > "replacement character" used when an illegal character is
> > encountered. According to Wikipedia the 88 Bamum characters are in
> > the range xA6A0 - xA6F7.
> >
> > In order to reproduce your problem we need to use the same
> > codepoints. Can you tell us what the hex values of the failing
> > characters are, in UTF-8 or UTF-16?
> >
> > By the way, the test I ran was using UIMA-AS's runRemoteAsyncAE, not
> > runAE, following the quick test described in the UIMA-AS README.
> >
> > On Wed, Dec 14, 2016 at 4:15 PM, Marshall Schor wrote:
> >
> >> Maybe we've been on the wrong line of thinking.
> >>
> >> Perhaps the translation between UTF-8 (during transportation) and the
> >> string characters is fine, but the XML parsing is restricting the
> >> character set it uses.
> >>
> >> See https://en.wikipedia.org/wiki/Valid_characters_in_XML
> >>
> >> where it says valid xml characters exclude the "surrogates", which
> >> your characters I think are.
> >>
> >> So, perhaps it's XML parsing which is complaining (and it appears this
> >> is so, from the stack trace).
> >>
> >> We should point out that UIMA's character offsets (like begin and end)
> >> were designed with Java String character offsets, and will perhaps not
> >> work correctly when surrogates are being used.
> >>
> >> A possible workaround for this particular issue may be to switch to
> >> binary serialization, instead of xmi serialization. This has a
> >> restriction in that the type systems must be identical (between the
> >> client and server).
> >>
> >> We could possibly get more confirmation of this hypothesis if you
> >> could say what the stack trace was, beyond the first bit which you
> >> stated in your original note. There should be more stack trace
> >> information, further down, starting with "caused by ..." which may
> >> provide more helpful information.
> >>
> >> -Marshall
> >>
> >>
> >> On 12/14/2016 9:38 AM, nelson rivera wrote:
> >> > We also did that test with the uima framework and the RunAE tool,
> >> > with the characters in a file as you did, and indeed there is no
> >> > problem. The problem is using uima-as: sendCAS() with
> >> > UimaAsynchronousEngine, and when the remote uima-as service tries
> >> > to deserialize the cas with deserializeCasFromXmi(), I get the
> >> > mentioned exception
> >> > "org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 571;
> >> > Character reference "&#"
> >> >
> >> > In my case I don't read any file and do not use
> >> > FileSystemCollectionReader.
> >> > The user enters the text, the text is stored in a Java String
> >> > (utf-16) and it is set on the cas that will be processed, using
> >> > setDocumentLanguage; then I send the cas.
> >> >
> >> > 2016-12-13 15:10 GMT-05:00, Burn Lewis:
> >> >> I put these 3 characters as UTF-8 in a file in examples/data and
> >> >> ran the MeetingDetector annotator as described in section 3.4 of
> >> >> the README, adding the option "-o out". In that folder I found the
> >> >> returned results in xmi format with the characters in the
> >> >> sofaString element. The relevant part of this file in hex is:
> >> >>
> >> >> 000002e0: 7472 696e 673d 22*f0 96a6 80f0 96a6 90ef* tring=".........
> >> >> 000002f0: *bfbd* 2623 3130 3b22 2f3e 3c63 6173 3a56 ..&#10;"/><cas:V
> >> >>
> >> >> Note that the FileSystemCollectionReader by default uses the system
> >> >> encoding but you could add a ConfigurationParameterSetting of UTF-8
> >> >> for the Encoding parameter in its descriptor.
> >> >>
> >> >> With the client & server on different (Linux) machines I see no
> >> >> problem with sending UTF-8 characters.
> >> >>
> >> >>
> >> >> On Mon, Dec 12, 2016 at 2:15 PM, Marshall Schor wrote:
> >> >>
> >> >>> another question: I assume there are perhaps 2 machines involved
> >> >>> here (it's a UIMA-AS setup).
> >> >>>
> >> >>> From the exception, it appears that the error happens when the
> >> >>> client sends the CAS to the remote.
> >> >>>
> >> >>> Can you print out the Linux (assuming that's the OS) default
> >> >>> locale for both machines? (e.g. type "locale" into a command line
> >> >>> and see what each machine has as its default character encoding).
> >> >>>
> >> >>> Please let us know what these are.
> >> >>>
> >> >>> Thanks.
> >> >>> -Marshall
> >> >>>
> >> >>>
> >> >>> On 12/12/2016 1:58 PM, nelson rivera wrote:
> >> >>>> Yes, these are the values of the troublesome characters. Using
> >> >>>> Integer.toHexString() to print out each byte shows
> >> >>>>
> >> >>>> fffffff0 ffffff96 ffffffa6 ffffff80
> >> >>>>
> >> >>>> fffffff0 ffffff96 ffffffa6 ffffff90
> >> >>>>
> >> >>>> ffffffef ffffffbf ffffffbd
> >> >>>>
> >> >>>> ffffffef ffffffbf ffffffbd
> >> >>>>
> >> >>>> 2016-12-12 11:35 GMT-05:00, Marshall Schor:
> >> >>>>> Hi Nelson,
> >> >>>>>
> >> >>>>> Looking into this... Can you please confirm that the UTF-8
> >> >>>>> coding of the troublesome characters, in hexadecimal, is:
> >> >>>>>
> >> >>>>> F0 96 A6 80
> >> >>>>>
> >> >>>>> F0 96 A6 90
> >> >>>>>
> >> >>>>> EF BF BD
> >> >>>>>
> >> >>>>> EF BF BD
> >> >>>>>
> >> >>>>> If you have the string in Java, please try converting it to a
> >> >>>>> UTF-8 string using something like:
> >> >>>>> byte[] theBytes = myTestString.getBytes("UTF-8");
> >> >>>>>
> >> >>>>> and then print out theBytes in hex; they should look like the
> >> >>>>> above. If not, please let us know what the values are instead.
> >> >>>>>
> >> >>>>>
> >> >>>>> Thanks. -Marshall
> >> >>>>>
> >> >>>>>
> >> >>>>> On 12/9/2016 9:02 AM, nelson rivera wrote:
> >> >>>>>> Hi, I read your explanation and saw the link, but in my case I
> >> >>>>>> don't read any xml file. I just copy the text, get a new input
> >> >>>>>> cas from UimaAsynchronousEngine with getCAS(), set the text in
> >> >>>>>> the cas and send the request with sendCAS(). I use uima-as API
> >> >>>>>> 2.9.0 on the client side.
Apparently the characters are changed for its entities > >> >>>>>> corresponding when serialize the cas to send it, but i get the > >> >>>>>> mentioned exception "org.xml.sax.SAXParseException; lineNumber: > 1; > >> >>>>>> columnNumber: 571; Character reference "&#" > >> >>>>>> in uima-as framework installed when trying to deserialize the c= as > >> >>>>>> deserializeCasFromXmi(),to be processed for the service. > >> >>>>>> > >> >>>>>> 2016-12-08 16:48 GMT-05:00, Marshall Schor : > >> >>>>>>> Hi Nelson, > >> >>>>>>> > >> >>>>>>> I can't see the characters (sorry). > >> >>>>>>> > >> >>>>>>> This might be an issue caused by a discrepancy between the > coding > >> of > >> >>> the > >> >>>>>>> file > >> >>>>>>> being read, and the coding indicated on the xml header. Can y= ou > >> >>>>>>> check > >> >>>>>>> that > >> >>>>>>> those two things are the same? > >> >>>>>>> > >> >>>>>>> See > >> >>>>>>> http://stackoverflow.com/questions/5165347/what-use-is- > >> >>> the-encoding-in-the-xml-header > >> >>>>>>> for example. > >> >>>>>>> > >> >>>>>>> -Marshall > >> >>>>>>> > >> >>>>>>> On 12/8/2016 4:20 PM, nelson rivera wrote: > >> >>>>>>>> i tried to proccess the following text in a service deploy in > >> >>> uima-as, > >> >>>>>>>> because is input of my application. This is the text : =F0=96= =A6=80 =F0=96=A6=90 > =EF=BF=BD > >> >>>>>>>> =EF=BF=BD. > >> >>>>>>>> These characters correspond to the bamun language, and > >> >>>>>>>> apparently > >> >>>>>>>> are > >> >>>>>>>> not invalid xml characters because tools such as browsers > >> >>>>>>>> interpret > >> >>>>>>>> it and show it. After get a new input cas to proccesing, set > the > >> >>>>>>>> text > >> >>>>>>>> and send the request, i get the exception that i show below = in > >> >>>>>>>> uima-as, the framework uima-as work and recovers correctly, > just > >> >>>>>>>> not > >> >>>>>>>> process this characters. 
> >> >>>>>>>> Could you tell me what happens with these characters? Is one
> >> >>>>>>>> of them an invalid character for the uima-as framework?
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> 04:00:31.606 - 14: org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handleProcessRequestFromRemoteClient:
> >> >>>>>>>> WARNING:
> >> >>>>>>>> org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 571;
> >> >>>>>>>> Character reference "&#
> >> >>>>>>>> at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1239)
> >> >>>>>>>> at org.apache.uima.aae.UimaSerializer.deserializeCasFromXmi(UimaSerializer.java:187)
> >> >>>>>>>> at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.deserializeCASandRegisterWithCache(ProcessRequestHandler_impl.java:222)
> >> >>>>>>>> at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handleProcessRequestFromRemoteClient(ProcessRequestHandler_impl.java:552)
> >> >>>>>>>> at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handle(ProcessRequestHandler_impl.java:1090)
> >> >>>>>>>> at org.apache.uima.aae.handler.input.MetadataRequestHandler_impl.handle(MetadataRequestHandler_impl.java:78)
> >> >>>>>>>> at org.apache.uima.adapter.jms.activemq.JmsInputChannel.onMessage(JmsInputChannel.java:731)
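
For anyone reproducing the checks in this thread, the byte and code point inspection can be sketched in plain Java. This is only an illustration under stated assumptions: the class name CodePointInspector and its two helper methods are hypothetical and are not part of UIMA or UIMA-AS, and the input is the byte sequence F0 96 A6 80 EF BF BD quoted in the thread. Each byte is masked with 0xFF, so the output shows two hex digits per byte instead of the sign-extended fffffff0 form that Integer.toHexString() gives for a negative byte. Code points above U+FFFF are flagged, since a Java String stores them as a surrogate pair.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of UIMA or UIMA-AS.
final class CodePointInspector {

    /** Formats bytes as two-digit hex; the 0xFF mask avoids sign extension of negative bytes. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Lists the code points of a string as U+XXXX and flags those outside the BMP. */
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(", ");
            }
            sb.append(String.format("U+%04X", cp));
            if (Character.isSupplementaryCodePoint(cp)) {
                sb.append(" (surrogate pair in UTF-16)");
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Bytes taken from the thread: one supplementary character, then U+FFFD.
        byte[] utf8 = {(byte) 0xF0, (byte) 0x96, (byte) 0xA6, (byte) 0x80,
                       (byte) 0xEF, (byte) 0xBF, (byte) 0xBD};
        String text = new String(utf8, StandardCharsets.UTF_8);

        System.out.println(toHex(text.getBytes(StandardCharsets.UTF_8)));
        System.out.println(describe(text));
        // length() counts UTF-16 chars; codePointCount() counts characters.
        System.out.println("chars=" + text.length()
                + " codePoints=" + text.codePointCount(0, text.length()));
    }
}
```

On these bytes the string has three Java chars but only two code points. That difference is the gap between Java String offsets and characters that Marshall's note about begin and end offsets points at.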