Subject: Re: Proccesing Bamun characters
To: user@uima.apache.org
From: Marshall Schor
Date: Mon, 12 Dec 2016 14:15:02 -0500
Message-ID: <6638a569-11a9-99f8-0ee1-b5e945f756aa@schor.com>
References: <047f76bc-7ec9-68a7-12c3-bfc80785e332@schor.com>

Another question: I assume there are perhaps 2 machines involved here (it's a
UIMA-AS setup). From the exception, it appears that the error happens when the
client sends the CAS to the remote.
Can you print out the Linux (assuming that's the OS) default locale for both
machines? (E.g., type "locale" into a command line and see what each machine
has as its default character encoding.) Please let us know what these are.

Thanks. -Marshall

On 12/12/2016 1:58 PM, nelson rivera wrote:
> Yes, these are the values of the troublesome characters. Using
> Integer.toHexString() to print out each byte shows:
>
> fffffff0 ffffff96 ffffffa6 ffffff80
>
> fffffff0 ffffff96 ffffffa6 ffffff90
>
> ffffffef ffffffbf ffffffbd
>
> ffffffef ffffffbf ffffffbd
>
> 2016-12-12 11:35 GMT-05:00, Marshall Schor:
>> Hi Nelson,
>>
>> Looking into this... Can you please confirm that the UTF-8 coding of the
>> troublesome characters, in hexadecimal, is:
>>
>> F0 96 A6 80
>>
>> F0 96 A6 90
>>
>> EF BF BD
>>
>> EF BF BD
>>
>> If you have the string in Java, please try converting it to UTF-8 bytes
>> using something like:
>> byte[] theBytes = myTestString.getBytes("UTF-8");
>>
>> and then print out theBytes in hex; they should look like the above. If
>> not, please let us know what the values are instead.
>>
>>
>> Thanks. -Marshall
>>
>>
>> On 12/9/2016 9:02 AM, nelson rivera wrote:
>>> Hi, I read your explanation and saw the link, but in my case I
>>> don't read any XML file. I just copy the text, get a new input CAS
>>> from UimaAsynchronousEngine with getCAS(), set the text in the CAS, and
>>> send the request with sendCAS(). I use the UIMA-AS API 2.9.0 on the
>>> client side. Apparently the characters are replaced by their
>>> corresponding entities when the CAS is serialized for sending, but I get
>>> the mentioned exception "org.xml.sax.SAXParseException; lineNumber: 1;
>>> columnNumber: 571; Character reference "&#"
>>> in the installed UIMA-AS framework when it tries to deserialize the CAS
>>> with deserializeCasFromXmi(), to be processed by the service.
>>>
>>> 2016-12-08 16:48 GMT-05:00, Marshall Schor:
>>>> Hi Nelson,
>>>>
>>>> I can't see the characters (sorry).
>>>>
>>>> This might be an issue caused by a discrepancy between the encoding of
>>>> the file being read and the encoding indicated in the XML header. Can
>>>> you check that those two things are the same?
>>>>
>>>> See
>>>> http://stackoverflow.com/questions/5165347/what-use-is-the-encoding-in-the-xml-header
>>>> for example.
>>>>
>>>> -Marshall
>>>>
>>>> On 12/8/2016 4:20 PM, nelson rivera wrote:
>>>>> I tried to process the following text in a service deployed in UIMA-AS,
>>>>> because it is input to my application. This is the text: 𖦀 𖦐 � �.
>>>>> These characters belong to the Bamun language, and apparently they are
>>>>> not invalid XML characters, because tools such as browsers interpret
>>>>> and display them. After getting a new input CAS for processing, setting
>>>>> the text, and sending the request, I get the exception shown below in
>>>>> UIMA-AS. The UIMA-AS framework keeps working and recovers correctly; it
>>>>> just does not process these characters.
>>>>> Could you tell me what happens with these characters? Is one of them an
>>>>> invalid character for the UIMA-AS framework?
>>>>>
>>>>>
>>>>>
>>>>> 04:00:31.606 - 14:
>>>>> org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handleProcessRequestFromRemoteClient:
>>>>> WARNING:
>>>>> org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 571;
>>>>> Character reference "&#
>>>>> at
>>>>> com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1239)
>>>>> at
>>>>> org.apache.uima.aae.UimaSerializer.deserializeCasFromXmi(UimaSerializer.java:187)
>>>>> at
>>>>> org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.deserializeCASandRegisterWithCache(ProcessRequestHandler_impl.java:222)
>>>>> at
>>>>> org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handleProcessRequestFromRemoteClient(ProcessRequestHandler_impl.java:552)
>>>>> at
>>>>> org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handle(ProcessRequestHandler_impl.java:1090)
>>>>> at
>>>>> org.apache.uima.aae.handler.input.MetadataRequestHandler_impl.handle(MetadataRequestHandler_impl.java:78)
>>>>> at
>>>>> org.apache.uima.adapter.jms.activemq.JmsInputChannel.onMessage(JmsInputChannel.java:731)
>>>>>
>>
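
The byte values exchanged in this thread can be cross-checked with a minimal
Java sketch. It assumes the two Bamun characters are U+16980 and U+16990
(inferred from the UTF-8 bytes F0 96 A6 80 and F0 96 A6 90) and that the other
two are U+FFFD (EF BF BD). The class name Utf8HexDump and the helper toHex are
hypothetical, not part of UIMA. Masking each byte with 0xFF avoids the sign
extension that makes Integer.toHexString() print fffffff0 rather than f0, so
the values Nelson reported appear to match the expected encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf8HexDump {

    // Formats bytes as space-separated, two-digit uppercase hex.
    // "b & 0xFF" converts the signed byte to an int in 0..255, so a byte
    // such as (byte) 0xF0 prints as F0 instead of the sign-extended FFFFFFF0.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed code points; built from numbers so the encoding of this
        // source file cannot alter them.
        int[] codePoints = {0x16980, 0x16990, 0xFFFD, 0xFFFD};
        for (int cp : codePoints) {
            String s = new String(Character.toChars(cp));
            byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
            System.out.println(String.format("U+%04X -> %s", cp, toHex(utf8)));
        }

        // Without the mask, the byte is widened to a negative int first.
        System.out.println(Integer.toHexString((byte) 0xF0)); // fffffff0

        // The JVM default charset; on many setups it follows the OS locale,
        // but that depends on the Java version and launch options.
        System.out.println(Charset.defaultCharset());
    }
}
```

Running this on both the client and the service machine gives a quick view of
whether the two JVMs agree on the default charset, alongside the output of the
"locale" command requested above.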