From: Marshall Schor
To: user@uima.apache.org
Subject: Re: Proccesing Bamun characters
Date: Mon, 12 Dec 2016 11:35:52 -0500
References: <047f76bc-7ec9-68a7-12c3-bfc80785e332@schor.com>

Hi Nelson,

Looking into this...

Can you please confirm that the UTF-8 encoding of the troublesome characters, in hexadecimal, is:

F0 96 A6 80 F0 96 A6 90 EF BF BD EF BF BD

If you have the string in Java, please try converting it to UTF-8 bytes using something like:

    byte[] theBytes = myTestString.getBytes("UTF-8");

and then print out theBytes in hex; they should look like the above.
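A minimal sketch of that check, assuming Java 7 or later. The class name Utf8HexDump and the helper toUtf8Hex are made up for illustration, and the test string is built from the code points U+16980, U+16990 and U+FFFD (twice) with no spaces in between, to match the hex sequence above:

```java
import java.nio.charset.StandardCharsets;

public class Utf8HexDump {

    /** Returns the UTF-8 encoding of s as space-separated, uppercase hex bytes. */
    static String toUtf8Hex(String s) {
        byte[] theBytes = s.getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : theBytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so negative byte values print as two hex digits.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the real input text (assumption: no spaces).
        String myTestString = new String(Character.toChars(0x16980))
                + new String(Character.toChars(0x16990))
                + "\uFFFD\uFFFD";
        System.out.println(toUtf8Hex(myTestString));
    }
}
```

For the string built here, the output should be the byte sequence listed above; substitute your own string for myTestString to check the real input.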
If not, please let us know what the values are instead.

Thanks. -Marshall

On 12/9/2016 9:02 AM, nelson rivera wrote:
> Hi, I read your explanation and looked at the link, but in my case I
> don't read any XML file. I just copy the text, get a new input CAS
> from UimaAsynchronousEngine with getCAS(), set the text in the CAS, and
> send the request with sendCAS(). I use the UIMA-AS API 2.9.0 on the
> client side. Apparently the characters are replaced by their
> corresponding entities when the CAS is serialized for sending, but I
> get the mentioned exception "org.xml.sax.SAXParseException; lineNumber: 1;
> columnNumber: 571; Character reference "&#"
> in the installed UIMA-AS framework when it tries to deserialize the CAS
> with deserializeCasFromXmi() so that the service can process it.
>
> 2016-12-08 16:48 GMT-05:00, Marshall Schor:
>> Hi Nelson,
>>
>> I can't see the characters (sorry).
>>
>> This might be an issue caused by a discrepancy between the encoding of the
>> file being read and the encoding indicated in the XML header. Can you
>> check that those two things are the same?
>>
>> See
>> http://stackoverflow.com/questions/5165347/what-use-is-the-encoding-in-the-xml-header
>> for example.
>>
>> -Marshall
>>
>> On 12/8/2016 4:20 PM, nelson rivera wrote:
>>> I tried to process the following text in a service deployed in UIMA-AS,
>>> because it is input to my application. This is the text: 𖦀 𖦐 � �.
>>> These characters belong to the Bamun language and apparently are
>>> not invalid XML characters, because tools such as browsers interpret
>>> and display them. After getting a new input CAS for processing, setting
>>> the text, and sending the request, I get the exception shown below in
>>> UIMA-AS. The UIMA-AS framework keeps working and recovers correctly; it
>>> just does not process these characters.
>>> Could you tell me what happens with these characters? Is one of them
>>> an invalid character for the UIMA-AS framework?
>>>
>>> 04:00:31.606 - 14:
>>> org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handleProcessRequestFromRemoteClient:
>>> WARNING:
>>> org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 571;
>>> Character reference "&#
>>> at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1239)
>>> at org.apache.uima.aae.UimaSerializer.deserializeCasFromXmi(UimaSerializer.java:187)
>>> at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.deserializeCASandRegisterWithCache(ProcessRequestHandler_impl.java:222)
>>> at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handleProcessRequestFromRemoteClient(ProcessRequestHandler_impl.java:552)
>>> at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handle(ProcessRequestHandler_impl.java:1090)
>>> at org.apache.uima.aae.handler.input.MetadataRequestHandler_impl.handle(MetadataRequestHandler_impl.java:78)
>>> at org.apache.uima.adapter.jms.activemq.JmsInputChannel.onMessage(JmsInputChannel.java:731)
>>>
>>
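
On the quoted question of whether these are valid XML characters, here is a rough sketch that tests code points against the Char production of the XML 1.0 specification. The class name XmlCharCheck and the method isXml10Char are hypothetical names for illustration. The sketch also tests the individual UTF-16 code units Java uses for each supplementary character, because a surrogate half written as its own numeric character reference is not a legal XML character. Whether the UIMA-AS serializer actually does that is an assumption on my part and is not confirmed anywhere in this thread.

```java
public class XmlCharCheck {

    /** True if the code point matches the XML 1.0 Char production. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    public static void main(String[] args) {
        // Code points taken from the hex sequence discussed in this thread.
        int[] codePoints = {0x16980, 0x16990, 0xFFFD};
        for (int cp : codePoints) {
            System.out.println(String.format("U+%04X valid as XML 1.0 Char: %b",
                    cp, isXml10Char(cp)));
            // Supplementary code points occupy two UTF-16 code units in Java.
            char[] units = Character.toChars(cp);
            if (units.length == 2) {
                for (char unit : units) {
                    System.out.println(String.format(
                            "  UTF-16 unit %04X valid on its own: %b",
                            (int) unit, isXml10Char(unit)));
                }
            }
        }
    }
}
```

If the whole code points pass but the individual surrogate units do not, comparing against the serialized XMI around column 571 would show which form was actually written.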