From: Marshall Schor
Date: Fri, 21 Oct 2011 16:17:08 -0400
To: user@uima.apache.org
Subject: Re: UIMA-AS: non-XML char in text raises SAXParseException

also, see the comments here: https://issues.apache.org/jira/browse/UIMA-387

On 10/21/2011 1:58 PM, Charles Bearden wrote:
> I created a simple UIMA-AS pipeline comprising a collection reader and an
> aggregate AE, which I ran simply like so:
>
> runRemoteAsyncAE.sh tcp://localhost:61616 CollectionReader \
>   -d \
>   -c \
>
> Evidently, the content I wish to process has some non-XML characters in it,
> because a certain bit of data raises an exception, the heart of which appears
> to be:
>
> Caused by: org.xml.sax.SAXParseException: Trying to serialize non-XML 1.0
> character: , 0x19
>
> The complete exception is here:
>
>
> The point in my code at which the exception enters the picture
> (NoteLinesFromDBReader.java:139) is the point in the .getNext() method where I
> get the next CAS:
>     jcas = aCAS.getJCas();
>
> I don't run into this problem when I use the old-fashioned CPE, so my thinking
> is that the CAS from the CR is being serialized before being put into the
> queue. Is the expectation in UIMA AS that I sanitize text artifacts of non-XML
> characters before the CR gets them? Or am I doing something else wrong perhaps?
>
> Thanks for your help,
> Chuck
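
For readers who do want to sanitize text before it reaches the CAS, here is a minimal sketch of one way to do it. It is not something UIMA provides: the class name XmlCharSanitizer and both method names are hypothetical. The character ranges come from the "Char" production of the XML 1.0 specification. Each disallowed character (such as the 0x19 in the exception above) is replaced with a single replacement character instead of being deleted, on the assumption that the string length, and so any annotation offsets computed later, should stay the same.

```java
// Hypothetical helper, not part of UIMA: replaces characters that
// XML 1.0 does not allow so the text can be serialized as XML.
final class XmlCharSanitizer {

    private XmlCharSanitizer() {}

    /** True if the code point is allowed by the XML 1.0 "Char" production. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /**
     * Returns a copy of text in which every code point that is not a legal
     * XML 1.0 character is replaced by the given replacement character.
     * Every illegal code point occupies exactly one Java char, so the
     * result has the same length as the input.
     */
    static String replaceInvalidXml10Chars(String text, char replacement) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            if (isXml10Char(cp)) {
                out.appendCodePoint(cp);   // keep legal characters, including surrogate pairs
            } else {
                out.append(replacement);   // control chars, lone surrogates, U+FFFE, U+FFFF
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String dirty = "note" + (char) 0x19 + "text";
        System.out.println(replaceInvalidXml10Chars(dirty, ' ')); // prints "note text"
    }
}
```

In a collection reader, a call like this would presumably go right before the document text is set on the CAS (for example, before jcas.setDocumentText(...)). Whether that is the right place for a given pipeline is an assumption; the comments on UIMA-387 discuss the underlying issue.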