Mailing-List: contact user-help@uima.apache.org; run by ezmlm
Reply-To: user@uima.apache.org
Subject: Re: How to detect text sofa?
From: Richard Eckart de Castilho
In-Reply-To: <4D99A309.5080304@gmail.com>
Date: Mon, 4 Apr 2011 14:17:25 +0200
To: "user@uima.apache.org"

Hi Jörn,

> what is the suggested way to detect a text sofa?
>
> As far as I know the suggested way of doing it is via the mime type, right?
>
> Which options remain when the mime type is not set? Is CAS.getDocumentText != null appropriate ?

in my opinion, a non-text SofA has getDocumentText() == null - it would acquire the data as a stream instead.

A text SofA might contain markup, which can be reflected by the MIME type.

If data is acquired using a stream, the MIME type should probably be considered to decide whether the content can be rendered as text. However, the mapping from begin and end offsets to the actual character offsets might not be discernible from the MIME type alone. For example, the stream might return HTML while the offsets refer to a plain-text-only "view".

Cheers,

Richard

-- 
-------------------------------------------------------------------
Richard Eckart de Castilho
Technical Lead
Ubiquitous Knowledge Processing Lab
FB 20 Computer Science Department
Technische Universität Darmstadt
Hochschulstr. 10, D-64289 Darmstadt, Germany
phone +49 (6151) 16-7477, fax -5455, room S2/02/E225
eckartde@tk.informatik.tu-darmstadt.de
www.ukp.tu-darmstadt.de
-------------------------------------------------------------------
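
The check discussed in this message could be sketched roughly as follows. This is only an illustration under stated assumptions, not code from the thread or from UIMA: the class `SofaKindDemo`, the enum `SofaKind`, and the method `classify` are hypothetical names, and the two arguments stand in for the values an annotator would presumably read from `CAS.getDocumentText()` and the view's MIME type. Treating a `text/` prefix as "possibly renderable as text" is likewise an assumption, and, as noted above, it says nothing about how annotation offsets map onto the streamed content.

```java
import java.util.Locale;

public class SofaKindDemo {

    /** Possible outcomes of the heuristic described in the message. */
    public enum SofaKind { TEXT, STREAM_MAYBE_TEXT, NON_TEXT }

    // Hypothetical helper, not part of UIMA. The arguments stand in for the
    // document text and the MIME type of the view being inspected.
    public static SofaKind classify(String documentText, String mimeType) {
        // A non-null document text is taken to mean a text SofA.
        if (documentText != null) {
            return SofaKind.TEXT;
        }
        // No document text: the data would come from a stream. Assumption:
        // a "text/..." MIME type suggests it might be renderable as text,
        // but offsets may not refer to the raw stream content.
        if (mimeType != null
                && mimeType.toLowerCase(Locale.ROOT).startsWith("text/")) {
            return SofaKind.STREAM_MAYBE_TEXT;
        }
        // Neither document text nor a textual MIME type is available.
        return SofaKind.NON_TEXT;
    }

    public static void main(String[] args) {
        System.out.println(classify("Hello", null));      // TEXT
        System.out.println(classify(null, "text/html"));  // STREAM_MAYBE_TEXT
        System.out.println(classify(null, "audio/wav"));  // NON_TEXT
    }
}
```

Returning three outcomes rather than a boolean keeps the "stream that might be text" case visible, since that is the case where the offset mapping remains unclear.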