uima-user mailing list archives

From Richard Eckart de Castilho <eckar...@tk.informatik.tu-darmstadt.de>
Subject Re: How to detect text sofa?
Date Mon, 04 Apr 2011 12:17:25 GMT
Hi Jörn,

> what is the suggested way to detect a text sofa?
> 
> As far as I know the suggested way of doing it is via the mime type, right?
> 
> Which options remain when the mime type is not set? Is CAS.getDocumentText != null appropriate?

In my opinion, a non-text SofA has getDocumentText() == null; its data would be acquired as
a stream instead.
A text SofA might contain markup, which can be reflected by the mime type.

If the data is acquired as a stream, the mime type should probably be considered to decide
whether the content can be rendered as text. However, how begin and end offsets map to the
actual character offsets might not be discernible from the mime type alone, for example
if the stream returns HTML but the offsets refer to a plain-text-only "view".
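
A rough sketch of that decision order, in case it helps. To keep it runnable without
UIMA on the classpath, SofaView is a hypothetical stand-in interface whose three methods
are only named after CAS.getDocumentText(), CAS.getSofaMimeType() and
CAS.getSofaDataStream(); the Kind enum, the view(...) helper and the "text/" prefix rule
are assumptions made for illustration, not anything UIMA prescribes.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.Locale;

final class TextSofaCheck {

    /** Hypothetical stand-in for the three CAS accessors this sketch relies on. */
    interface SofaView {
        String getDocumentText();
        String getSofaMimeType();
        InputStream getSofaDataStream();
    }

    /** Illustrative categories only; not part of UIMA. */
    enum Kind { TEXT, STREAM_TEXT_LIKE, STREAM_OTHER, EMPTY }

    static Kind classify(SofaView view) {
        // A non-null document text is taken as the sign of a text SofA,
        // whether or not a mime type has been set.
        if (view.getDocumentText() != null) {
            return Kind.TEXT;
        }
        // No text and no stream: nothing has been set on this view.
        if (view.getSofaDataStream() == null) {
            return Kind.EMPTY;
        }
        // Stream-backed data: consult the mime type to guess whether the
        // content could be rendered as text. Assumption: a "text/" prefix is
        // enough. This says nothing about how annotation offsets map onto
        // the stream content.
        String mime = view.getSofaMimeType();
        if (mime != null && mime.toLowerCase(Locale.ROOT).startsWith("text/")) {
            return Kind.STREAM_TEXT_LIKE;
        }
        return Kind.STREAM_OTHER;
    }

    /** Hypothetical helper that builds a fixed view for trying out classify(). */
    static SofaView view(final String text, final String mime, final byte[] data) {
        return new SofaView() {
            @Override public String getDocumentText() { return text; }
            @Override public String getSofaMimeType() { return mime; }
            @Override public InputStream getSofaDataStream() {
                return data == null ? null : new ByteArrayInputStream(data);
            }
        };
    }
}
```

With this, a view that has document text but no mime type still comes out as TEXT, while
a stream with mime type "text/html" is only STREAM_TEXT_LIKE: the content could be shown
as text, but the offsets may still refer to something else.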

Cheers,

Richard

-- 
------------------------------------------------------------------- 
Richard Eckart de Castilho
Technical Lead
Ubiquitous Knowledge Processing Lab 
FB 20 Computer Science Department      
Technische Universität Darmstadt 
Hochschulstr. 10, D-64289 Darmstadt, Germany 
phone +49 (6151) 16-7477, fax -5455, room S2/02/E225
eckartde@tk.informatik.tu-darmstadt.de 
www.ukp.tu-darmstadt.de 
------------------------------------------------------------------- 





