Hi Jörn,
> what is the suggested way to detect a text sofa?
>
> As far as I know the suggested way of doing it is via the mime type, right?
>
> Which options remain when the mime type is not set? Is CAS.getDocumentText != null appropriate?
In my opinion, a non-text SofA has getDocumentText() == null; its data would be acquired as
a stream instead.
A text SofA might contain markup, which can be reflected by the mime type.
If the data is acquired through a stream, the mime type should probably be consulted to decide
whether the content can be rendered as text. However, the mapping from begin and end offsets
to actual character offsets might not be discernible from the mime type alone, for example
if the stream returns HTML but the offsets refer to a plain-text-only "view".
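As a rough sketch of the check described above (not an official UIMA recipe), the decision
could look something like this. The class, enum and method names are made up for illustration,
and the list of "textual" mime types is only my assumption. With a real CAS you would call it
as classify(cas.getDocumentText(), cas.getSofaMimeType()).

```java
import java.util.Locale;

// Hypothetical helper, not part of UIMA: it only encodes the heuristic from this mail.
final class SofaKindSketch {

    enum Kind {
        TEXT,            // document text is available directly
        STREAM_TEXTUAL,  // no document text, but the mime type suggests text
        STREAM_OTHER     // no document text and no hint that the data is text
    }

    // documentText: value of getDocumentText(), null for a non-text SofA
    // mimeType:     value of getSofaMimeType(), may be null when not set
    static Kind classify(String documentText, String mimeType) {
        if (documentText != null) {
            return Kind.TEXT;
        }
        if (mimeType == null) {
            // Assumption: without a mime type we do not treat the stream as text.
            return Kind.STREAM_OTHER;
        }
        // Drop parameters such as "; charset=UTF-8" before comparing.
        int semicolon = mimeType.indexOf(';');
        String base = (semicolon >= 0 ? mimeType.substring(0, semicolon) : mimeType)
                .trim()
                .toLowerCase(Locale.ROOT);
        // Assumption: this short list is illustrative, not exhaustive.
        boolean textual = base.startsWith("text/")
                || base.equals("application/xml")
                || base.endsWith("+xml");
        return textual ? Kind.STREAM_TEXTUAL : Kind.STREAM_OTHER;
    }
}
```

Note that even a STREAM_TEXTUAL result says nothing about how annotation offsets map onto the
stream content; that still has to be known from context.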
Cheers,
Richard
--
-------------------------------------------------------------------
Richard Eckart de Castilho
Technical Lead
Ubiquitous Knowledge Processing Lab
FB 20 Computer Science Department
Technische Universität Darmstadt
Hochschulstr. 10, D-64289 Darmstadt, Germany
phone +49 (6151) 16-7477, fax -5455, room S2/02/E225
eckartde@tk.informatik.tu-darmstadt.de
www.ukp.tu-darmstadt.de
-------------------------------------------------------------------