uima-user mailing list archives

From Richard Eckart de Castilho <...@apache.org>
Subject Re: Collection Readers and File Format Filtering
Date Thu, 29 May 2014 18:08:08 GMT
Hello Oli,

I know of two strategies:

1) READER+AE: use a reader to control where the data is retrieved from. The reader reads the
raw data format, e.g. a PDF file. Then a subsequent analysis engine converts the raw data
into what is actually to be processed, e.g. by extracting the text from the PDF. I think that
ClearTK [1] is going in this direction nowadays.

2) READER+PLUGIN: use a reader to perform the data conversion. The reader may be configured
with a strategy that controls where the data is obtained from. DKPro Core [2] is going in
that direction. Most of its readers can be configured with a custom Spring ResourcePatternResolver,
e.g. to access files from HDFS (afaik a corresponding ResourcePatternResolver is included
in Spring for Apache Hadoop [3]). I also did a proof-of-concept ResourcePatternResolver for
Samba shares once.

I guess it boils down to whether you consider it important to have the raw data in the CAS.
Some people may see that as a benefit, others may consider it a waste of memory.
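
To make the difference between the two strategies concrete, here is a minimal sketch in plain
Java. It deliberately does not use the UIMA API: Storage, FormatFilter and Document are
hypothetical names invented for this illustration, and Document is only a stand-in for a CAS
with named views. It is meant to show where the raw data ends up in each case, not how ClearTK
or DKPro Core actually implement their readers.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class ReaderStrategies {

    /** Hypothetical: abstracts where the raw bytes come from (local disk, HDFS, Samba, ...). */
    interface Storage {
        byte[] fetch(String location);
    }

    /** Hypothetical: file-format-specific conversion from raw bytes to text. */
    interface FormatFilter {
        String toText(byte[] raw);
    }

    /** Stand-in for a CAS: a set of named views. Not a UIMA class. */
    static class Document {
        final Map<String, Object> views = new LinkedHashMap<>();
    }

    // Strategy 1 (READER+AE), step 1: the reader only retrieves the raw data.
    static Document readRaw(Storage storage, String location) {
        Document doc = new Document();
        doc.views.put("raw", storage.fetch(location));
        return doc;
    }

    // Strategy 1 (READER+AE), step 2: a later component converts the raw view.
    static void convert(Document doc, FormatFilter filter) {
        byte[] raw = (byte[]) doc.views.get("raw");
        doc.views.put("text", filter.toText(raw));
    }

    // Strategy 2 (READER+PLUGIN): the reader converts; the storage is plugged in.
    static Document readConverted(Storage storage, FormatFilter filter, String location) {
        Document doc = new Document();
        doc.views.put("text", filter.toText(storage.fetch(location)));
        return doc;
    }

    public static void main(String[] args) {
        // Toy storage and filter, assumptions made up for this example only.
        Storage inMemory = location -> "<p>hello</p>".getBytes(StandardCharsets.UTF_8);
        FormatFilter stripTags =
                raw -> new String(raw, StandardCharsets.UTF_8).replaceAll("<[^>]+>", "");

        Document d1 = readRaw(inMemory, "doc-1");
        convert(d1, stripTags);
        System.out.println(d1.views.keySet()); // [raw, text]: raw data stays in the document

        Document d2 = readConverted(inMemory, stripTags, "doc-1");
        System.out.println(d2.views.keySet()); // [text]: raw data is discarded
    }
}
```

In both variants the storage and the format filter are separate pieces. The difference is only
whether the raw bytes are kept alongside the extracted text, which is the memory trade-off
mentioned above.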

In the olden times, there was a thing called CasInitializer [4] which appears to have been
a plugin that a reader could use to extract information from the raw data and fill it into
the CAS. That sounds like approach 2) mentioned above. However, the CasInitializer has been
deprecated for quite some time now and its Javadoc says to use different views instead (which
sounds like approach 1). Maybe somebody else can provide some detail as to why the
CasInitializer was deprecated - I never used it, but I always thought it sounded like quite a
useful concept.


-- Richard

[1] http://cleartk.googlecode.com
[2] https://code.google.com/p/dkpro-core-asl/
[3] http://projects.spring.io/spring-hadoop/
[4] http://uima.apache.org/downloads/releaseDocs/2.3.0-incubating/docs/api/org/apache/uima/collection/CasInitializer.html

P.S.: none of the mentioned projects are ASF projects. I am affiliated with the DKPro Core

On 29.05.2014, at 15:11, Oliver Christ <ochrist@EBSCO.COM> wrote:

> Hi,
> From my (still very limited) UIMA experience it seems that collection readers address
> how to retrieve documents from some location and how to import (or filter) that document into
> the CAS.
> Filtering (i.e. file format-specific processing) can be seen as independent of where
> the data is retrieved from. I'm wondering whether there's a "UIMA way" to separate the two
> aspects, i.e. a model consisting of two components; one which abstracts storage and retrieval,
> and the second addressing file format filtering.
> Thanks!
> Cheers, Oli
