Subject: Re: Collection Readers and File Format Filtering
From: Richard Eckart de Castilho
Date: Thu, 29 May 2014 20:08:08 +0200
To: user@uima.apache.org

Hello Oli,

I know of two strategies:

1) READER+AE: use a reader to control where the data is retrieved from. The reader reads the raw data format, e.g. a PDF file. Then a subsequent analysis engine converts the raw data into what is actually to be processed, e.g. extracting the text from the PDF. I think that ClearTK [1] is going in this direction nowadays.

2) READER+PLUGIN: use a reader to perform the data conversion.
The reader may be configured with a strategy that controls where the data is obtained from. DKPro Core [2] is going in that direction. Most readers can be configured with a custom Spring ResourcePatternResolver, e.g. to access files from HDFS (afaik a corresponding ResourcePatternResolver is included in Spring for Apache Hadoop [3]). I also did a proof-of-concept ResourcePatternResolver for Samba shares once.

I guess it boils down to whether you consider it important to have the raw data in the CAS. Some people may see that as a benefit, others may consider it a waste of memory.

In the olden times, there was a thing called CasInitializer [4] which appears to have been a plugin that a reader could use to extract information from the raw data and fill it into the CAS. Sounds like approach 2) mentioned above. However, the CasInitializer has been deprecated for quite some time now and its Javadoc says to use different views instead (sounds like approach 1). Maybe somebody else can provide some detail as to why the CasInitializer was deprecated - I never used it, but I always thought it sounded like quite a useful concept.

Cheers,

-- Richard

[1] http://cleartk.googlecode.com
[2] https://code.google.com/p/dkpro-core-asl/
[3] http://projects.spring.io/spring-hadoop/
[4] http://uima.apache.org/downloads/releaseDocs/2.3.0-incubating/docs/api/org/apache/uima/collection/CasInitializer.html

P.S.: none of the mentioned projects are ASF projects. I am affiliated with the DKPro Core project.

On 29.05.2014, at 15:11, Oliver Christ wrote:

> Hi,
>
> From my (still very limited) UIMA experience it seems that collection readers address how to retrieve documents from some location and how to import (or filter) that document into the CAS.
>
> Filtering (i.e. file format-specific processing) can be seen as independent of where the data is retrieved from.
> I'm wondering whether there's a "UIMA way" to separate the two aspects, i.e. a model consisting of two components; one which abstracts storage and retrieval, and the second addressing file format filtering.
>
> Thanks!
>
> Cheers, Oli
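
The two-component split asked about in the quoted message (one part abstracting storage and retrieval, one part handling the file format) is close to strategy 2 above. A minimal plain-Java sketch of that idea follows. It deliberately uses no UIMA or Spring API, and every name in it (DocumentSource, FormatFilter, InMemorySource, TagStrippingFilter, PluggableReader) is hypothetical and chosen for illustration only; it is not how DKPro Core or ClearTK implement their readers.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical interface: abstracts WHERE raw bytes come from
// (local files, HDFS, a Samba share, ...).
interface DocumentSource {
    Map<String, byte[]> fetchAll();
}

// Hypothetical interface: abstracts HOW a raw format becomes plain text.
interface FormatFilter {
    String toText(byte[] raw);
}

// In-memory stand-in for a real storage backend.
class InMemorySource implements DocumentSource {
    private final Map<String, byte[]> docs = new LinkedHashMap<>();

    InMemorySource add(String name, String content) {
        docs.put(name, content.getBytes(StandardCharsets.UTF_8));
        return this;
    }

    @Override
    public Map<String, byte[]> fetchAll() {
        return docs;
    }
}

// Toy format filter: removes anything that looks like an XML/HTML tag.
class TagStrippingFilter implements FormatFilter {
    @Override
    public String toText(byte[] raw) {
        String s = new String(raw, StandardCharsets.UTF_8);
        return s.replaceAll("<[^>]*>", "").trim();
    }
}

// Plays the role of the reader in strategy 2: it is configured with a
// storage plug-in and a format plug-in and combines the two.
class PluggableReader {
    private final DocumentSource source;
    private final FormatFilter filter;

    PluggableReader(DocumentSource source, FormatFilter filter) {
        this.source = source;
        this.filter = filter;
    }

    // Returns document name -> extracted text, in source order.
    Map<String, String> readAll() {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, byte[]> e : source.fetchAll().entrySet()) {
            out.put(e.getKey(), filter.toText(e.getValue()));
        }
        return out;
    }
}

class ReaderSketch {
    public static void main(String[] args) {
        DocumentSource source =
            new InMemorySource().add("a.html", "<p>Hello <b>UIMA</b></p>");
        PluggableReader reader =
            new PluggableReader(source, new TagStrippingFilter());
        System.out.println(reader.readAll()); // prints {a.html=Hello UIMA}
    }
}
```

Under strategy 1 the same FormatFilter logic would instead live in a separate analysis engine that runs after the reader, so the raw bytes would remain available downstream at the cost of extra memory.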