uima-user mailing list archives

From Richard Eckart de Castilho <...@apache.org>
Subject Re: Designing collection readers: Reading multiple XML files containing multiple CASes
Date Mon, 07 Oct 2013 14:05:55 GMT
In the readers of the DKPro Core collection, a reader is in most cases responsible for a particular
format, not for a kind of data source (e.g. a URI). If a format stores multiple documents in
the same file, we extract one part of the data, fill the CAS, and keep the stream to that
file open so that the next call can continue where we left off.
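
A minimal sketch of the "keep the stream open" idea, using only the JDK's StAX parser rather than the UIMA API. The class name SectionReader, the tag name passed to it, and the assumption that each section contains only text are all hypothetical; in a real collection reader, next() would correspond to getNext(CAS) and would set the document text on the CAS instead of returning a String.

```java
import java.io.Reader;
import java.util.NoSuchElementException;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical reader: each call to next() yields the text of one section element. */
class SectionReader implements AutoCloseable {
    private final XMLStreamReader xml;  // stays open between calls
    private final String sectionTag;    // element name that marks one document
    private String pending;             // text of the next section, or null at end of input

    SectionReader(Reader source, String sectionTag) throws XMLStreamException {
        this.xml = XMLInputFactory.newInstance().createXMLStreamReader(source);
        this.sectionTag = sectionTag;
        advance(); // look ahead so hasNext() can answer without parsing
    }

    boolean hasNext() {
        return pending != null;
    }

    String next() throws XMLStreamException {
        if (pending == null) {
            throw new NoSuchElementException("no more sections");
        }
        String current = pending;
        advance(); // continue from where the previous call stopped
        return current;
    }

    /** Moves the open stream forward to the next matching element, if any. */
    private void advance() throws XMLStreamException {
        pending = null;
        while (xml.hasNext()) {
            if (xml.next() == XMLStreamConstants.START_ELEMENT
                    && sectionTag.equals(xml.getLocalName())) {
                // Assumption: a section holds text only (no child elements).
                pending = xml.getElementText();
                return;
            }
        }
    }

    @Override
    public void close() throws XMLStreamException {
        xml.close();
    }
}
```

The one-element look-ahead is there so that hasNext() stays cheap and side-effect free, which matches how a pipeline typically polls a reader before asking for the next CAS.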

We tend to handle the data source abstraction via Spring resource resolvers. If we want to
read from some place other than the file system or classpath, then we can plug an alternative
resolver into a reader, e.g. for HDFS or CIFS file systems.
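
To illustrate the separation of format from data source, here is a rough plain-JDK stand-in. It is not Spring's resolver API and not DKPro Core code: SourceResolver and ResolverRegistry are hypothetical names, and the scheme-based lookup is just one possible way to make the source pluggable.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical abstraction: turns a location into a stream, whatever the source is. */
interface SourceResolver {
    InputStream open(URI location) throws IOException;
}

/** Hypothetical registry a reader could consult instead of opening files itself. */
class ResolverRegistry {
    private final Map<String, SourceResolver> byScheme = new HashMap<>();

    ResolverRegistry() {
        // Defaults for local files and plain HTTP(S) downloads.
        register("file", uri -> Files.newInputStream(Paths.get(uri)));
        register("http", uri -> uri.toURL().openStream());
        register("https", uri -> uri.toURL().openStream());
    }

    /** Plug in an alternative resolver, e.g. for a distributed file system. */
    void register(String scheme, SourceResolver resolver) {
        byScheme.put(scheme, resolver);
    }

    InputStream open(URI location) throws IOException {
        SourceResolver resolver = byScheme.get(location.getScheme());
        if (resolver == null) {
            throw new IOException("No resolver for scheme: " + location.getScheme());
        }
        return resolver.open(location);
    }
}
```

A format-specific reader would then only ever see an InputStream, so supporting a new kind of source means registering one more resolver rather than writing a new reader.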


-- Richard

On 07.10.2013, at 15:59, Thilo Goetz <twgoetz@gmx.de> wrote:

> I just want to point out that there is an alternative.  I never use collection readers
> and cas consumers myself.  Instead, I do the reading of the input and the aggregation of the
> output outside the framework, where I have more control over things.  Just my opinion though.
> See http://uima.apache.org/d/uimaj-2.4.2/tutorials_and_users_guides.html#ugr.tug.application.using_aes
> on how to do that.
> --Thilo
> On 10/07/2013 03:19 AM, swirl wrote:
>> Hi,
>> I am wondering if anyone has a better idea.
>> Requirement:
>> a. I have a pipeline that needs to process a bunch of XML files.
>> b. The XML files could be on the disk, or from a remote location (available
>> via a HTTP GET call, e.g. http://example.com/inputFiles/001.xml)
>> c. Each XML file contains multiple sections, each section's content should be
>> parsed to produce a separate CAS
>> d. I need to be able to parse XML of different schema. Although the assumption
>> is that each pipeline run can only handle one specific XML schema. That is, I
>> do not need to handle different XML schema in each pipeline run.
>> e. With the above, I need to be able to construct a new collection reader
>> and parser based on the specific needs of each application.
>> f. For example, I can specify that the XML files are in a disk folder, and to
>> use parser A to decode the specific schema of the XML files. In another
>> pipeline, I can specify to the collection reader a list of URLs to retrieve
>> some remote XML files and parse them using parser B.
>> Here is what I have so far:
>> a. I am using cleartk's UriCollectionReader to insert URIs of files into the
>> CAS from local disk folders and remote URIs. So far so good.
>> b. I created an AE UriToDocumentAnnotatorA that can read the URI in the CAS
>> and parse the file according to XML schema A.
>> c. But the above only produces 1 CAS per XML file. Requirement c. is not
>> fulfilled. I need to produce multiple CASes from a single XML file. How do I
>> do this?
>> Thanks in advance.
