uima-user mailing list archives

From Jens Grivolla <j+...@grivolla.net>
Subject Re: Designing collection readers: Reading multiple XML files containing multiple CASes
Date Thu, 10 Oct 2013 09:24:30 GMT
It sounds to me like it would be much easier to just have a custom 
collection reader that outputs one CAS per document (i.e. multiple CASes 
per input file), rather than having a CR that outputs one CAS per file 
(with just metadata) plus an additional AE to generate the "real" CASes 
from there.

Do you have a specific reason for not simply writing a Collection Reader 
that does what you want?
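
As a rough illustration of the iteration logic such a reader would need, here is a minimal plain-Java sketch. It deliberately does not depend on UIMA. The names `SectionParser`, `XmlSource`, `SectionElementParser` and `MultiSectionIterator` are hypothetical, and the `<section>` element is only an assumed schema. In an actual reader (for example one extending `CollectionReader_ImplBase`), `hasNext()` could delegate to this logic and `getNext(CAS)` could put the returned text into the CAS with `setDocumentText`.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.NodeList;

/** Hypothetical schema-specific parser: turns one XML input into section texts. */
interface SectionParser {
    List<String> parse(InputStream xml) throws Exception;
}

/** Hypothetical input source (a local file, an HTTP GET, ...). */
interface XmlSource {
    InputStream open() throws IOException;
}

/** Example parser for an assumed schema where each <section> element is one document. */
class SectionElementParser implements SectionParser {
    @Override
    public List<String> parse(InputStream xml) throws Exception {
        NodeList nodes = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(xml)
                .getElementsByTagName("section");
        List<String> texts = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            texts.add(nodes.item(i).getTextContent().trim());
        }
        return texts;
    }
}

/** Iteration core: yields one element per section, across all inputs. */
class MultiSectionIterator implements Iterator<String> {
    private final Iterator<XmlSource> sources;
    private final SectionParser parser;
    private final Deque<String> pending = new ArrayDeque<>();

    MultiSectionIterator(List<XmlSource> sources, SectionParser parser) {
        this.sources = sources.iterator();
        this.parser = parser;
    }

    @Override
    public boolean hasNext() {
        // Keep opening inputs until a section is buffered or no inputs remain.
        // Inputs without any sections are skipped.
        while (pending.isEmpty() && sources.hasNext()) {
            try (InputStream in = sources.next().open()) {
                pending.addAll(parser.parse(in));
            } catch (Exception e) {
                throw new IllegalStateException("Could not read input", e);
            }
        }
        return !pending.isEmpty();
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return pending.poll();
    }
}
```

With this split, a local file could be supplied as `() -> Files.newInputStream(path)` and a remote one as `() -> url.openStream()`, while a different schema only needs a different `SectionParser`. Whether that matches your setup is of course an assumption on my part.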


On 10/07/2013 03:19 AM, swirl wrote:
> Hi,
> I am wondering if anyone has a better idea.
> Requirement:
> a. I have a pipeline that needs to process a bunch of XML files.
> b. The XML files could be on disk or at a remote location (available
> via an HTTP GET call, e.g. http://example.com/inputFiles/001.xml)
> c. Each XML file contains multiple sections; each section's content should be
> parsed to produce a separate CAS
> d. I need to be able to parse XML files with different schemas, although the
> assumption is that each pipeline run handles only one specific XML schema. That is, I
> do not need to handle different XML schemas in a single pipeline run.
> e. Given the above, I need to be able to construct a new collection reader and
> parser based on the specific needs of each application.
> f. For example, I can specify that the XML files are in a disk folder, and to
> use parser A to decode the specific schema of the XML files. In another
> pipeline, I can specify to the collection reader a list of URLs to retrieve
> some remote XML files and parse them using parser B.
> Here is what I have so far:
> a. I am using ClearTK's UriCollectionReader to insert URIs of files into the
> CAS from local disk folders and remote URIs. So far so good.
> b. I created an AE, UriToDocumentAnnotatorA, that reads the URI in the CAS
> and parses the file according to XML schema A.
> c. But the above only produces one CAS per XML file. Requirement c. is not
> fulfilled. I need to produce multiple CASes from a single XML file. How do I
> do this?
> Thanks in advance.
