uima-user mailing list archives

From Marshall Schor <...@schor.com>
Subject Re: Designing collection readers: Reading multiple XML files containing multiple CASes
Date Mon, 07 Oct 2013 02:13:23 GMT
For part c:

I imagine an algorithm that scans the main XML file and finds the "sections".
For each section it finds, it produces a CAS and initializes that CAS with the
section's information.

If this algorithm lives inside an analysis component, then that component can
act as a "CAS Multiplier" and produce the additional CASes, one for each section.

See
http://uima.apache.org/d/uimaj-2.4.2/tutorials_and_users_guides.html#ugr.tug.cm
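
As a rough illustration of the shape of that algorithm, here is a minimal
sketch in plain Java with no UIMA dependency. The class name
SectionMultiplierSketch, the "section" element name, and the use of a String
in place of a CAS are all assumptions made for the example and do not come
from any real schema. The process / hasNext / next methods only mirror the
calling pattern a CAS Multiplier follows. In a real one, next() would obtain
an empty CAS from the framework and fill it with the section's content
instead of returning a String.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical sketch: one XML document in, one output item per section out. */
final class SectionMultiplierSketch {
    private final String sectionTag;                    // assumed element name, e.g. "section"
    private final Deque<String> pending = new ArrayDeque<>();

    SectionMultiplierSketch(String sectionTag) {
        this.sectionTag = sectionTag;
    }

    /** Scans the XML once and queues the text content of every matching element. */
    void process(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // Matches all descendants with this tag name, in document order;
            // nested sections, if the schema allows them, would also be queued.
            NodeList nodes = doc.getElementsByTagName(sectionTag);
            for (int i = 0; i < nodes.getLength(); i++) {
                pending.add(nodes.item(i).getTextContent());
            }
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML input", e);
        }
    }

    /** True while there are sections that have not been handed out yet. */
    boolean hasNext() {
        return !pending.isEmpty();
    }

    /** Returns the next section; a real CAS Multiplier would return a filled CAS here. */
    String next() {
        return pending.remove();
    }
}
```

The caller passes each document to process() and then drains hasNext() /
next() before moving on to the next document, so one input file yields as
many outputs as it has sections. Only the parsing inside process() would
differ between schema A and schema B.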

Is that what you're looking for, or is that off-base?

-Marshall

On 10/6/2013 9:19 PM, swirl wrote:
> Hi,
> I am wondering if anyone has a better idea.
>
> Requirement:
> a. I have a pipeline that needs to process a bunch of XML files.
> b. The XML files could be on disk, or at a remote location (available
> via an HTTP GET call, e.g. http://example.com/inputFiles/001.xml)
> c. Each XML file contains multiple sections; each section's content should be
> parsed to produce a separate CAS.
> d. I need to be able to parse XML of different schemas, although the assumption
> is that each pipeline run handles only one specific XML schema. That is, I
> do not need to handle different XML schemas within a single pipeline run.
> e. Given the above, I need to be able to construct a new collection reader
> and parser based on the specific needs of each application.
> f. For example, I can specify that the XML files are in a disk folder and
> use parser A to decode the specific schema of those XML files. In another
> pipeline, I can give the collection reader a list of URLs, retrieve
> some remote XML files, and parse them using parser B.
>
> Here is what I have so far:
> a. I am using cleartk's UriCollectionReader to insert URIs of files into the
> CAS from local disk folders and remote URIs. So far so good.
> b. I created an AE, UriToDocumentAnnotatorA, that reads the URI in the CAS
> and parses the file according to XML schema A.
> c. But the above only produces one CAS per XML file, so requirement c. is not
> fulfilled. I need to produce multiple CASes from a single XML file. How do I
> do this?
>
> Thanks in advance.
>
>
>

