To: user@uima.apache.org
From: swirl
Subject: Designing collection readers: Reading multiple XML files containing multiple CASes
Date: Mon, 7 Oct 2013 01:19:12 +0000 (UTC)

Hi,

I am wondering if anyone has a better idea.

Requirements:

a. I have a pipeline that needs to process a bunch of XML files.
b. The XML files could be on disk or at a remote location (available via an HTTP GET call, e.g. http://example.com/inputFiles/001.xml).
c. Each XML file contains multiple sections, and each section's content should be parsed to produce a separate CAS.
d. I need to be able to parse XML of different schemas, although each pipeline run only has to handle one specific XML schema. That is, I do not need to handle different XML schemas within a single pipeline run.
e. Given the above, I need to be able to construct a new collection reader and parser based on the specific needs of each application.
f. For example, I can specify that the XML files are in a disk folder and that parser A should be used to decode their specific schema. In another pipeline, I can give the collection reader a list of URLs from which to retrieve remote XML files, and parse them using parser B.

Here is what I have so far:

a. I am using ClearTK's UriCollectionReader to insert the URIs of files into the CAS, from both local disk folders and remote URIs. So far so good.
b. I created an AE, UriToDocumentAnnotatorA, that reads the URI in the CAS and parses the file according to XML schema A.
c. But the above only produces one CAS per XML file, so requirement c. is not fulfilled. I need to produce multiple CASes from a single XML file.

How do I do this?

Thanks in advance.
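
To make requirement c. concrete, here is a minimal sketch of the splitting step on its own, using only the JDK and no UIMA classes. It assumes a hypothetical schema in which every section is a `<section>` element; the class name `SectionSplitter`, the method `splitAll`, and the tag name are all made up for illustration. The input is an `InputStream`, so it could come from a local file or from `uri.toURL().openStream()` for a remote one.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

/** Hypothetical helper: turns one XML document into one text per section. */
class SectionSplitter implements Iterator<String> {
    private final NodeList sections;
    private int index = 0;

    /** sectionTag is an assumption about the schema, e.g. "section". */
    SectionSplitter(InputStream xml, String sectionTag) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(xml);
            // Elements are returned in document order.
            this.sections = doc.getElementsByTagName(sectionTag);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML input", e);
        }
    }

    /** True while there are sections that have not been handed out yet. */
    @Override
    public boolean hasNext() {
        return index < sections.getLength();
    }

    /** Returns the trimmed text content of the next section. */
    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return sections.item(index++).getTextContent().trim();
    }

    /** Convenience wrapper: collects all section texts of an XML string. */
    static List<String> splitAll(String xml, String sectionTag) {
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        List<String> result = new ArrayList<>();
        new SectionSplitter(in, sectionTag).forEachRemaining(result::add);
        return result;
    }

    public static void main(String[] args) {
        String xml = "<doc><section>first</section><section>second</section></doc>";
        for (String text : splitAll(xml, "section")) {
            System.out.println(text); // each text would become its own CAS
        }
    }
}
```

The `hasNext()`/`next()` shape is the same one that UIMA's CAS multiplier base classes (for example `JCasMultiplier_ImplBase`) expose, so logic like this could plausibly sit behind one, with a different splitter per schema. That wiring is not shown here and is an assumption rather than something tested.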