From: Matthew Campbell
Date: Wed, 22 Aug 2007 09:43:28 -0400
To: uima-user@incubator.apache.org
Subject: Re: Multi-Document Processing

Thanks so much! That does help - I'm still fiddling with making sure my
various Sofas are getting through alright, but this gets me headed in the
right direction.

-Matt

Marshall Schor wrote:
> Matthew Campbell wrote:
>> Hey folks:
>>
>> I'm looking at a process that runs each document through a bunch
>> of annotators to tag up various information, then I need to do some
>> processing/manipulation of those documents based on the information
>> held in the whole collection. I've been reading up on the CPE, but it
>> looks like it's primarily for running a collection of documents
>> through an AE. I was hoping someone could point me in the right
>> direction for doing the collection-wide processing portion of my
>> process.
>>
>> I had started out by defining the process as one large aggregate
>> AE and running each document through it, but I don't see a way to go
>> through that initial tagging process for all documents and then move
>> on to the next phase.
>>
>> I then switched gears and tried splitting up each phase into its
>> own AE, but then I lose the complex Sofa mappings I had put together
>> for the previous attempt. So I guess this could be solved in two
>> ways. One would be that the CPE has some sort of built-in method for
>> doing collection-wide processing and manipulation (i.e., "first
>> identify all location names in all documents, then replace each with
>> a new name, but make sure the new name doesn't appear in any other
>> document"). The other would be to somehow run through the first
>> phase to identify everything, do processing using the resulting
>> collection of JCases, then pump each JCas into a second AE for doing
>> post-processing stuff. Somewhere in there would have to be some
>> dynamically-mapped Sofas from the phase 1 AE to the phase 2 AE.
>>
>> I hope that described my goal well enough, and thanks ahead of
>> time for any pointers you guys can throw my way.
>>
> The way many do things like this is to have a singleton Annotator at
> the end of the pipeline, which sees all of the CASes being processed
> after they've been "tagged" by earlier annotators. This annotator
> would have some persistent Java object(s) that accumulated information
> across the entire document collection, and would have a
> collection-processing-complete method which it would register with the
> CPM so it could be called at the end of processing the collection.
> This method would then use the accumulated information to do whatever
> processing you wanted to do at that point.
>
> Would that work?
> -Marshall
>
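
The accumulate-then-finish pattern described in the quoted reply can be
sketched in plain Java. This is only an illustration of the shape, not UIMA
code: the class CollectionWideAccumulator, its method signatures, and the
"Place-N" naming scheme are all hypothetical stand-ins, and a real version
would live in an annotator at the end of the pipeline and read the tagged
locations from each CAS instead of taking them as a list.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/**
 * Hypothetical stand-in for a last-in-pipeline annotator. It keeps state
 * across documents and does its collection-wide work in a single call made
 * after the final document has been processed.
 */
class CollectionWideAccumulator {
    // Persistent state: every location name tagged anywhere in the collection.
    private final Set<String> seenLocations = new LinkedHashSet<>();

    /** Called once per document, after earlier annotators have tagged it. */
    void process(List<String> taggedLocations) {
        seenLocations.addAll(taggedLocations);
    }

    /**
     * Called once, after the last document. Assigns each location a new name
     * that does not collide with any location tagged in the collection.
     * Assumption: only tagged location names are checked, not arbitrary text.
     */
    Map<String, String> collectionProcessComplete() {
        Map<String, String> replacements = new LinkedHashMap<>();
        int counter = 1;
        for (String location : seenLocations) {
            String candidate;
            do {
                // The counter only grows, so no candidate is issued twice.
                candidate = "Place-" + counter++;
            } while (seenLocations.contains(candidate));
            replacements.put(location, candidate);
        }
        return replacements;
    }
}
```

A driver would call process(...) once per document and then call
collectionProcessComplete() a single time; the returned map could feed a
second pass that rewrites each document.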