Reply-To: uima-user@incubator.apache.org
Message-ID: <46CC3D80.2070101@syrres.com>
Date: Wed, 22 Aug 2007 09:43:28 -0400
From: Matthew Campbell <mcampbell@syrres.com>
To: uima-user@incubator.apache.org
Subject: Re: Multi-Document Processing
References: <46CA075F.7090201@syrres.com> <46CAD44E.5060307@schor.com>
In-Reply-To: <46CAD44E.5060307@schor.com>

Thanks so much!  That does help - I'm still fiddling with making sure my
various Sofas are getting through all right, but this points me in the
right direction.


-Matt

Marshall Schor wrote:
> Matthew Campbell wrote:
>> Hey folks:
>>
>>    I'm looking at a process that runs each document through a bunch 
>> of annotators to tag up various information, then I need to do some 
>> processing/manipulation of those documents based on the information held
>> in the whole collection.  I've been reading up on the CPE, but it 
>> looks like it's primarily for running a collection of documents 
>> through an AE.  I was hoping someone could point me in the right 
>> direction for doing the collection-wide processing portion of my 
>> process.
>>    I had started out by defining the process as one large aggregate 
>> AE and running each document through it, but I don't see a way to go 
>> through that initial tagging process for all documents and then move 
>> on to the next phase.
>>    I then switched gears and tried splitting up each phase into its
>> own AE, but then I lose the complex Sofa mappings I had put together
>> for the previous attempt.  So I guess this could be solved in two 
>> ways - one would be that the CPE has some sort of built-in method for 
>> doing collection-wide processing and manipulation (e.g., "first
>> identify all location names in all documents, then replace each with 
>> a new name, but make sure the new name doesn't appear in any other 
>> document").  The other would be to somehow run through the first 
>> phase to identify everything, do processing using the resulting
>> collection of JCas objects, then pump each JCas into a second AE for doing
>> post-processing stuff.  Somewhere in there would have to be some 
>> dynamically-mapped Sofas from the phase 1 AE to the phase 2 AE.
>>
>>    I hope that described my goal well enough, and thanks ahead of 
>> time for any pointers you guys can throw my way.
>>
> The way many do things like this is to have a singleton Annotator at 
> the end of the pipeline, which sees all of the CASes being processed
> after they've been "tagged" by earlier annotators.  This annotator 
> would have some persistent Java object(s) that accumulated information 
> across the entire document collection, and would have a 
> collection-processing-complete method which it would register with the 
> CPM so it could be called at the end of processing the collection.  
> This method would then use the accumulated information to do whatever 
> processing you wanted to do at that point.
>
> Would that work?
> -Marshall
>
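
For reference, the accumulate-then-finish pattern Marshall describes might look roughly like the sketch below. It is plain Java with no UIMA dependency so that it runs on its own. In a real annotator you would presumably extend JCasAnnotator_ImplBase, do the accumulation in process(JCas), and override collectionProcessComplete(). The class name LocationAccumulator, the String/List stand-ins for a CAS, and the "Location-N" naming scheme are all made up for illustration. The uniqueness check only looks at the tagged location names, not at the full document text.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/**
 * Hypothetical stand-in for a single-instance annotator at the end of the
 * pipeline. It sees every tagged document and keeps state across calls.
 */
class LocationAccumulator {
    // Persistent state: location name -> ids of the documents that mention it.
    private final Map<String, Set<String>> docsByLocation = new TreeMap<>();

    /** Per-document step (stands in for process(JCas) in a UIMA annotator). */
    void process(String docId, List<String> taggedLocations) {
        for (String location : taggedLocations) {
            docsByLocation.computeIfAbsent(location, k -> new TreeSet<>()).add(docId);
        }
    }

    /** Ids of the documents in which the given location was tagged. */
    Set<String> documentsMentioning(String location) {
        return docsByLocation.getOrDefault(location, new TreeSet<>());
    }

    /**
     * End-of-collection step (stands in for collectionProcessComplete()).
     * Picks a replacement for each location, skipping any candidate that was
     * itself tagged as a location somewhere in the collection.
     */
    Map<String, String> collectionProcessComplete() {
        Map<String, String> replacements = new TreeMap<>();
        int counter = 1;
        for (String location : docsByLocation.keySet()) {
            String candidate;
            do {
                candidate = "Location-" + counter++;
            } while (docsByLocation.containsKey(candidate));
            replacements.put(location, candidate);
        }
        return replacements;
    }
}
```

The driver would call process(...) once per document and collectionProcessComplete() once at the end. Applying the replacements back to each document is a second pass, which this sketch leaves out.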