uima-user mailing list archives

From Matthew Campbell <mcampb...@syrres.com>
Subject Re: Multi-Document Processing
Date Wed, 22 Aug 2007 13:43:28 GMT
Thanks so much!  That does help - I'm still fiddling with making sure my 
various Sofas are getting through all right, but this points me in the 
right direction.


Marshall Schor wrote:
> Matthew Campbell wrote:
>> Hey folks:
>>    I'm looking at a process that runs each document through a bunch 
>> of annotators to tag up various information, then I need to do some 
>> processing/manipulation of those documents based on the information held 
>> in the whole collection.  I've been reading up on the CPE, but it 
>> looks like it's primarily for running a collection of documents 
>> through an AE.  I was hoping someone could point me in the right 
>> direction for doing the collection-wide processing portion of my 
>> process.
>>    I had started out by defining the process as one large aggregate 
>> AE and running each document through it, but I don't see a way to go 
>> through that initial tagging process for all documents and then move 
>> on to the next phase.
>>    I then switched gears and tried splitting up each phase into its 
>> own AE, but then I lose the complex Sofa mappings I had put together 
>> for the previous attempt.  So I guess this could be solved in two 
>> ways - one would be that the CPE has some sort of built-in method for 
>> doing collection-wide processing and manipulation (i.e., "first 
>> identify all location names in all documents, then replace each with 
>> a new name, but make sure the new name doesn't appear in any other 
>> document").  The other would be to somehow run through the first 
>> phase to identify everything, do processing using the resulting 
>> collection of JCas objects, then pump each JCas into a second AE for doing 
>> post-processing stuff.  Somewhere in there would have to be some 
>> dynamically-mapped Sofas from the phase 1 AE to the phase 2 AE.
>>    I hope that described my goal well enough, and thanks ahead of 
>> time for any pointers you guys can throw my way.
> The way many do things like this is to have a singleton Annotator at 
> the end of the pipeline, which sees all of the CASes being processed 
> after they've been "tagged" by earlier annotators.  This annotator 
> would have some persistent Java object(s) that accumulated information 
> across the entire document collection, and would have a 
> collection-processing-complete method which it would register with the 
> CPM so it could be called at the end of processing the collection.  
> This method would then use the accumulated information to do whatever 
> processing you wanted to do at that point.
> Would that work?
> -Marshall
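
A minimal sketch of the pattern Marshall describes, written in plain Java so it runs without UIMA on the classpath. Everything named here (`CollectionWideSketch`, `LocationCollector`, the `LOC_n` naming scheme) is hypothetical and only for illustration. In a real pipeline the collector would presumably extend `JCasAnnotator_ImplBase`, read location annotations from the `JCas` in `process(JCas)`, and do its end-of-collection work in `collectionProcessComplete()`. The sketch also assumes that "doesn't appear in any other document" can be approximated by checking new names against every tagged location seen so far.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CollectionWideSketch {

    /**
     * Hypothetical stand-in for the last annotator in the pipeline.
     * A single shared instance must see every document.
     */
    static class LocationCollector {
        // Persistent state that outlives any one document.
        private final Set<String> seenLocations = new LinkedHashSet<>();

        /** Called once per document, after earlier annotators have tagged it. */
        void process(List<String> taggedLocations) {
            seenLocations.addAll(taggedLocations);
        }

        /**
         * Called once after the last document. Maps each location seen in the
         * collection to a new name that is not itself a seen location.
         */
        Map<String, String> collectionProcessComplete() {
            Map<String, String> replacements = new LinkedHashMap<>();
            int counter = 1;
            for (String location : seenLocations) {
                String candidate = "LOC_" + counter++;
                // Skip generated names that already occur in the collection.
                while (seenLocations.contains(candidate)) {
                    candidate = "LOC_" + counter++;
                }
                replacements.put(location, candidate);
            }
            return replacements;
        }
    }

    public static void main(String[] args) {
        LocationCollector collector = new LocationCollector();
        // Two "documents", each reduced to the locations tagged in it.
        collector.process(Arrays.asList("Paris", "LOC_1"));
        collector.process(Arrays.asList("Rome", "Paris"));
        System.out.println(collector.collectionProcessComplete());
    }
}
```

Note that the replacement map only exists once the whole collection has been seen, so applying it to the documents would still need a second pass, or the collector would have to hold on to whatever it needs from each document until the end.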
