uima-user mailing list archives

From Thilo Goetz <twgo...@gmx.de>
Subject Re: Using the Cas to compare documents
Date Thu, 25 Jun 2009 11:01:26 GMT
Radwen ANIBA wrote:
> Hi everyone,
> 
> Following the example applications that ship with UIMA lets us understand how
> every component in the UIMA framework works. That's great. But one question a
> developer may ask is how to use the CAS to compare analyzed
> documents.
> 
> The CAS is common to every document, and when analyzing one of them we have
> access to the CAS for writing or updating.
> Let's imagine we have 3 documents to analyze. We write metadata about each of
> them to the CAS, but to take the analysis of the documents further
> it could be very interesting to compare these documents using the CAS,
> either all together or pairwise.
> 
> To illustrate what I'm saying, let's imagine we are looking for email
> addresses inside three big documents using UIMA's regexp capabilities.
> A result might look like this:
> 
> Document 1: Number of unique emails 9 | Number of emails in common with
> Document 2: 10 | Number of emails in common with Document 3: 6
> Document 2: Number of unique emails 5 | Number of emails in common with
> Document 1: 20 | Number of emails in common with Document 3: 1
> Document 3: Number of unique emails 4 | Number of emails in common with
> Document 1: 15 | Number of emails in common with Document 2: 3
> 
> This is a simple pairwise cross-comparison of documents using the CAS. My
> question is how to achieve that.
> Do we need to create an additional type system for the common information? Do
> we have to do it dynamically, on the fly?
> 
> Thanks
> 
> Rad
> 

Hi Rad,

Using the CAS to do this will get expensive very quickly.  You will
not want to keep every document in its own CAS because of the memory
overhead.  I would probably write the information you're interested
in to an external datastore (e.g., a DB such as Derby) and do the
comparison there.

--Thilo
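
To make the suggestion above concrete, here is a minimal sketch of the idea
of keeping only the extracted values outside the CAS and comparing those. It
is not code from this thread and not a UIMA API: the class EmailOverlap and
its methods are hypothetical, and a plain in-memory map stands in for the
external datastore. With a real database such as Derby, the map would
presumably become a table of (document id, email) rows and the pairwise count
a join of that table with itself.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of UIMA. It keeps only the extracted
// e-mail strings per document, so no CAS has to stay in memory.
final class EmailOverlap {

    // document id -> distinct e-mail addresses found in that document
    private final Map<String, Set<String>> emailsByDoc = new LinkedHashMap<>();

    // Assumed to be called once per e-mail annotation (for example from a
    // CAS consumer) with the annotation's covered text.
    void record(String docId, String email) {
        emailsByDoc.computeIfAbsent(docId, k -> new HashSet<>()).add(email);
    }

    // Number of distinct addresses recorded for one document.
    int distinctCount(String docId) {
        return emailsByDoc.getOrDefault(docId, Collections.emptySet()).size();
    }

    // Number of addresses that occur in both documents (set intersection).
    int commonCount(String docA, String docB) {
        Set<String> shared =
            new HashSet<>(emailsByDoc.getOrDefault(docA, Collections.emptySet()));
        shared.retainAll(emailsByDoc.getOrDefault(docB, Collections.emptySet()));
        return shared.size();
    }

    public static void main(String[] args) {
        EmailOverlap overlap = new EmailOverlap();
        // Made-up sample data, for illustration only.
        overlap.record("doc1", "a@example.org");
        overlap.record("doc1", "b@example.org");
        overlap.record("doc2", "b@example.org");
        overlap.record("doc3", "c@example.org");

        String[] docs = {"doc1", "doc2", "doc3"};
        for (int i = 0; i < docs.length; i++) {
            for (int j = i + 1; j < docs.length; j++) {
                System.out.println(docs[i] + " / " + docs[j] + " in common: "
                    + overlap.commonCount(docs[i], docs[j]));
            }
        }
    }
}
```

Each CAS can be released as soon as its addresses have been recorded, which
is the point of moving the comparison out of the CAS. Whether addresses
should be normalized (for example lower-cased) before being stored is a
design choice this sketch leaves open.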
