uima-user mailing list archives

From Radwen ANIBA <arad...@gmail.com>
Subject Re: Using the Cas to compare documents
Date Thu, 25 Jun 2009 11:05:51 GMT
Thank you Thilo,

Well, I will investigate this idea.

Regards

Rad

2009/6/25 Thilo Goetz <twgoetz@gmx.de>

> Radwen ANIBA wrote:
> > Hi everyone,
> >
> > The example applications shipped with UIMA help us understand how
> > every component in the UIMA framework works. That's great. But one
> > question a developer may ask is how to use the CAS to compare
> > analyzed documents.
> >
> > The CAS is common to every document, and when analyzing one of them we
> > have access to the CAS for writing or updating.
> > Let's imagine we have 3 documents to analyze. We write metadata about
> > each of them to the CAS, but to take the analysis further it could be
> > very interesting to compare these documents using the CAS, either all
> > together or pairwise.
> >
> > To illustrate what I'm saying, let's imagine we are looking for email
> > addresses inside three big documents using UIMA's regexp capabilities.
> > A result might look like this:
> >
> > Document 1: Number of unique emails: 9 | In common with Document 2: 10 | In common with Document 3: 6
> > Document 2: Number of unique emails: 5 | In common with Document 1: 20 | In common with Document 3: 1
> > Document 3: Number of unique emails: 4 | In common with Document 1: 15 | In common with Document 2: 3
> >
> > This is a simple pairwise cross-comparison of documents using the CAS.
> > My question is how to achieve that.
> > Do we need to create an additional type system for the common
> > information? Do we have to do it dynamically, on the fly?
> >
> > Thanks
> >
> > Rad
> >
>
> Hi Rad,
>
> using the CAS to do this will get expensive very quickly.  You will
> not want to keep every document in its own CAS because of the memory
> overhead.  I would probably write the information you're interested
> in to an external datastore (e.g., a DB such as Derby) and do the
> comparison there.
>
> --Thilo
>
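
The suggestion above (write the extracted information to an external
datastore and compare there) could be sketched roughly as follows. This
is only an illustration under stated assumptions: SQLite, through
Python's built-in sqlite3 module, stands in for Derby so that the
example is self-contained, and the table name email_hit and the
functions open_store, store_emails, pairwise_common and unique_counts
are hypothetical names, not part of UIMA. In a real pipeline,
something like store_emails would presumably be called once per
document, after the email annotations have been read from the CAS, so
that no CAS has to be kept in memory for the comparison.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open the datastore and create the (hypothetical) email_hit table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS email_hit ("
        " doc_id TEXT NOT NULL,"
        " email  TEXT NOT NULL,"
        " PRIMARY KEY (doc_id, email))"
    )
    return conn


def store_emails(conn, doc_id, emails):
    """Persist the distinct addresses found in one document."""
    conn.executemany(
        "INSERT OR IGNORE INTO email_hit (doc_id, email) VALUES (?, ?)",
        [(doc_id, e.lower()) for e in emails],
    )
    conn.commit()


def pairwise_common(conn):
    """Map (doc_a, doc_b), with doc_a < doc_b, to the number of shared addresses.

    Pairs that share no address are absent from the result.
    """
    rows = conn.execute(
        "SELECT a.doc_id, b.doc_id, COUNT(*)"
        " FROM email_hit AS a"
        " JOIN email_hit AS b"
        "   ON a.email = b.email AND a.doc_id < b.doc_id"
        " GROUP BY a.doc_id, b.doc_id"
    )
    return {(a, b): n for a, b, n in rows}


def unique_counts(conn):
    """Map doc_id to the number of its addresses found in no other document.

    Documents without any such address are absent from the result.
    """
    rows = conn.execute(
        "SELECT h.doc_id, COUNT(*)"
        " FROM email_hit AS h"
        " WHERE NOT EXISTS ("
        "   SELECT 1 FROM email_hit AS o"
        "   WHERE o.email = h.email AND o.doc_id <> h.doc_id)"
        " GROUP BY h.doc_id"
    )
    return dict(rows)


if __name__ == "__main__":
    # Made-up sample data; in practice these lists would come from the
    # email annotations of each analyzed document.
    store = open_store()
    store_emails(store, "doc1", ["a@x.org", "b@x.org", "c@x.org"])
    store_emails(store, "doc2", ["b@x.org", "c@x.org", "d@x.org"])
    store_emails(store, "doc3", ["c@x.org", "e@x.org"])
    print(pairwise_common(store))
    print(unique_counts(store))
```

Each pair is counted once, so the count for (doc1, doc2) is also the
count for (doc2, doc1). Since only the extracted strings are stored,
memory use does not grow with the size of the analyzed documents.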
