uima-user mailing list archives

From: holmberg2066@comcast.net (g...@holmberg.name)
Subject: Re: UIMA chunking
Date: Tue, 22 Jul 2008 01:15:11 GMT
Olivier--


I can't comment on the mechanics of CAS merging that you outline below, but two thoughts occur
to me.

1. What's the motivation for merging?  For example, if one is going to put the data into a
system whose purpose is retrieving documents (indexing into a full-text index or inserting
into a database), then the user may not even want the entire document back as a result.  In
other words, examine the assumption that the unit of retrieval should be the entire document.
It may be more useful to return some natural sub-unit, such as a chapter or section.  If the
user gets back something huge, he just has another search task: finding the information
somewhere in the 500 pages returned.

As a side benefit, the linguistic analysis may do a better job when limited to a natural
sub-unit, since such units are usually more conceptually constrained.  A chapter is about one
thing; an entire book is about many things.  Also, some document-level analyses can get out of
hand on large documents, such as entity co-reference resolution if it's an O(n^2) algorithm.

If the goal is not document retrieval, but text mining for "facts" and so on, then the document
boundary doesn't matter at all, and again merging isn't necessary.  The user just wants the
information, and the document boundary isn't even visible.

In short, it's hard for me to imagine a use case where merging results from a huge document
would even be desirable.

I also think that merging just delays the memory problem.  In many cases, annotations for
parts of speech, named entities, etc. use several times the memory of the document itself.
(If annotations run at, say, five times the text, a 100 MB document needs 600 MB just to hold
the merged result.)  So although this may be less memory than is needed while the annotators
are running, you're still going to hit a document size that can't be handled.  And it may not
be much larger than the document size that you currently can't handle.

So, I think it's both desirable and necessary to split the document on natural boundaries
as it streams into the process, and then just view each segment as a separate document.  These
natural boundaries make sense to me; the arbitrarily-sized chunking, not so much.
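
To make that concrete, here's a rough, untested sketch of such a splitter, built on UIMA's
JCasMultiplier_ImplBase.  The blank-line heuristic and the class name are just placeholders;
a real splitter would use whatever natural boundary your documents provide, and would also
record a document ID and part numbers in each segment CAS:

import java.util.ArrayList;
import java.util.List;

import org.apache.uima.analysis_component.JCasMultiplier_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.cas.AbstractCas;
import org.apache.uima.jcas.JCas;

public class ParagraphSplitter extends JCasMultiplier_ImplBase {

  private List<String> segments = new ArrayList<String>();
  private int nextIndex = 0;

  public void process(JCas aJCas) throws AnalysisEngineProcessException {
    segments.clear();
    nextIndex = 0;
    // Split on blank lines -- a stand-in for whatever natural
    // boundary (section, chapter) your documents actually have.
    for (String para : aJCas.getDocumentText().split("\\n\\s*\\n")) {
      if (para.trim().length() > 0) {
        segments.add(para);
      }
    }
  }

  public boolean hasNext() {
    return nextIndex < segments.size();
  }

  public AbstractCas next() throws AnalysisEngineProcessException {
    // Each segment becomes an ordinary CAS that the rest of the
    // pipeline can treat as a self-contained document.
    JCas segmentCas = getEmptyJCas();
    segmentCas.setDocumentText(segments.get(nextIndex++));
    return segmentCas;
  }
}

Downstream components then see each segment as an ordinary document, so memory stays bounded
by the largest segment rather than the largest book.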


2. If you really need to merge the results, then I would look for a way to incrementally add
the pieces to the repository, rather than try to get it all back together in memory.  For
example, each segment could update the full-text index, or insert more records in a database,
related to the same document ID.  So the repository accumulates results on disk for the document,
but the results are never all together in RAM.
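
For illustration, a consumer along these lines could flush each segment's annotations as it
arrives.  The JDBC URL, the table layout, and the getDocumentId() helper are all invented for
this sketch; substitute your own schema and whatever metadata your splitter stores in the CAS:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

import org.apache.uima.cas.CAS;
import org.apache.uima.cas.text.AnnotationFS;
import org.apache.uima.collection.CasConsumer_ImplBase;
import org.apache.uima.resource.ResourceInitializationException;
import org.apache.uima.resource.ResourceProcessException;

public class IncrementalDbConsumer extends CasConsumer_ImplBase {

  private Connection conn;

  public void initialize() throws ResourceInitializationException {
    try {
      // Hypothetical connection string -- point it at your repository.
      conn = DriverManager.getConnection("jdbc:postgresql://localhost/annotations");
    } catch (SQLException e) {
      throw new ResourceInitializationException(e);
    }
  }

  public void processCas(CAS aCas) throws ResourceProcessException {
    try {
      PreparedStatement stmt = conn.prepareStatement(
          "INSERT INTO annotation (doc_id, begin_pos, end_pos, type, covered_text)"
          + " VALUES (?, ?, ?, ?, ?)");
      String docId = getDocumentId(aCas);
      // Flush this segment's annotations; rows from all segments of the
      // same document share a doc_id, so nothing is ever merged in RAM.
      for (AnnotationFS ann : aCas.getAnnotationIndex()) {
        stmt.setString(1, docId);
        stmt.setInt(2, ann.getBegin());
        stmt.setInt(3, ann.getEnd());
        stmt.setString(4, ann.getType().getName());
        stmt.setString(5, ann.getCoveredText());
        stmt.addBatch();
      }
      stmt.executeBatch();
      stmt.close();
    } catch (SQLException e) {
      throw new ResourceProcessException(e);
    }
  }

  private String getDocumentId(CAS aCas) {
    // Placeholder: read the document ID from whatever feature
    // structure your splitter put in the CAS.
    return "unknown-doc";
  }
}

If each row also carried the segment's text offset, the begin/end positions could be rebased
to whole-document coordinates at query time.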

Alternatively, move to a 64-bit CPU/OS/JVM with many gigabytes of RAM installed, and process
the document as usual (no chunking).  Buying that hardware might be less expensive than the
labor involved in making chunking work.  You can buy a quad-core server with 8 GB RAM for
$1000 (check out the Dell PowerEdge T105).  How much is your time worth?


Greg Holmberg


 -------------- Original message ----------------------
From: "Olivier Terrier" <olivier.terrier@temis.com>
> Hi all,
> Sometimes we face the problem of processing collections of "big" documents.
> This can lead to instability in the processing chain: out-of-memory errors,
> timeouts, etc.
> Moreover, this is not very efficient in terms of load balancing (we use CPEs with
> analysis engines deployed as Vinci remote services on several machines).
> We would like to solve this problem by implementing a kind of UIMA document
> chunking, where big documents would be split into reasonable chunks (according
> to a given block size, for example) at the beginning of the processing chain
> and merged back into one CAS at the end.
> In our view, the splitting phase is quite straightforward: a CAS multiplier
> splits the input document into N text blocks and produces N CASes.
> Chunking information such as:
> - document identifier
> - current part number
> - total part number
> - text offset
> is stored in the CAS.
> The merging phase is much more complicated: a CAS consumer is responsible for
> intercepting each "part" and storing it somewhere (in memory or serialized on the
> filesystem); when the last part of the document comes in, all the annotations of
> the CAS parts are merged back, taking the offsets into account.
> As we use a CPE, the merger CAS consumer can't "produce" a new CAS. What we have 
> in mind is to create a new Sofa "fullDocumentView" in the last CAS "part" to 
> store the text of the full document along with its associated annotations.
> Another idea is to use sofa mappings so that our existing CAS consumers (which
> are sofa-unaware) that come after the merger in the CPE flow can stay unchanged.
>       CPE flow:
>
>     CAS SPLITTER
>       _InitialView: text part_i
>       fullDocumentView: empty
>           |
>         AE1
>       _InitialView: text part_i + annotations AE1
>       fullDocumentView: empty
>           |
>         ...
>           |
>         AEn
>       _InitialView: text part_i + annotations AE1+...+AEn
>       fullDocumentView: empty
>           |
>     CAS MERGER
>       _InitialView: text part_i + annotations AE1+...+AEn
>       fullDocumentView: if not last part: empty
>                         if last part: text + annotations merged part_1+...+part_N
>           |
>     CONSUMER (sofa-unaware)
>       MAPPING: CPE sofa fullDocumentView => component sofa _InitialView
>       _InitialView: text + annotations merged part_1+...+part_N
> 
> The tricky operations are:
> - caching/storing the CAS 'parts' in the merger: how (XCAS, XMI, etc.)? where
> (memory, disk, ...)?
> - merging the CAS 'parts' annotations into the full-document CAS.
> - error management: what happens in case of errors on some parts?
> We would like to hear the thoughts/opinions of the UIMA community regarding
> this problem and the possible solutions.
> Do you think our approach is the right one?
> Has anybody already faced a similar problem?
> As far as possible we don't want to reinvent the wheel, and we would give
> priority to a generic and ideally UIMA-built-in implementation. We are of
> course ready to contribute to this development if the community finds a
> generic solution.
> Regards
> Olivier Terrier - TEMIS 
> 

