uima-user mailing list archives

From "Olivier Terrier" <olivier.terr...@temis.com>
Subject UIMA chunking
Date Mon, 21 Jul 2008 14:34:21 GMT
Hi all,
We sometimes face the problem of processing collections of "big" documents.
This can make the processing chain unstable: out-of-memory errors, timeouts, etc.
Moreover, it is not very efficient in terms of load balancing (we use CPEs with analysis
engines deployed as Vinci remote services on several machines).
We would like to solve this problem by implementing a kind of UIMA document chunking, where
big documents would be split into reasonable chunks (according to a given block size, for
example) at the beginning of the processing chain and merged back into one CAS at the end.
In our view, the splitting phase is quite straightforward: a CAS multiplier
splits the input document into N text blocks and produces N CASes.
Chunking information such as:
- document identifier
- current part number
- total number of parts
- text offset
is stored in each CAS.
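As a rough illustration of the splitting step, here is a minimal sketch in plain Java. It does not use the UIMA API: the class DocumentChunker, the Chunk holder and its field names are hypothetical, and in a real CAS multiplier the same four values would be stored in each produced CAS. The fixed block size is the simplest possible policy and is only an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch in plain Java; no UIMA types are used. */
final class DocumentChunker {

    /** The chunking information listed above, plus the text of the part. */
    static final class Chunk {
        final String documentId; // document identifier
        final int partNumber;    // current part number, 1-based
        final int totalParts;    // total number of parts
        final int textOffset;    // offset of this part in the full document text
        final String text;       // text of this part

        Chunk(String documentId, int partNumber, int totalParts, int textOffset, String text) {
            this.documentId = documentId;
            this.partNumber = partNumber;
            this.totalParts = totalParts;
            this.textOffset = textOffset;
            this.text = text;
        }
    }

    /** Splits the text into consecutive blocks of at most blockSize characters. */
    static List<Chunk> split(String documentId, String text, int blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        // Ceiling division; an empty document still yields one (empty) part.
        int total = Math.max(1, (text.length() + blockSize - 1) / blockSize);
        List<Chunk> chunks = new ArrayList<>();
        for (int i = 0; i < total; i++) {
            int begin = i * blockSize;
            int end = Math.min(text.length(), begin + blockSize);
            chunks.add(new Chunk(documentId, i + 1, total, begin, text.substring(begin, end)));
        }
        return chunks;
    }
}
```

A fixed-size cut can fall in the middle of a word or a sentence, so a real splitter would probably move each boundary to a nearby sentence or paragraph break; the stored offset makes that possible without changing anything downstream.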
The merging phase is much more complicated: a CAS consumer is responsible for intercepting
each "part" and storing it somewhere (in memory or serialized on the filesystem). When the last
part of the document comes in, all the annotations of the CAS parts are merged back, taking
the offsets into account.
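The offset-aware merge can be sketched as follows, again in plain Java rather than against the UIMA API. The classes AnnotationMerger, Span, Part and Merged are hypothetical stand-ins for annotations and CAS parts; the only point illustrated is that each annotation's begin and end are shifted by the text offset of the part it came from.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch in plain Java; Span stands in for a UIMA annotation. */
final class AnnotationMerger {

    static final class Span {
        final String type;
        final int begin;
        final int end;

        Span(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    static final class Part {
        final int partNumber;   // current part number
        final int textOffset;   // offset of this part in the full document
        final String text;
        final List<Span> spans; // offsets relative to this part's text

        Part(int partNumber, int textOffset, String text, List<Span> spans) {
            this.partNumber = partNumber;
            this.textOffset = textOffset;
            this.text = text;
            this.spans = spans;
        }
    }

    static final class Merged {
        final String text;
        final List<Span> spans; // offsets relative to the full document

        Merged(String text, List<Span> spans) {
            this.text = text;
            this.spans = spans;
        }
    }

    /** Rebuilds the full text and shifts every span by its part's offset. */
    static Merged merge(List<Part> parts) {
        List<Part> ordered = new ArrayList<>(parts);
        ordered.sort(Comparator.comparingInt((Part p) -> p.partNumber));
        StringBuilder text = new StringBuilder();
        List<Span> spans = new ArrayList<>();
        for (Part p : ordered) {
            // Sanity check: parts must be contiguous, with no gap or overlap.
            if (p.textOffset != text.length()) {
                throw new IllegalStateException("unexpected offset for part " + p.partNumber);
            }
            text.append(p.text);
            for (Span s : p.spans) {
                spans.add(new Span(s.type, s.begin + p.textOffset, s.end + p.textOffset));
            }
        }
        return new Merged(text.toString(), spans);
    }
}
```

The sketch sorts by part number before merging, so it does not depend on the parts reaching the merger in order. Annotations that would have crossed a chunk boundary are not recovered by this kind of merge.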
As we use a CPE, the merger CAS consumer can't "produce" a new CAS. What we have in mind is
to create a new Sofa "fullDocumentView" in the last CAS "part" to store the text of the full
document along with its associated annotations.
Another idea is to use sofa mappings so that our existing CAS consumers (which are
sofa-unaware) that come after the merger in the CPE flow can be left unchanged.
      CPE flow:
      
    CAS SPLITTER
_InitialView: text part_i
fullDocumentView: empty
          |
         AE1  
_InitialView: text part_i + annotations AE1
fullDocumentView: empty
          |
        ...
          |
         AEn
_InitialView: text part_i + annotations AE1+...+AEn
fullDocumentView: empty
          |
     CAS MERGER
_InitialView: text part_i + annotations AE1+...+AEn
fullDocumentView: if not last part = empty
                  if last part = text + annotations merged part1+...+partN
          |
      CONSUMER (sofa-unaware)
MAPPING cpe sofa : fullDocumentView => component sofa : _InitialView
_InitialView: text + annotations merged part1+...+partN

The tricky operations are:
- caching/storing the CAS 'parts' in the merger: how (XCAS, XMI, etc.)? where (memory, disk,
...)?
- merging the annotations of the CAS 'parts' into the full document CAS.
- error management: what happens in case of errors on some parts?
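To make the caching and error questions more concrete, here is one possible in-memory bookkeeping scheme, sketched in plain Java. Everything in it is an assumption for illustration: the class PartTracker, the Status values, the use of a null payload to signal a failed part, and the policy of marking the whole document as failed if any part failed. The payload string stands for whatever serialized form (XCAS, XMI, ...) ends up being chosen.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical sketch: tracks which parts of each document have arrived. */
final class PartTracker {

    enum Status { WAITING, COMPLETE, FAILED }

    private static final class Entry {
        final int totalParts;
        final Map<Integer, String> stored = new TreeMap<>(); // partNumber -> serialized part
        final Set<Integer> failed = new TreeSet<>();         // part numbers that errored

        Entry(int totalParts) {
            this.totalParts = totalParts;
        }
    }

    private final Map<String, Entry> pending = new HashMap<>();

    /**
     * Records one part. A null payload means processing of that part failed
     * (an assumption of this sketch). Returns WAITING until every part has
     * been seen, then COMPLETE or FAILED, and forgets the document.
     */
    Status accept(String documentId, int partNumber, int totalParts, String payload) {
        Entry e = pending.computeIfAbsent(documentId, id -> new Entry(totalParts));
        if (payload == null) {
            e.failed.add(partNumber);
        } else {
            e.stored.put(partNumber, payload);
        }
        if (e.stored.size() + e.failed.size() < e.totalParts) {
            return Status.WAITING;
        }
        pending.remove(documentId); // free the cached parts for this document
        return e.failed.isEmpty() ? Status.COMPLETE : Status.FAILED;
    }
}
```

Counting parts, rather than waiting for the part with the highest number, keeps the merger correct if parts arrive out of order. Other error policies, such as merging only the parts that succeeded, would fit the same structure.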
We would like to hear the thoughts/opinions of the UIMA community regarding this problem
and the possible solutions.
Do you think our approach is a good one?
Has anybody already faced a similar problem?
As far as possible, we don't want to reinvent the wheel, and we would give priority to a generic and
ideally UIMA built-in implementation. We are of course ready to contribute to this development
if the community finds a generic solution.
Regards
Olivier Terrier - TEMIS 
