uima-user mailing list archives

From Matthew DeAngelis <roni...@gmail.com>
Subject Views or Separate CASes?
Date Wed, 26 Aug 2015 14:45:45 GMT
Hello UIMA Gurus,

I am relatively new to UIMA, so please excuse the general nature of my
question and any butchering of the terminology.

I am attempting to write an application to process transcripts of audio
files. Each "raw" transcript is in its own HTML file with a section listing
biographical information for the speakers on the call followed by a number
of sections containing transcriptions of the discussion of different
topics. I would like to be able to analyze each speaker's contributions
separately by topic and then aggregate and compare these analyses between
speakers and between each speaker and the full text. I was thinking that I
would break the document into a new segment each time the speaker or the
section of the document changes (attaching relevant speaker metadata to
each section), run additional Analysis Engines on each segment (tokenizer,
etc.), and then arbitrarily recombine the results of the analysis by
speaker, etc.
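
To make the intended segmentation concrete, here is a minimal sketch in plain Java that deliberately uses no UIMA APIs. It assumes the HTML has already been parsed into {section, speaker, text} utterances; the class name TranscriptSegments, the Segment type, and the whitespace token count (a stand-in for real Analysis Engines) are all hypothetical and only illustrate the split-then-recombine idea.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TranscriptSegments {

    /** One contiguous run of text by a single speaker within a single section. */
    static final class Segment {
        final String section;
        final String speaker;
        final StringBuilder text = new StringBuilder();

        Segment(String section, String speaker) {
            this.section = section;
            this.speaker = speaker;
        }

        /** Stand-in for real analysis: counts whitespace-separated tokens. */
        int tokenCount() {
            String t = text.toString().trim();
            return t.isEmpty() ? 0 : t.split("\\s+").length;
        }
    }

    /**
     * Each utterance is {section, speaker, text}. A new segment starts
     * whenever the section or the speaker differs from the previous utterance;
     * otherwise the text is appended to the current segment.
     */
    static List<Segment> segment(List<String[]> utterances) {
        List<Segment> segments = new ArrayList<>();
        Segment current = null;
        for (String[] u : utterances) {
            boolean boundary = current == null
                    || !current.section.equals(u[0])
                    || !current.speaker.equals(u[1]);
            if (boundary) {
                current = new Segment(u[0], u[1]);
                segments.add(current);
            } else {
                current.text.append(' ');
            }
            current.text.append(u[2]);
        }
        return segments;
    }

    /** Recombines per-segment results by speaker, in first-seen order. */
    static Map<String, Integer> tokensBySpeaker(List<Segment> segments) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (Segment s : segments) {
            totals.merge(s.speaker, s.tokenCount(), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Made-up sample utterances, not taken from a real transcript.
        List<String[]> utterances = new ArrayList<>();
        utterances.add(new String[] {"Topic A", "Alice", "Good morning everyone."});
        utterances.add(new String[] {"Topic A", "Alice", "Let us begin."});
        utterances.add(new String[] {"Topic A", "Bob", "Thanks, Alice."});
        utterances.add(new String[] {"Topic B", "Bob", "Moving on to the next topic."});

        List<Segment> segments = segment(utterances);
        System.out.println(segments.size() + " segments");
        System.out.println(tokensBySpeaker(segments));
    }
}
```

The same grouping could be keyed by section, or by section and speaker together. In UIMA terms, each Segment would presumably correspond either to its own CAS or to a view or annotation within one CAS, which is exactly the choice I am unsure about.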

Looking through the documentation, I am considering two approaches:

1. Using a CAS Multiplier. Under this approach, I would follow the example
in Chapter 7 of the documentation, divide on section and speaker
demarcations, add metadata to each CAS, run additional AEs on the CASes,
and then use a multiplier to recombine the many CASes for each document
(one for the whole transcript, one for each section, one for each speaker,
etc.). The advantage of this approach is that it seems easy to incorporate
into a pipeline of AEs, since they are designed to run on each CAS. The
disadvantage is that it seems unwieldy to have to keep track of all of the
related CASes per document and aggregate statistics across the CASes.

2. Using CAS Views. This option is appealing because it seems like CAS Views
were designed for associating many different aspects of the same document
with one another. However, it looks to me as if I would have to specify
different views both when parsing the document into sections and when
passing them through subsequent AEs, which would make it harder to drop
into an existing pipeline. I may be misunderstanding how subsequent AEs
work with Views, however.

For those more experienced with UIMA, how would you approach this problem?
It's entirely possible that I am missing a third (fourth, fifth...)
approach that would work better than either of those above, so any guidance
would be much appreciated.

Regards and thanks,
