uima-user mailing list archives

From Richard Eckart de Castilho <...@apache.org>
Subject Re: Views or Separate CASes?
Date Wed, 26 Aug 2015 14:51:27 GMT
Hi,

I'd probably opt for approach 1. Adding provenance metadata to CASes
or maintaining such data externally is a useful thing anyway. If you
maintain such data in a database/index, you can quickly cut subcorpora
as necessary and are flexible for future use-cases that might require
differently cut subcorpora. If you also maintain certain statistics
in your database, it allows you to query/aggregate faster than if you
have to read all CASes with a certain property just to gather the statistics.
Updating the DB whenever a change is made to a CAS (or a block of changes
has been made) would be sufficient and could be handled by a dedicated
component that you place at the end of all kinds of pipelines that you
might run.
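
To make the idea concrete, here is a minimal sketch of such an index. It
is only an illustration under assumptions: the table layout, the column
names and the function names (open_index, record_segment, subcorpus,
tokens_by_speaker) are all hypothetical, SQLite merely stands in for
whatever database/index you prefer, and nothing here is UIMA API. A
component at the end of a pipeline would call something like
record_segment once per CAS.

```python
import sqlite3


def open_index(path=":memory:"):
    """Open (or create) the provenance index. The schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS segments (
               cas_id      TEXT PRIMARY KEY,
               document    TEXT NOT NULL,
               section     TEXT NOT NULL,
               speaker     TEXT NOT NULL,
               token_count INTEGER NOT NULL
           )"""
    )
    return conn


def record_segment(conn, cas_id, document, section, speaker, token_count):
    """Insert or update the row for one CAS after it has been processed."""
    # INSERT OR REPLACE keeps a single row per cas_id, so re-running a
    # pipeline on the same CAS overwrites the old statistics.
    conn.execute(
        "INSERT OR REPLACE INTO segments VALUES (?, ?, ?, ?, ?)",
        (cas_id, document, section, speaker, token_count),
    )
    conn.commit()


def subcorpus(conn, speaker):
    """Return the ids of all CASes that belong to one speaker."""
    rows = conn.execute(
        "SELECT cas_id FROM segments WHERE speaker = ? ORDER BY cas_id",
        (speaker,),
    )
    return [row[0] for row in rows]


def tokens_by_speaker(conn):
    """Aggregate a stored statistic without reading any CAS."""
    rows = conn.execute(
        "SELECT speaker, SUM(token_count) FROM segments GROUP BY speaker"
    )
    return dict(rows)
```

Cutting a different subcorpus later (by section, by document, ...) is then
just another query against the same table, and the CASes themselves only
need to be loaded once you know which ones you want.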

Views would seem more appropriate if you cared about having one view
for the transcription and another for the audio signal and wanted to
annotate them independently / align them to each other.

Cheers,

-- Richard

On 26.08.2015, at 16:45, Matthew DeAngelis <ronin78@gmail.com> wrote:

> Hello UIMA Gurus,
> 
> I am relatively new to UIMA, so please excuse the general nature of my
> question and any butchering of the terminology.
> 
> I am attempting to write an application to process transcripts of audio
> files. Each "raw" transcript is in its own HTML file with a section listing
> biographical information for the speakers on the call followed by a number
> of sections containing transcriptions of the discussion of different
> topics. I would like to be able to analyze each speaker's contributions
> separately by topic and then aggregate and compare these analyses between
> speakers and between each speaker and the full text. I was thinking that I
> would break the document into a new segment each time the speaker or the
> section of the document changes (attaching relevant speaker metadata to
> each section), run additional Analysis Engines on each segment (tokenizer,
> etc.), and then arbitrarily recombine the results of the analysis by
> speaker, etc.
> 
> Looking through the documentation, I am considering two approaches:
> 
> 1. Using a CAS Multiplier. Under this approach, I would follow the example
> in Chapter 7 of the documentation, divide on section and speaker
> demarcations, add metadata to each CAS, run additional AEs on the CASes,
> and then use a multiplier to recombine the many CASes for each document
> (one for the whole transcript, one for each section, one for each speaker,
> etc.). The advantage of this approach is that it seems easy to incorporate
> into a pipeline of AEs, since they are designed to run on each CAS. The
> disadvantage is that it seems unwieldy to have to keep track of all of the
> related CASes per document and aggregate statistics across the CASes.
> 
> 2. Use CAS Views. This option is appealing because it seems like CAS Views
> were designed for associating many different aspects of the same document
> with one another. However, it looks to me that I would have to specify
> different views both when parsing the document into sections and when
> passing them through subsequent AEs, which would make it harder to drop
> into an existing pipeline. I may be misunderstanding how subsequent AEs
> work with Views, however.
> 
> For those more experienced with UIMA, how would you approach this problem?
> It's entirely possible that I am missing a third (fourth, fifth...)
> approach that would work better than either of those above, so any guidance
> would be much appreciated.
> 
> 
> Regards and thanks,
> Matt

