uima-user mailing list archives

From Jens Grivolla <j+...@grivolla.net>
Subject Re: Views or Separate CASes?
Date Mon, 31 Aug 2015 11:30:50 GMT
Hi Matt,

As Richard said, Views are designed more for holding "parallel"
information, such as separate layers of audio, transcript, video, etc.,
that all refer to the same content or "document".

I'm not quite sure why you want to split your document for processing
(which you could do with a CAS Multiplier). Wouldn't it be much easier to
maintain and process it as one document, marking the different segments
with e.g. speaker information? Your AEs can run on all the segments, and
most can be instructed not to cross segment boundaries or to work only at
the sentence level anyway.
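
To make the single-document idea concrete, here is a minimal sketch in
plain Java. It deliberately does not use the UIMA API: the class
SegmentedTranscript, the Segment type, and the fromTurns and
tokensBySpeaker methods are all hypothetical stand-ins for a CAS, a
segment annotation type with a speaker feature, and an AE that works
per segment. In a real pipeline the segments would be annotations
defined in your type system.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java stand-in for one CAS: a single document text plus segment
// "annotations" (begin/end offsets with a speaker). This is NOT the UIMA
// API; all names here are hypothetical.
final class SegmentedTranscript {

    static final class Segment {
        final int begin;
        final int end;
        final String speaker;

        Segment(int begin, int end, String speaker) {
            this.begin = begin;
            this.end = end;
            this.speaker = speaker;
        }
    }

    final String text;
    final List<Segment> segments;

    private SegmentedTranscript(String text, List<Segment> segments) {
        this.text = text;
        this.segments = segments;
    }

    // Builds one document from {speaker, utterance} pairs, joining the turns
    // with newlines and recording each turn as a segment over the joined text.
    static SegmentedTranscript fromTurns(String[][] turns) {
        StringBuilder sb = new StringBuilder();
        List<Segment> segs = new ArrayList<>();
        for (String[] turn : turns) {
            if (sb.length() > 0) {
                sb.append('\n');
            }
            int begin = sb.length();
            sb.append(turn[1]);
            segs.add(new Segment(begin, sb.length(), turn[0]));
        }
        return new SegmentedTranscript(sb.toString(), segs);
    }

    // Toy analysis that only ever sees one segment's text, so it cannot
    // cross a segment boundary.
    static int countTokens(String segmentText) {
        String trimmed = segmentText.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    // Aggregates the per-segment results by speaker.
    Map<String, Integer> tokensBySpeaker() {
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (Segment s : segments) {
            int n = countTokens(text.substring(s.begin, s.end));
            totals.merge(s.speaker, n, Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        SegmentedTranscript t = fromTurns(new String[][] {
            {"Alice", "Good morning everyone."},
            {"Bob", "Thanks, happy to be here."},
            {"Alice", "Let us begin."}
        });
        System.out.println(t.tokensBySpeaker());
    }
}
```

The point of the sketch is that the document is never split: every
segment is just a span over the same text, so aggregating by speaker (or
comparing against the full text) needs no bookkeeping across CASes.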

Of course, if what you want is to search for and retrieve segments that
pertain to different speakers, then you will need to index your content in
something like Solr, outside of UIMA. While you could use a CAS Multiplier
and then index each generated CAS as a document, it is much easier to have
a CasConsumer that knows how to deal with your segment annotations and
extracts the information you want to index in an appropriate form.
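
As a rough sketch of what such a consumer would extract, the plain-Java
example below turns each segment of one document into a flat field map,
one per segment, ready to be handed to an indexer. It uses neither the
UIMA nor the Solr API; the class SegmentIndexSketch, the toIndexDocs
method, and the field names (id, doc_id, speaker, text) are assumptions
made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the extraction step of a consumer: one document
// in, one flat record per segment out. Not UIMA or Solr API.
final class SegmentIndexSketch {

    static final class Segment {
        final int begin;
        final int end;
        final String speaker;

        Segment(int begin, int end, String speaker) {
            this.begin = begin;
            this.end = end;
            this.speaker = speaker;
        }
    }

    // Builds one field map per segment. The field names are illustrative;
    // a real index schema would define its own.
    static List<Map<String, String>> toIndexDocs(
            String docId, String text, List<Segment> segments) {
        List<Map<String, String>> docs = new ArrayList<>();
        int n = 0;
        for (Segment s : segments) {
            Map<String, String> fields = new LinkedHashMap<>();
            fields.put("id", docId + "#" + n);   // unique per segment
            fields.put("doc_id", docId);         // links back to the source document
            fields.put("speaker", s.speaker);
            fields.put("text", text.substring(s.begin, s.end));
            docs.add(fields);
            n++;
        }
        return docs;
    }

    public static void main(String[] args) {
        String text = "Hello there. Hi.";
        List<Segment> segments = new ArrayList<>();
        segments.add(new Segment(0, 12, "Alice"));
        segments.add(new Segment(13, 16, "Bob"));
        for (Map<String, String> doc : toIndexDocs("call-1", text, segments)) {
            System.out.println(doc);
        }
    }
}
```

Keeping doc_id on every record means the search index can return
individual segments while still letting you group them back by source
document.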

You may want to look at our project EUMSSI (http://eumssi.eu/), which is
about doing exactly this. You can find our initial design here:
http://www.aclweb.org/anthology/W14-5212, which we presented at the last
UIMA workshop (http://glicom.upf.edu/OIAF4HLT/), and some more
documentation at https://github.com/EUMSSI/EUMSSI-platform/wiki.

The segment indexing is not in there yet, but I expect to put something on
GitHub in the next week or two.


On Wed, Aug 26, 2015 at 4:45 PM, Matthew DeAngelis <ronin78@gmail.com> wrote:

> Hello UIMA Gurus,
>
> I am relatively new to UIMA, so please excuse the general nature of my
> question and any butchering of the terminology.
>
> I am attempting to write an application to process transcripts of audio
> files. Each "raw" transcript is in its own HTML file with a section listing
> biographical information for the speakers on the call followed by a number
> of sections containing transcriptions of the discussion of different
> topics. I would like to be able to analyze each speaker's contributions
> separately by topic and then aggregate and compare these analyses between
> speakers and between each speaker and the full text. I was thinking that I
> would break the document into a new segment each time the speaker or the
> section of the document changes (attaching relevant speaker metadata to
> each section), run additional Analysis Engines on each segment (tokenizer,
> etc.), and then arbitrarily recombine the results of the analysis by
> speaker, etc.
>
> Looking through the documentation, I am considering two approaches:
>
> 1. Using a CAS Multiplier. Under this approach, I would follow the example
> in Chapter 7 of the documentation, divide on section and speaker
> demarcations, add metadata to each CAS, run additional AEs on the CASes,
> and then use a multiplier to recombine the many CASes for each document
> (one for the whole transcript, one for each section, one for each speaker,
> etc.). The advantage of this approach is that it seems easy to incorporate
> into a pipeline of AEs, since they are designed to run on each CAS. The
> disadvantage is that it seems unwieldy to have to keep track of all of the
> related CASes per document and aggregate statistics across the CASes.
>
> 2. Using CAS Views. This option is appealing because it seems like CAS Views
> were designed for associating many different aspects of the same document
> with one another. However, it looks to me that I would have to specify
> different views both when parsing the document into sections and when
> passing them through subsequent AEs, which would make it harder to drop
> into an existing pipeline. I may be misunderstanding how subsequent AEs
> work with Views, however.
>
> For those more experienced with UIMA, how would you approach this problem?
> It's entirely possible that I am missing a third (fourth, fifth...)
> approach that would work better than either of those above, so any guidance
> would be much appreciated.
>
> Regards and thanks,
> Matt
