From Matthew DeAngelis <roni...@gmail.com>
Subject Re: Views or Separate CASes?
Date Tue, 01 Sep 2015 12:43:25 GMT
Hi Jens,

Thank you very much for your thoughtful input. At the time of writing my
question, I was not aware how easy it was (using uimaFIT functionality) to
query for Annotations within other Annotations. As such, it seemed to make
sense to make hard boundaries between sections. I also only wanted to
tokenize and perform word counts on certain sections in the document,
without the HTML markup, which required me to clean within sections as well.
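
For illustration, querying for Annotations within other Annotations comes down to an offset-containment test; uimaFIT provides helpers such as JCasUtil.selectCovered for this. The sketch below is not UIMA or uimaFIT code. It uses a hypothetical Span class as a stand-in for an Annotation (a type name plus begin/end offsets) and a hypothetical CoveredQuery.selectCovered helper, only to show the idea.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a UIMA Annotation: a typed span of character offsets.
final class Span {
    final String type;
    final int begin;
    final int end;

    Span(String type, int begin, int end) {
        this.type = type;
        this.begin = begin;
        this.end = end;
    }
}

final class CoveredQuery {
    // Returns every span of the given type that lies entirely inside `covering`.
    // This mimics the idea of a "covered" query; it is not the uimaFIT implementation.
    static List<Span> selectCovered(List<Span> all, String type, Span covering) {
        List<Span> result = new ArrayList<>();
        for (Span s : all) {
            boolean inside = s.begin >= covering.begin && s.end <= covering.end;
            if (s != covering && s.type.equals(type) && inside) {
                result.add(s);
            }
        }
        return result;
    }
}
```

With a query like this, a section only needs to be marked by one covering span; the tokens inside it can be found afterwards without splitting the document.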

Ultimately, however, I decided to do something much like you suggest,
except that I created a separate view to hold the extracted and "cleaned"
text and Annotations of different sections, tokens, words, etc. My CAS
consumer has no difficulty extracting the sections that I need (I have to
loop to get every section for a particular speaker, but that's not much of
an issue) and working with Annotations within those sections. In other
words, your suggested approach works very well, and I appreciate you
sharing it with me.
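
As a rough sketch of the consumer-side loop described above (collecting every section for a speaker and counting the words inside it), assuming sections and words are available as offset spans over the cleaned text. The Segment class and SpeakerWordCounts.countBySpeaker are hypothetical names for this example and do not use the UIMA API.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a section annotation carrying speaker metadata.
final class Segment {
    final String speaker;
    final int begin;
    final int end;

    Segment(String speaker, int begin, int end) {
        this.speaker = speaker;
        this.begin = begin;
        this.end = end;
    }
}

final class SpeakerWordCounts {
    // Each int[] in `words` is {begin, end} of one word in the cleaned text.
    // Loops over all segments and sums, per speaker, the words each segment covers.
    static Map<String, Integer> countBySpeaker(List<Segment> segments, List<int[]> words) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Segment seg : segments) {
            int n = 0;
            for (int[] w : words) {
                if (w[0] >= seg.begin && w[1] <= seg.end) {
                    n++;
                }
            }
            // A speaker may own several segments, so accumulate rather than overwrite.
            counts.merge(seg.speaker, n, Integer::sum);
        }
        return counts;
    }
}
```

Because the per-speaker totals are accumulated in one pass over a single document, there is no need to track several related CASes and merge their statistics afterwards.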

Your project looks very interesting; I will keep an eye out for updates.


On Mon, Aug 31, 2015 at 7:30 AM, Jens Grivolla <j+asf@grivolla.net> wrote:

> Hi Matt,
>
> As Richard said, Views are designed more for holding "parallel"
> information, such as separate layers of audio, transcript, video, etc.
> referring to the same content or "document".
>
> I'm not quite sure why you want to split your document for processing
> (which you could do with a CAS Multiplier). Wouldn't it be much easier to
> just maintain and process it as one document, marking the different
> segments with e.g. speaker information, etc.? I don't quite understand
> your need for splitting; your AEs can run on all the segments (and most
> can be instructed not to cross segment boundaries or only work at the
> sentence level anyway).
>
> Of course if what you want is to be able to search for and retrieve
> segments that pertain to different speakers then you will need to index
> your content in something like Solr outside of UIMA, and while you could
> use a CAS Multiplier and then index each generated CAS as a document, it is
> much easier to just have a CasConsumer that knows how to deal with your
> segment annotations and extracts the information you want to index in an
> appropriate form.
>
> You may want to look at our project EUMSSI (http://eumssi.eu/) which is
> about doing exactly this. You can find our initial design here:
> http://www.aclweb.org/anthology/W14-5212 which we presented at the last
> UIMA workshop (http://glicom.upf.edu/OIAF4HLT/) and some more
> documentation on https://github.com/EUMSSI/EUMSSI-platform/wiki.
>
> The segment indexing is not in there yet, but I expect to put something on
> GitHub in the next one or two weeks.
>
> Best,
> Jens
>
> On Wed, Aug 26, 2015 at 4:45 PM, Matthew DeAngelis <ronin78@gmail.com>
> wrote:
> > Hello UIMA Gurus,
> >
> > I am relatively new to UIMA, so please excuse the general nature of my
> > question and any butchering of the terminology.
> >
> > I am attempting to write an application to process transcripts of audio
> > files. Each "raw" transcript is in its own HTML file with a section
> > listing biographical information for the speakers on the call followed by
> > a number of sections containing transcriptions of the discussion of
> > different topics. I would like to be able to analyze each speaker's
> > contributions separately by topic and then aggregate and compare these
> > analyses between speakers and between each speaker and the full text. I
> > was thinking that I would break the document into a new segment each time
> > the speaker or the section of the document changes (attaching relevant
> > speaker metadata to each section), run additional Analysis Engines on each
> > segment (tokenizer, etc.), and then arbitrarily recombine the results of
> > the analysis by speaker, etc.
> >
> > Looking through the documentation, I am considering two approaches:
> >
> > 1. Using a CAS Multiplier. Under this approach, I would follow the
> > example in Chapter 7 of the documentation, divide on section and speaker
> > demarcations, add metadata to each CAS, run additional AEs on the CASes,
> > and then use a multiplier to recombine the many CASes for each document
> > (one for the whole transcript, one for each section, one for each
> > speaker, etc.). The advantage of this approach is that it seems easy to
> > incorporate into a pipeline of AEs, since they are designed to run on
> > each CAS. The disadvantage is that it seems unwieldy to have to keep
> > track of all of the related CASes per document and aggregate statistics
> > across the CASes.
> >
> > 2. Use CAS Views. This option is appealing because it seems like CAS
> > Views were designed for associating many different aspects of the same
> > document with one another. However, it looks to me that I would have to
> > specify different views both when parsing the document into sections and
> > when passing them through subsequent AEs, which would make it harder to
> > drop into an existing pipeline. I may be misunderstanding how subsequent
> > AEs work with Views, however.
> >
> > For those more experienced with UIMA, how would you approach this
> > problem? It's entirely possible that I am missing a third (fourth,
> > fifth...) approach that would work better than either of those above, so
> > any guidance would be much appreciated.
> >
> >
> > Regards and thanks,
> > Matt
> >

