uima-user mailing list archives

From Eddie Epstein <eaepst...@gmail.com>
Subject Re: CAS merger/multiplier N:M mapping
Date Sun, 06 Sep 2015 14:58:44 GMT
Hi Petr

On Sun, Sep 6, 2015 at 10:11 AM, Petr Baudis <pasky@ucw.cz> wrote:

>   Hi!
>
>   I'm currently struggling to perform a complex flow transformation with
> UIMA.  I have multiple (N) CASes with some fulltext search results.
> I chop these search results to sentences and would like to pick the top
> M sentences from the search results collected and build CASes from them
> to do further analysis.  So, I'd like to copy subsets (document text
> wise and annotation wise) of N input CASes to M output CASes.  I don't
> know how to do this technically.  I tried two non-workable ideas so far:
>
>   (i) Keep around references to the respective views of input CASes
> and use them as CasCopier sources when the time comes to produce
> the new CASes.  Turns out the input CASes are (unsurprisingly) recycled
> and the references I kept around at process() time aren't valid when
> next() is called much later.
>
>   (ii) Use an internal "intermediary" CAS instance in process() to which
> I append my sentences, then use it as a source of output CASes.  Turns
> out (surprisingly) that I can't append to a sofa documenttext ("Data for
> Sofa feature setLocalSofaData() has already been set." - not sure about
> the reason for this restriction).
>

The Sofa data for a view is immutable once set; otherwise existing
annotations, which refer to it by offset, could become invalid.


>
>   I think the only choice except downright unmaintainable hacks (like
> programmatically generated M views) is to just give up on preserving my
> annotations and carry over just the sentence texts.  Am I missing
> something?
>

Creating a new view in the intermediate CAS for each of the N input CASes
would work. The Sofa of each new output CAS would then be composed of data
from multiple views, with the annotation begin/end offsets adjusted as the
annotations are added to the new output CAS.
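
To make the offset adjustment concrete, here is a minimal sketch in plain
Java. It deliberately uses no UIMA classes: SentenceMerger, Span, Sentence
and Merged are hypothetical stand-ins for views, annotations and the output
Sofa, and joining sentences with a newline is an assumption. In a real
component the copying would presumably go through CasCopier or your own
annotation-creation code instead.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java stand-ins for UIMA concepts; none of these types are UIMA API. */
final class SentenceMerger {

    /** A typed span over some text, analogous to an annotation's begin/end. */
    static final class Span {
        final String type;
        final int begin;
        final int end;

        Span(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    /** One selected sentence: a [begin, end) window of a source text plus that source's spans. */
    static final class Sentence {
        final String sourceText;
        final int begin;
        final int end;
        final List<Span> spans; // offsets are relative to sourceText

        Sentence(String sourceText, int begin, int end, List<Span> spans) {
            this.sourceText = sourceText;
            this.begin = begin;
            this.end = end;
            this.spans = spans;
        }
    }

    /** The merged text and the spans re-based onto it. */
    static final class Merged {
        final String text;
        final List<Span> spans;

        Merged(String text, List<Span> spans) {
            this.text = text;
            this.spans = spans;
        }
    }

    /** Builds the whole output text first, so it only has to be set once. */
    static Merged merge(List<Sentence> selected) {
        StringBuilder text = new StringBuilder();
        List<Span> spans = new ArrayList<>();
        for (Sentence s : selected) {
            if (text.length() > 0) {
                text.append('\n'); // separator between sentences (an assumption)
            }
            // Distance between the sentence's old and new start position.
            int shift = text.length() - s.begin;
            text.append(s.sourceText, s.begin, s.end);
            for (Span a : s.spans) {
                // Carry over only spans that lie entirely inside the sentence.
                if (a.begin >= s.begin && a.end <= s.end) {
                    spans.add(new Span(a.type, a.begin + shift, a.end + shift));
                }
            }
        }
        return new Merged(text.toString(), spans);
    }
}
```

The point of the sketch is the ordering: the full output text is assembled
before anything is written to a Sofa, which sidesteps the immutability
restriction, and each span is shifted by the difference between where its
sentence started in the source and where it starts in the output.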

One problem there is that the intermediate CAS would continue to grow
in size, so there would need to be some point when it could be reset.
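
One way to keep the intermediate state bounded, sketched below in plain
Java, is to retain only the best M candidates seen so far and to clear the
buffer whenever the output CASes are produced. This assumes a score is
already available for each sentence when it is seen; TopSentenceBuffer,
Scored, offer() and drain() are hypothetical names, not UIMA API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Keeps at most `capacity` highest-scoring sentences; not a UIMA class. */
final class TopSentenceBuffer {

    /** A candidate sentence with its (assumed precomputed) score. */
    static final class Scored {
        final String text;
        final double score;

        Scored(String text, double score) {
            this.text = text;
            this.score = score;
        }
    }

    private static final Comparator<Scored> BY_SCORE =
            Comparator.comparingDouble((Scored s) -> s.score);

    private final int capacity;
    // Min-heap: the lowest-scoring retained candidate sits at the head.
    private final PriorityQueue<Scored> heap = new PriorityQueue<>(BY_SCORE);

    TopSentenceBuffer(int capacity) {
        if (capacity < 1) {
            throw new IllegalArgumentException("capacity must be at least 1");
        }
        this.capacity = capacity;
    }

    /** Adds a candidate and evicts the lowest-scoring one if over capacity. */
    void offer(String text, double score) {
        heap.add(new Scored(text, score));
        if (heap.size() > capacity) {
            heap.poll();
        }
    }

    int size() {
        return heap.size();
    }

    /** Returns the retained candidates, best first, and resets the buffer. */
    List<Scored> drain() {
        List<Scored> out = new ArrayList<>(heap);
        out.sort(BY_SCORE.reversed());
        heap.clear();
        return out;
    }
}
```

In a CAS multiplier, offer() would roughly correspond to work done in
process() and drain() to the point where the M output CASes are generated,
which is also the natural moment to reset an intermediate CAS.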


>
>   (I'm somewhat tempted to cut my losses short (much too late) and
> abandon UIMA flow control altogether, using only simple pipelines and
> having custom glue code to connect these together, as it seems like
> getting the flow to work in interesting cases is a huge time sink and in
> retrospect, it could never pay off any abstract advantage of easier
> distributed processing (where you probably end up having to chop up the
> pipeline manually anyway).  I would probably never recommend new UIMA
> users to strive for a single pipeline with CAS multipliers/mergers and
> begin to consider these features an evolutionary dead end rather than
> advantageous.  Not sure if there even *are* any other real users using
> advanced flows besides me and DeepQA.  I'll be glad to hear any opinions
> on this!)
>
>
The definite advantage of encapsulating analytics in standard UIMA
components is easy scalability via the vertical and horizontal scale-out
options offered by UIMA-AS and DUCC. Flexibility in chopping up a
pipeline into services as needed is another advantage.

The previously mentioned GALE multimodal application also converted
sequences of N input CASes to M output CASes. In that case each input
CAS represented 2 minutes' worth of speech-to-text transcription of
broadcast news, and each output CAS represented a single news story.
The story CASes then went through a pipeline that identified the story
and updated a pre-existing summarization for each story.

Eddie

--
>                                 Petr Baudis
>         If you have good ideas, good data and fast computers,
>         you can do almost anything. -- Geoffrey Hinton
>
