uima-user mailing list archives

From Eddie Epstein <eaepst...@gmail.com>
Subject Re: UIMA-AS and CasManager.defineCasPool() was called twice by the same Analysis Engine
Date Sat, 20 Jun 2009 18:02:47 GMT
Hi Jörn,

Please see comments below..

On Fri, Jun 19, 2009 at 8:44 PM, Jörn Kottmann<kottmann@gmail.com> wrote:
> Thanks for your reply Jaroslaw, it seems that I misunderstood
> the way UIMA AS works.
>
>> 1)
>> "... Because the AAE is not thread safe uima as must scale it through
>> creating multiple instances of it..."
>>
>> Since the AAE is not thread safe you should not try to scale it out in the
>> same JVM. If AAE is not thread safe, you should only have one instance of
>> it per JVM. You can scale it by starting multiple JVMs.
>>
>
> I reduced my AAE to three delegate AEs:
>
> 1. HBaseCasMultiplier -> fetches the actual text from hbase
> 2. Tokenizer -> adds tokens to my CAS
> 3. HBaseWriter -> writes the tokens back into hbase
>
> These delegates are not thread safe; to scale these AEs,
> one instance per worker thread must be created.
> That's what I want UIMA AS to do for me, so I think that's
> also the case which is described in the documentation in 1.4.1:
>
> "... The classes for annotators and flow controllers do not need to be
> "thread-safe" with respect to their instance data - meaning, they do not
> need to be implemented with synchronization locks for access to their
> instance data, because each instance will only be called using one thread
> at a time. Scale out for these classes is done using multiple instances of
> the class. ..."
>

That documentation is correct, but apparently not as clear as we'd like.

Note that the following paragraph in the documentation goes on to say
 "However, if you have class "static" fields shared by all instances,
  or other kinds of external data shared by all instances (such as a
  writable file), you must be aware of the possibility of multiple threads
  accessing these fields or external resources, running on separate
  instances of the class, and do any required synchronization for these."

So, barring any static fields or resources that would cause problems with
multiple instantiations, UIMA AS scaleout in the same JVM should work.

>> 2)
>> "...I must admit the documentation confused me a bit about the meaning of
>> the async attribute..."
>>
>> The async attribute is only used for aggregates, and specifies that this
>> aggregate will be run asynchronously (with input queues in front of all of
>> its delegates) or not. If you choose async="false" it means that you want to
>> deploy the aggregate synchronously, meaning it will be single-threaded. To
>> UIMA AS a synchronous aggregate is the same as a UIMA primitive AE.
>>
>
> Thanks, understood the difference, so I want async="true"
>
>> 3)            ...
>>            <analysisEngine key="TextAnalysis" async="false">
>>                <scaleout numberOfInstances="8" />
>>
>>                <delegates>
>>                    <analysisEngine key="HBaseCasMultiplier">
>>                        <casMultiplier poolSize="8"/>
>>                    </analysisEngine>
>>                </delegates>
>>            </analysisEngine>
>>            ...
>>
>> The above is an inconsistent configuration. You are specifying that
>> "TextAnalysis" should be deployed synchronously but then adding delegate
>> configuration, which forces the aggregate to be deployed asynchronously.
>> The delegates of a synchronous aggregate are not "visible" to UIMA AS and
>> cannot be configured in the deployment descriptor.
>>
>

Hmm, not clear to me that you want async=true. Assuming that your
AE runs correctly as a single threaded aggregate, creating multiple
instances of this seems fine. The correction to your previous deployment
descriptor would just be:

          <analysisEngine key="TextAnalysis" async="false">
              <scaleout numberOfInstances="8" />
          </analysisEngine>

From the UIMA AS point of view, this component is not a CasMultiplier
because [I assume] it consumes new CASes internally and does not
return them.
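
To show where the corrected <analysisEngine> fragment above sits, here is
a rough sketch of a complete deployment descriptor for the synchronous
case. The service name, queue name, broker URL, descriptor path and CAS
pool size are placeholders made up for illustration, not values from
your setup:

```xml
<analysisEngineDeploymentDescription
    xmlns="http://uima.apache.org/resourceSpecifier">
  <name>TextAnalysis service (placeholder name)</name>
  <description>Synchronous aggregate, scaled by instance count</description>
  <deployment protocol="jms" provider="activemq">
    <!-- Assumed value: size of the service's pool of CASes for incoming requests -->
    <casPool numberOfCASes="8"/>
    <service>
      <!-- Assumed queue name and broker URL -->
      <inputQueue endpoint="TextAnalysisQueue"
                  brokerURL="tcp://localhost:61616"/>
      <topDescriptor>
        <!-- Assumed path to the aggregate's AE descriptor -->
        <import location="desc/TextAnalysis.xml"/>
      </topDescriptor>
      <!-- async="false": no delegate configuration is given here;
           the whole aggregate is replicated as a unit -->
      <analysisEngine key="TextAnalysis" async="false">
        <scaleout numberOfInstances="8"/>
      </analysisEngine>
    </service>
  </deployment>
</analysisEngineDeploymentDescription>
```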

Let me emphasize that, before AS scaleout, the aggregate should be tested
as a simple UIMA aggregate with the normal tools like CVD, runAE,
or a custom driver.

> Ok, I changed it to fit the case described above:
>           <analysisEngine>
>               <delegates>
>                   <analysisEngine key="HBaseCasMultiplier">
>                       <casMultiplier poolSize="4"/>
>                       <scaleout numberOfInstances="2" />
>                   </analysisEngine>
>                   <analysisEngine key="Tokenizer">
>                       <scaleout numberOfInstances="4" />
>                   </analysisEngine>
>                   <analysisEngine key="HBaseWriter">
>                       <scaleout numberOfInstances="4" />
>                   </analysisEngine>
>               </delegates>
>           </analysisEngine>
>
> I would like to scale the HBaseCasMultiplier to more than two
> threads, because there is a short delay when reading from hbase.
> First, I am not sure which value I should choose for the
> Cas Multiplier pool size. If numberOfInstances gets larger
> than two I get a few exceptions (stack trace below) when UIMA AS
> starts to process the first documents. So I think I am doing something
> wrong here. And what is the minimal possible casPoolSize, since
> I need CAS instances for my 4 Tokenizers, 4 HBaseWriters
> and 4 (?) for the CAS Multiplier, which would result in a minimum
> size of 12, right?
>
> The HBaseCasMultiplier gets one CAS which contains the id and
> then outputs one CAS which contains an actual text.
>

Supporting the complexities raised by Cas multipliers has been quite
challenging. I'm pretty sure that a co-located CM cannot be scaled; we
need to check this and clarify the situation. (This is different from having
more than one CM in the same aggregate, which is supported with
the latest code.)

Here is a possible workaround to run this aggregate asynchronously.
If I understand your scenario, each input Cas is tiny, the CM creates
a new Cas with the document to be processed and consumed by
HBaseWriters, and finally the aggregate returns just the tiny input Cas.
The workaround is to have the CM create a new Cas, but not fetch
the document. Add a new delegate immediately following the CM,
say CorpusReader, which fills the new CASes with documents and can
be scaled out as desired.
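
In deployment-descriptor terms, the workaround might look roughly like
this. CorpusReader is just the name suggested above, and the instance
counts and pool size are placeholders to tune, not recommendations:

```xml
<analysisEngine async="true">
  <delegates>
    <!-- Single instance, no scaleout: creates the new CAS but does not
         fetch the document text -->
    <analysisEngine key="HBaseCasMultiplier">
      <casMultiplier poolSize="8"/>
    </analysisEngine>
    <!-- New delegate (hypothetical key): fills each new CAS with its
         document; this is where the hbase read delay is absorbed -->
    <analysisEngine key="CorpusReader">
      <scaleout numberOfInstances="4"/>
    </analysisEngine>
    <analysisEngine key="Tokenizer">
      <scaleout numberOfInstances="4"/>
    </analysisEngine>
    <analysisEngine key="HBaseWriter">
      <scaleout numberOfInstances="4"/>
    </analysisEngine>
  </delegates>
</analysisEngine>
```

Note that the order in which the delegates run comes from the flow in
the aggregate's own AE descriptor, so CorpusReader would also have to be
added there, directly after the CM.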

Regards,
Eddie
