Date: Fri, 09 Jan 2015 15:50:34 +0100
From: Peter Klügl <pkluegl@uni-wuerzburg.de>
To: user@uima.apache.org
Subject: Re: Ruta parallel execution

As for reusing the tokenization, maybe we should add something like the
following logic (with reuse as the default):

for each seeder
  if there are annotations of my seeding types
    if new config param is true // for partial or corrupt tokenizations
      remove all seeding annotations and generate them anew
    else
      do nothing
  else // no tokenization yet
    generate seeding annotations
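
The logic above could be sketched roughly as follows. This is a minimal,
self-contained Python model and not Ruta code: the Seeder class, its
seeding_types attribute, the regenerate flag (standing in for the proposed
new config param) and the dict-based annotation index are all hypothetical
names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in for a CAS annotation index: type name -> annotations.
AnnotationIndex = Dict[str, List[str]]


@dataclass
class Seeder:
    """Hypothetical seeder: knows its seeding types and how to tokenize."""
    seeding_types: List[str]
    tokenize: Callable[[str], AnnotationIndex]


def apply_seeders(text, index, seeders, regenerate=False):
    """Reuse existing seeding annotations unless `regenerate` is set.

    `regenerate` models the proposed new config param (default: reuse).
    """
    for seeder in seeders:
        has_seeds = any(index.get(t) for t in seeder.seeding_types)
        if has_seeds and not regenerate:
            continue  # tokenization already present: do nothing
        if has_seeds:
            # partial or corrupt tokenization: drop it before regenerating
            for t in seeder.seeding_types:
                index.pop(t, None)
        # no tokenization yet, or it was just removed: generate it
        for type_name, annotations in seeder.tokenize(text).items():
            index.setdefault(type_name, []).extend(annotations)
    return index


def whitespace_tokens(text):
    # Toy tokenizer used only for this example.
    return {"W": text.split()}


seeder = Seeder(seeding_types=["W"], tokenize=whitespace_tokens)
fresh = apply_seeders("a b c", {}, [seeder])
reused = apply_seeders("a b c", {"W": ["a"]}, [seeder])
redone = apply_seeders("a b c", {"W": ["a"]}, [seeder], regenerate=True)
```

Here fresh and redone end up with a full tokenization, while reused keeps
the existing (possibly partial) annotations untouched, which is why the flag
would be needed for partial or corrupt tokenizations.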

Best,

Peter


On 09.01.2015 at 15:44, Peter Klügl wrote:
> Hi,
>
> On 09.01.2015 at 15:28, Silvestre Losada wrote:
>> Hi Peter
>>
>> I missed this email. I see your point about analysis engines arbitrarily
>> changing the annotations; however, that can already happen now if a script
>> uses the EXEC action to execute an external analysis engine. I think an extra
>> parameter could be added to Ruta to specify whether the Ruta tokenization,
>> RutaAnnotations and RutaStream can be reused. I think it may be possible
>> to reuse the Ruta tokenization (annotation stream) across the same CAS.
> Yes, this should be possible, or let me put it this way: the
> tokenization of one seeder should be reused in any case. Other scripts
> may apply additional seeders, but that probably won't be the common
> case. Reusing the RutaStream will be complicated, especially for
> multi-view/CAS-multiplier pipelines. I think the best way is to share
> and update the RutaBasics.
>
> There are many options to improve the performance when applying several
> analysis engines in a normal UIMA pipeline. Especially the internal
> indexing should be improved. The main reason why these improvements are
> not yet implemented can probably be found in our use cases (no parallel
> execution, applying one complex script, no need for high performance).
>
> I am open to all improvements. In my opinion, we should create a test
> pipeline as a unit test and then optimize all aspects.
>
> Best,
>
> Peter
>
>
>> Best Silvestre.
>>
>> On 31 December 2014 at 13:31, Peter Klügl <pkluegl@uni-wuerzburg.de> wrote:
>>
>>> On 29.12.2014 at 16:24, Silvestre Losada wrote:
>>>
>>>> Thanks for your answer. I was working this way, and it seems to be the best
>>>> approach. The problem is that I need to set up several RutaEngines in the
>>>> pipeline. It would be nice if the RutaStream, or at least the generated Ruta
>>>> annotations, could be reused from one RutaEngine to another in the same
>>>> pipeline, to avoid duplicated information. If you wish, I can implement it
>>>> and submit a patch.
>>>>
>>> Oh yes, this causes a real slowdown when applying several scripts within a
>>> pipeline. All help is welcome :-)
>>>
>>> The main problem is that ruta requires additional indexing information for
>>> conditions like PARTOF (which otherwise would be terribly slow). I don't
>>> think that reusing the RutaStream would help because there could be an
>>> arbitrary analysis engine changing arbitrary annotations. The RutaBasic
>>> annotations are already reused to some extent, but the indexing is done
>>> again. My first guess would be that we add another configuration parameter
>>> with a list of all types that analysis engines applied after the last ruta
>>> engine may have changed. Some helper methods could set these values
>>> automatically given a pipeline. We could also use the capabilities of the
>>> engines, but I am not sure that they are always correctly set.
>>>
>>> What do you think?
>>>
>>> Best,
>>>
>>> Peter
>>>
>>>
>>>
>>>> Kind regards.
>>>>
>>>> On 19 December 2014 at 17:54, Peter Klügl <pkluegl@uni-wuerzburg.de>
>>>> wrote:
>>>>
>>>>> On 19.12.2014 at 15:10, Silvestre Losada wrote:
>>>>>> Hi Jens,
>>>>>>
>>>>>> First of all, thanks for your detailed answer. UIMA Ruta has an option to
>>>>>> execute an analysis engine from a Ruta script, described here:
>>>>>> <http://goo.gl/ekbhv8>. So inside the script you can execute the analysis
>>>>>> engine and then apply some rules to the annotations created by the
>>>>>> analysis engine. What I want is the option to execute the analysis
>>>>>> engines in parallel to save time. Would that be possible?
>>>>>>
>>>>> That's not possible in the sense of using more or different processes for
>>>>> the contained analysis engine than for the Ruta script. The analysis
>>>>> engine and the rules can only be parallelized together as one analysis
>>>>> engine, namely that of the script.
>>>>>
>>>>> You should probably extract the analysis engine into a pipeline, which
>>>>> applies the analysis engine and then the script (resp. its analysis
>>>>> engine). Then, the normal UIMA-AS setting applies.
>>>>>
>>>>> Best,
>>>>>
>>>>> Peter
>>>>>
>>>>>
>>>>>> Kind regards
>>>>>>
>>>>>> On 19 December 2014 at 12:35, Jens Grivolla <j+asf@grivolla.net> wrote:
>>>>>>
>>>>>>> Hi Silvestre,
>>>>>>>
>>>>>>> there doesn't seem to be anything RUTA-specific in your question. In
>>>>>>> principle, UIMA-AS allows parallel scaleout and merges the results (though
>>>>>>> I personally have never used it this way), but there are of course a few
>>>>>>> things to take into account.
>>>>>>>
>>>>>>> First, you will need to properly define the dependencies between your
>>>>>>> different analysis engines to ensure you always have all the necessary
>>>>>>> information available, meaning that you can only run things in parallel
>>>>>>> that are independent of one another. Then you will have to see whether
>>>>>>> the overhead from distributing your CAS to several engines running in
>>>>>>> parallel and then merging the results is not greater than just having it
>>>>>>> in one colocated pipeline that can pass the information more efficiently.
>>>>>>> I guess you'll have to benchmark your specific application, but maybe
>>>>>>> somebody with more experience can give you some general directions...
>>>>>>>
>>>>>>> Best,
>>>>>>> Jens
>>>>>>>
>>>>>>> On Thu, Dec 18, 2014 at 12:26 PM, Silvestre Losada <
>>>>>>> silvestre.losada@gmail.com> wrote:
>>>>>>>
>>>>>>>> Well let me explain.
>>>>>>>>
>>>>>>>> Ruta scripts are really good for working over the output of analysis
>>>>>>>> engines: each analysis engine does some atomic work, and with Ruta rules
>>>>>>>> you can easily work over the generated annotations, combine them, remove
>>>>>>>> them... What I need is to execute several analysis engines in parallel to
>>>>>>>> improve the response time. Right now the analysis engines are executed
>>>>>>>> sequentially; I want to execute them in parallel, then take the output of
>>>>>>>> all of them and apply some Ruta rules to it. Would that be possible?
>>>>>>>>
>>>>>>>> On 17 December 2014 at 18:13, Peter Klügl <pkluegl@uni-wuerzburg.de>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I haven't used UIMA-AS (with ruta) in a real application yet, but I
>>>>>>>>> tested it once for an rc. Did you face any problems?
>>>>>>>>>
>>>>>>>>> Best
>>>>>>>>>
>>>>>>>>> Peter
>>>>>>>>>
>>>>>>>>> On 17.12.2014 at 14:34, Silvestre Losada wrote:
>>>>>>>>>
>>>>>>>>>> Hi All,
>>>>>>>>>>
>>>>>>>>>> Is there any way to execute Ruta scripts in parallel, using the UIMA-AS
>>>>>>>>>> approach? If so, could you provide an example?
>>>>>>>>>>
>>>>>>>>>> Kind regards.
>>>>>>>>>>
>>>>>>>>>>