uima-user mailing list archives

From Mario Gazzo <mario.ga...@gmail.com>
Subject Re: Error handling in flow control
Date Mon, 27 Apr 2015 18:58:02 GMT
Thanks Eddie,

I think I need to look deeper into CasMultipliers and UIMA-AS, but it sounds more complicated
than I had hoped. I have something working without CAS multipliers now, and it can get me all
the way if I initially just combine it with snapshots stored in the new compressed binary CAS
format. I will therefore now work on other, more crucial parts to get everything wired up
first before digging deeper into this, but I might get back to you about it once I have
investigated it further. Your input has been valuable and gives me something to work with.

Much appreciated,

> On 26 Apr 2015, at 18:26 , Eddie Epstein <eaepstein@gmail.com> wrote:
> Very clear, thanks. A CasMultiplier has the ability to deserialize a CAS
> from file and emit it as a child CAS. A parent CAS could carry a
> FeatureStructure identifying it as one to be rerun from some specific state
> (a CAS file); the CM would trigger on the FS and produce the child CAS to be
> reprocessed; the flow controller would be configured to return the child from
> the aggregate; and the client would then use the child and ignore the parent.
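A minimal sketch of this pattern, using plain Java stand-ins rather than the real UIMA classes (an actual implementation would extend JCasMultiplier_ImplBase and deserialize the snapshot with CasIOUtils.load()); the record fields and the snapshot path are hypothetical:

```java
import java.util.List;

// Toy model (not the real UIMA API) of the rerun pattern described above:
// a parent CAS carries a marker naming a snapshot file; the multiplier
// triggers on that marker and emits a child rebuilt from the snapshot.
public class RerunMultiplier {
    // Stand-in for a FeatureStructure flagging the CAS for rerun.
    public record Cas(String id, String rerunSnapshot) {}

    // Returns the children to emit for one incoming parent CAS.
    public static List<Cas> process(Cas parent) {
        if (parent.rerunSnapshot() == null) {
            return List.of();              // nothing to rerun
        }
        // In real UIMA: CAS child = getEmptyCAS();
        //               CasIOUtils.load(snapshotStream, child);
        Cas child = new Cas(parent.id() + "-child", null);
        return List.of(child);
    }

    public static void main(String[] args) {
        Cas plain = new Cas("doc1", null);
        Cas flagged = new Cas("doc2", "doc2-snapshot.bin");
        System.out.println(process(plain).size());   // 0
        System.out.println(process(flagged).size()); // 1
    }
}
```

The flow controller would then route only such children back out of the aggregate, while the client ignores the parent.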
> An ideal threading solution would be to use UIMA-AS. Unfortunately a
> UIMA-AS service currently requires an AMQ broker for service input and
> output. It is possible to embed both broker and service in the same
> process; it is just a complication and adds serialization overhead.
> Another thing to consider is the relatively new binary compressed
> CAS form 6, which can save considerable space over zip-compressed XmiCas.
> Form 6 has the same ability as XmiCas to be deserialized into a CAS with a
> different but compatible type system.
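A sketch of writing and re-reading a form 6 snapshot (assumes uimaj-core 2.9+ on the classpath, where CasIOUtils is available; not runnable standalone). COMPRESSED_FILTERED is the SerialFormat constant for form 6, and load() succeeds as long as the target CAS has a compatible type system:

```java
import java.io.*;
import org.apache.uima.cas.CAS;
import org.apache.uima.cas.SerialFormat;
import org.apache.uima.util.CasIOUtils;

public class Form6Snapshot {
    // Serialize a CAS snapshot in binary compressed form 6.
    public static void save(CAS cas, File f) throws IOException {
        try (OutputStream out = new FileOutputStream(f)) {
            CasIOUtils.save(cas, out, SerialFormat.COMPRESSED_FILTERED);
        }
    }

    // Restore a snapshot into a fresh CAS; the format is auto-detected.
    public static void restore(File f, CAS target) throws IOException {
        try (InputStream in = new FileInputStream(f)) {
            CasIOUtils.load(in, target);
        }
    }
}
```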
> Hope this helps,
> Eddie
> On Sat, Apr 25, 2015 at 2:58 AM, Mario Gazzo <mario.gazzo@gmail.com> wrote:
>> My apologies for not being very clear.
>> I managed to get the basic flow control to work after modifying some AEs to
>> check for a previously installed sofa before just adding another.
>> The services I mentioned are not UIMA related but we are migrating
>> existing text analysis components to UIMA and these need to integrate with
>> a larger existing setup that relies on various AWS services such as S3,
>> DynamoDB, Simple Workflow and EMR. We have no plans as such to use
>> UIMA-AS or Vinci; instead we already use AWS Simple Workflow (SWF) to
>> orchestrate all our workers. This means that we just wanted to run multiple
>> UIMA pipelines inside some of these workers using multithreaded CPE. I am
>> now trying to implement this integration by consuming activity tasks from
>> SWF through a collection reader and then have a flow control manage the
>> logic and respond back when the AAE pipeline has completed or failed. This
>> is where I had problems when experimenting with failure handling.
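A skeleton of the reader side of this integration (assumes uimaj-core and the AWS SDK on the classpath; not runnable standalone). The UIMA callbacks are the real CollectionReader_ImplBase methods; the SWF polling is left as comments since the exact client wiring is deployment-specific:

```java
import java.io.IOException;
import org.apache.uima.cas.CAS;
import org.apache.uima.collection.CollectionException;
import org.apache.uima.collection.CollectionReader_ImplBase;
import org.apache.uima.util.Progress;

// Feeds SWF activity tasks into the pipeline, one task per CAS.
public class SwfTaskReader extends CollectionReader_ImplBase {
    private String pendingTaskInput; // set by the SWF poll

    @Override
    public boolean hasNext() throws IOException, CollectionException {
        // pendingTaskInput = swf.pollForActivityTask(...).getInput();
        return pendingTaskInput != null;
    }

    @Override
    public void getNext(CAS cas) throws IOException, CollectionException {
        cas.setDocumentText(pendingTaskInput); // or load a snapshot from S3
        pendingTaskInput = null;
    }

    @Override
    public Progress[] getProgress() { return new Progress[0]; }

    @Override
    public void close() throws IOException {}
}
```

The flow controller then reports success or failure back to SWF when the AAE finishes with the CAS.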
>> We are storing output from these workers on S3 and in DynamoDB tables for
>> use further downstream in our workflow and online applications. We also
>> store intermediate results (snapshots) on S3 so that we can at any point go
>> back to a previous step and resume, retry or redo processing but it also
>> allows us to inspect data for debugging/analysis purposes. I thought that I
>> might be able to do something similar within the CPE using the CAS but this
>> isn't that simple. E.g. running the same AE twice against the same CAS
>> would result in those annotations occurring twice without carefully
>> designing around this. I can still serialize snapshot CAS to XMI on S3 but
>> I can’t just load them again in order to restore them back to a previous
>> state within the same CPE flow. Instead I would have to fail and initiate a
>> retry through SWF, which would cause the previous state to be loaded from
>> S3 into a new CAS via the next worker that receives the retry activity task
>> through its collection reader. However, storing many snapshot CAS outputs
>> will, even compressed, take a lot more space than the format we are using in
>> our production setup now, so I am considering alternative approaches, but so
>> far they all appear much more complex and brittle.
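A toy sketch of the snapshot bookkeeping this implies: deterministic per-step keys so a retry worker can find the latest usable snapshot. The bucket layout, class and method names are all hypothetical:

```java
import java.util.Set;

public class SnapshotIndex {
    // Deterministic snapshot key per work item and pipeline step, so a
    // retry worker can locate the right snapshot on S3.
    public static String key(String itemId, int step) {
        return "snapshots/" + itemId + "/step-" + step + ".bin";
    }

    // Given the set of steps that have snapshots, pick the step to
    // resume from: the highest completed step, or -1 to start over.
    public static int resumeStep(Set<Integer> completedSteps) {
        return completedSteps.stream().max(Integer::compare).orElse(-1);
    }

    public static void main(String[] args) {
        System.out.println(key("doc42", 2));          // snapshots/doc42/step-2.bin
        System.out.println(resumeStep(Set.of(0, 1))); // 1
        System.out.println(resumeStep(Set.of()));     // -1
    }
}
```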
>> Indeed CAS multipliers would be useful for us, but the limitations of the
>> CPE and the general difficulties I have experienced so far have made me
>> consider implementing a custom multithreaded collection processor, which I
>> wanted to avoid.
>> Hope this clarifies what I am trying to do. Cheers :)
>>> On 24 Apr 2015, at 16:50 , Eddie Epstein <eaepstein@gmail.com> wrote:
>>> Can you give more details on the overall pipeline deployment? The initial
>>> description mentions a CPE and it mentions services. The CPE was created
>>> before flow controllers or CasMultipliers existed and has no support for
>>> them. Services could be Vinci services for the CPE, or UIMA-AS services, or
>>> ???
>>> On Fri, Apr 24, 2015 at 5:37 AM, Mario Gazzo <mario.gazzo@gmail.com> wrote:
>>>> I am trying to get error handling to work with a custom flow control. I
>>>> need to send status information back to a service after the flow completed,
>>>> either with or without errors, but I can only do this once for any workflow
>>>> item because it changes the state of the job, at least without error
>>>> replies and wasteful requests. The problem is that I need to do several
>>>> retries before finally failing and reporting the status to a service. First
>>>> I tried to let the CPE do the retry for me by setting the max error count,
>>>> but then a new flow object is created every time and I lose track of the
>>>> number of retries before this. This means that I don’t know when to report
>>>> the status to the service because it should only happen after the final
>>>> retry.
>>>> I then tried to let the flow instance manage the retries by moving back to
>>>> the previous step again, but then I get the error
>>>> “org.apache.uima.cas.CASRuntimeException: Data for Sofa feature
>>>> setLocalSofaData() has already been set”, which is because the document
>>>> text is set in this particular test case. I then also tried to reset the
>>>> CAS completely before retrying the pipeline from scratch, and this of
>>>> course throws the error “CASAdminException: Can't flush CAS, flushing is
>>>> disabled.”. It would be less wasteful if only the failed step were retried
>>>> instead of the whole pipeline, but this requires clean-up, which in some
>>>> cases might be impossible. It appears that managing errors can be rather
>>>> complex because the CAS can be in an unknown state and an analysis engine
>>>> operation is not idempotent. I probably need to start the whole pipeline
>>>> from the start if I want more than a single attempt, which gets me back to
>>>> the problem of tracking the number of attempts before reporting back to the
>>>> service.
>>>> Does anyone have any good suggestions on how to do this in UIMA, e.g.
>>>> passing state information from a failed flow to the next flow attempt?
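One way around the lost retry count, consistent with the FeatureStructure marker suggested elsewhere in the thread, is to carry the attempt count with the work item itself (as an FS on the CAS, or a field in the SWF task payload) rather than in the Flow object the CPE recreates on each retry. A toy stdlib sketch; the class name and constant are hypothetical:

```java
// The attempt count travels with the work item, so a freshly created
// flow can still decide whether to retry or to report final failure.
public class RetryState {
    public static final int MAX_ATTEMPTS = 3;

    // True if the item should be retried, false if the final status
    // should be reported back to the service.
    public static boolean shouldRetry(int attemptsSoFar) {
        return attemptsSoFar < MAX_ATTEMPTS;
    }

    public static void main(String[] args) {
        int attempts = 0;
        while (!process() && shouldRetry(++attempts)) {
            // reload the snapshot into a fresh CAS and try again
        }
        if (attempts >= MAX_ATTEMPTS) {
            System.out.println("report failure after " + attempts + " attempts");
        }
    }

    private static boolean process() { return false; } // simulated failure
}
```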
