uima-user mailing list archives

From Mario Gazzo <mario.ga...@gmail.com>
Subject Re: Error handling in flow control
Date Sat, 25 Apr 2015 06:58:23 GMT
My apologies for not being very clear.

I managed to get the basic flow control to work after modifying some AEs to check for a
previously installed sofa before adding another.
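
For reference, the guard I ended up adding looks roughly like this. It is only a sketch: `JCasAnnotator_ImplBase` and `setDocumentText` are standard UIMA, but the class name and `loadText()` are stand-ins for our actual component and text source.

```java
import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.jcas.JCas;

public class GuardedSofaAnnotator extends JCasAnnotator_ImplBase {
  @Override
  public void process(JCas jcas) throws AnalysisEngineProcessException {
    // Sofa data may only be set once per CAS view; setting it again throws
    // CASRuntimeException, so skip the assignment on a retried flow.
    if (jcas.getDocumentText() == null) {
      jcas.setDocumentText(loadText());
    }
  }

  // placeholder for wherever the document text actually comes from
  private String loadText() {
    return "...";
  }
}
```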

The services I mentioned are not UIMA related. We are migrating existing text analysis
components to UIMA, and these need to integrate with a larger existing setup that relies on
various AWS services such as S3, DynamoDB, Simple Workflow and EMR. We have no plans as such
to use UIMA-AS or Vinci; instead we already use AWS Simple Workflow (SWF) to orchestrate
all our workers. This means we just wanted to run multiple UIMA pipelines inside some
of these workers using a multithreaded CPE. I am now trying to implement this integration by
consuming activity tasks from SWF through a collection reader and then having a flow controller
manage the logic and respond back when the AAE pipeline has completed or failed. This is where
I ran into problems when experimenting with failure handling.
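
For concreteness, the reader side of this looks roughly like the sketch below. It is only an outline under my assumptions: it uses the AWS SDK v1 SWF client, the domain and task-list names ("text-analysis", "uima-workers") are placeholders, and in reality the task input would likely be a pointer to data on S3 rather than the document text itself.

```java
import java.io.IOException;
import org.apache.uima.cas.CAS;
import org.apache.uima.collection.CollectionException;
import org.apache.uima.collection.CollectionReader_ImplBase;
import org.apache.uima.util.Progress;
import com.amazonaws.services.simpleworkflow.AmazonSimpleWorkflow;
import com.amazonaws.services.simpleworkflow.AmazonSimpleWorkflowClientBuilder;
import com.amazonaws.services.simpleworkflow.model.ActivityTask;
import com.amazonaws.services.simpleworkflow.model.PollForActivityTaskRequest;
import com.amazonaws.services.simpleworkflow.model.TaskList;

public class SwfActivityReader extends CollectionReader_ImplBase {
  private final AmazonSimpleWorkflow swf = AmazonSimpleWorkflowClientBuilder.defaultClient();
  private ActivityTask pending;

  @Override
  public boolean hasNext() throws IOException, CollectionException {
    // Long-poll SWF for the next activity task; the response carries a
    // null task token when the poll times out with nothing to do.
    pending = swf.pollForActivityTask(new PollForActivityTaskRequest()
        .withDomain("text-analysis")                       // placeholder domain
        .withTaskList(new TaskList().withName("uima-workers"))); // placeholder task list
    return pending.getTaskToken() != null;
  }

  @Override
  public void getNext(CAS cas) throws IOException, CollectionException {
    // Hand the task input to the pipeline; the task token would also need
    // to be stashed in the CAS so the flow controller can respond to SWF
    // when the AAE completes or fails.
    cas.setDocumentText(pending.getInput());
  }

  @Override
  public Progress[] getProgress() {
    return new Progress[0];
  }

  @Override
  public void close() throws IOException {
    swf.shutdown();
  }
}
```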

We are storing output from these workers on S3 and in DynamoDB tables for use further downstream
in our workflow and online applications. We also store intermediate results (snapshots) on
S3 so that we can go back to a previous step at any point and resume, retry or redo processing;
it also allows us to inspect data for debugging/analysis purposes. I thought I might be
able to do something similar within the CPE using the CAS, but this isn't that simple: running
the same AE twice against the same CAS would, without careful design, result in its annotations
occurring twice. I can still serialize snapshot CASes to XMI on S3, but I can't just load them
again to restore a previous state within the same CPE flow. Instead I would have to fail and
initiate a retry through SWF, which would cause the previous state to be loaded from S3 into
a new CAS by the next worker that receives the retry activity task through its collection
reader. However, storing many snapshot CAS outputs, even compressed, will take far more space
than the format we use in our production setup today, so I am considering alternative
approaches, but so far they all appear much more complex and brittle.
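
The snapshot serialization itself is straightforward with the standard XMI (de)serializers; the sketch below gzips the stream, e.g. for an S3 upload. The caveat above still applies: restoring is only safe into a fresh CAS, not into one that is mid-flow in the same CPE.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import org.apache.uima.cas.CAS;
import org.apache.uima.cas.impl.XmiCasDeserializer;
import org.apache.uima.cas.impl.XmiCasSerializer;
import org.xml.sax.SAXException;

public final class CasSnapshots {
  private CasSnapshots() {}

  // Write a compressed XMI snapshot of the CAS, e.g. to an S3 upload stream.
  public static void save(CAS cas, OutputStream out) throws IOException, SAXException {
    try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
      XmiCasSerializer.serialize(cas, gz);
    }
  }

  // Restore a snapshot into a *fresh* CAS, e.g. one newly obtained by the
  // worker that picked up the retry activity task.
  public static void load(InputStream in, CAS cas) throws IOException, SAXException {
    try (GZIPInputStream gz = new GZIPInputStream(in)) {
      XmiCasDeserializer.deserialize(gz, cas);
    }
  }
}
```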

Indeed CAS multipliers would be useful to us, but the limitations of the CPE and the general
difficulties I have experienced so far have made me consider implementing a custom multithreaded
collection processor, although I wanted to avoid this.

Hope this clarifies what I am trying to do. Cheers :)
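
One idea I am experimenting with for the retry-count problem quoted below: keep the retry bookkeeping outside the flow object, keyed by a stable external id such as the SWF task token, so it survives the CPE recreating the flow on each attempt. A self-contained sketch (the class name and max-retry policy are mine, not anything in the UIMA API):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Retry bookkeeping that survives flow re-creation: the CPE builds a new
// flow object per attempt, so the counter must live outside the flow and
// be keyed by a stable id (e.g. the SWF task token).
public class RetryTracker {
  private final ConcurrentMap<String, Integer> attempts = new ConcurrentHashMap<>();
  private final int maxRetries;

  public RetryTracker(int maxRetries) {
    this.maxRetries = maxRetries;
  }

  /** Record a failure; returns true while another retry is still allowed. */
  public boolean retryAllowed(String taskId) {
    int n = attempts.merge(taskId, 1, Integer::sum);
    return n <= maxRetries;
  }

  /** Forget the item once it finally succeeds or is abandoned. */
  public void clear(String taskId) {
    attempts.remove(taskId);
  }
}
```

The flow controller would consult `retryAllowed()` on each failure and only report the final status back to SWF once it returns false.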

> On 24 Apr 2015, at 16:50 , Eddie Epstein <eaepstein@gmail.com> wrote:
> Can you give more details on the overall pipeline deployment? The initial
> description mentions a CPE and it mentions services. The CPE was created
> before flow controllers or CasMultipliers existed and has no support for
> them. Services could be Vinci services for the CPE, or UIMA-AS services, or
> ???
> On Fri, Apr 24, 2015 at 5:37 AM, Mario Gazzo <mario.gazzo@gmail.com> wrote:
>> I am trying to get error handling to work with a custom flow controller. I
>> need to send status information back to a service after the flow completes,
>> either with or without errors, but I can only do this once for any workflow
>> item because it changes the state of the job, at least not without error
>> replies and wasteful requests. The problem is that I need to do several
>> retries before finally failing and reporting the status to the service. First
>> I tried to let the CPE do the retries for me by setting the max error count,
>> but then a new flow object is created every time and I lose track of the
>> number of retries made before it. This means that I don't know when to report
>> the status to the service, because it should only happen after the final
>> retry.
>> I then tried to let the flow instance manage the retries by moving back to
>> the previous step again but then I get the error
>> “org.apache.uima.cas.CASRuntimeException: Data for Sofa feature
>> setLocalSofaData() has already been set”, which is because the document
>> text is set in this particular test case. I then also tried to reset the
>> CAS completely before retrying the pipeline from scratch and this of course
>> throws the error “CASAdminException: Can't flush CAS, flushing is
>> disabled.”. It would be less wasteful if only the failed step is retried
>> instead of the whole pipeline but this requires clean up, which in some
>> cases might be impossible. It appears that managing errors can be rather
>> complex because the CAS can be in an unknown state and an analysis engine
>> operation is not idempotent. I probably need to start the whole pipeline
>> from the start if I want more than a single attempt, which gets me back to
>> the problem of tracking the number of attempts before reporting back to the
>> service.
>> Does anyone have any good suggestion on how to do this in UIMA e.g.
>> passing state information from a failed flow to the next flow attempt?
