Subject: Re: Error handling in flow control
From: Mario Gazzo
Date: Mon, 27 Apr 2015 20:58:02 +0200
To: user@uima.apache.org

Thanks Eddie,

I think I need to look deeper into CasMultipliers and UIMA-AS but it
sounds more complicated than I hoped. I got something without CAS
multipliers working now and it can get me all the way if I initially
just combine it with snapshots optimised in the new compressed binary
CAS format. I will therefore now be working on other more crucial parts
to get all things wired up first before digging deeper into this but I
might get back to you about it once I have investigated it further. Your
input has been valuable and gives me something to work with.

Much appreciated,
Mario

> On 26 Apr 2015, at 18:26, Eddie Epstein wrote:
>
> Very clear, thanks. A CasMultiplier has the ability to deserialize a
> CAS from file and emit it as a child CAS.
> A parent CAS could have a FeatureStructure identifying it as one to be
> rerun from some specific state (CAS file). The CM would trigger on the
> FS and produce the child CAS to be reprocessed, the flow controller
> would be configured to return the child from the aggregate, and the
> client would then use the child and ignore the parent.
>
> An ideal threading solution would be to use UIMA-AS. Unfortunately a
> UIMA-AS service currently requires an AMQ broker for service input and
> output. It is possible to embed both broker and service in process; it
> is just a complication and comes with serialization overhead.
>
> Another thing to consider is to use the relatively new binary
> compressed CAS form 6, which can save considerable space over zip
> compressed XmiCas. Form 6 has the same ability as XmiCas to be
> deserialized into a CAS with a different but compatible typesystem.
>
> Hope this helps,
> Eddie
>
>
> On Sat, Apr 25, 2015 at 2:58 AM, Mario Gazzo wrote:
>
>> My apologies for not being very clear.
>>
>> I managed to get the basic flow control to work after modifying some
>> AEs to check for a previously installed sofa before just adding
>> another.
>>
>> The services I mentioned are not UIMA related but we are migrating
>> existing text analysis components to UIMA and these need to integrate
>> with a larger existing setup that relies on various AWS services such
>> as S3, DynamoDB, Simple Workflow and EMR. We don’t have plans as such
>> to use UIMA-AS or Vinci but instead we already use AWS Simple Workflow
>> (SWF) to orchestrate all our workers. This means that we just wanted
>> to run multiple UIMA pipelines inside some of these workers using
>> multithreaded CPE. I am now trying to implement this integration by
>> consuming activity tasks from SWF through a collection reader and then
>> have a flow control manage the logic and respond back when the AAE
>> pipeline has completed or failed.
>> This is where I had problems when experimenting with failure
>> handling.
>>
>> We are storing output from these workers on S3 and in DynamoDB tables
>> for use further downstream in our workflow and online applications. We
>> also store intermediate results (snapshots) on S3 so that we can at
>> any point go back to a previous step and resume, retry or redo
>> processing but it also allows us to inspect data for
>> debugging/analysis purposes. I thought that I might be able to do
>> something similar within the CPE using the CAS but this isn't that
>> simple. E.g. running the same AE twice against the same CAS would
>> result in those annotations occurring twice unless the AE is carefully
>> designed around this. I can still serialize snapshot CASes to XMI on
>> S3 but I can’t just load them again in order to restore them back to a
>> previous state within the same CPE flow. Instead I would have to fail
>> and initiate a retry through SWF, which would cause the previous state
>> to be loaded from S3 into a new CAS via the next worker that receives
>> the retry activity task through its collection reader. However,
>> storing many snapshot CAS outputs will, even compressed, take a lot
>> more space than the format we are using in our production setup now,
>> so I am considering whether there are alternative approaches but so
>> far they all appear much more complex and brittle.
>>
>> Indeed CAS multipliers would be useful for us but the limitations of
>> the CPE and the general difficulties I have experienced so far have
>> made me consider implementing a custom multithreaded collection
>> processor, although I wanted to avoid this.
>>
>> Hope this clarifies what I am trying to do. Cheers :)
>>
>>> On 24 Apr 2015, at 16:50, Eddie Epstein wrote:
>>>
>>> Can you give more details on the overall pipeline deployment? The
>>> initial description mentions a CPE and it mentions services.
>>> The CPE was created before flow controllers or CasMultipliers existed
>>> and has no support for them. Services could be Vinci services for the
>>> CPE or UIMA-AS services or ???
>>>
>>> On Fri, Apr 24, 2015 at 5:37 AM, Mario Gazzo wrote:
>>>
>>>> I am trying to get error handling to work with a custom flow
>>>> control. I need to send status information back to a service after
>>>> the flow completed either with or without errors but I can only do
>>>> this once for any workflow item because it changes the state of the
>>>> job, at least without error replies and wasteful requests. The
>>>> problem is that I need to do several retries before finally failing
>>>> and reporting the status to a service. First I tried to let the CPE
>>>> do the retry for me by setting the max error count but then a new
>>>> flow object is created every time and I lose track of the number of
>>>> retries before this. This means that I don’t know when to report the
>>>> status to the service because it should only happen after the final
>>>> retry.
>>>>
>>>> I then tried to let the flow instance manage the retries by moving
>>>> back to the previous step again but then I get the error
>>>> “org.apache.uima.cas.CASRuntimeException: Data for Sofa feature
>>>> setLocalSofaData() has already been set”, which is because the
>>>> document text is set in this particular test case. I then also tried
>>>> to reset the CAS completely before retrying the pipeline from
>>>> scratch and this of course throws the error “CASAdminException:
>>>> Can't flush CAS, flushing is disabled.”. It would be less wasteful
>>>> if only the failed step is retried instead of the whole pipeline but
>>>> this requires clean up, which in some cases might be impossible.
>>>> It appears that managing errors can be rather complex because the
>>>> CAS can be in an unknown state and an analysis engine operation is
>>>> not idempotent. I probably need to start the whole pipeline from the
>>>> start if I want more than a single attempt, which gets me back to
>>>> the problem of tracking the number of attempts before reporting back
>>>> to the service.
>>>>
>>>> Does anyone have any good suggestions on how to do this in UIMA,
>>>> e.g. passing state information from a failed flow to the next flow
>>>> attempt?
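
One possible way to approach the attempt-tracking problem raised in the
original question is to keep the retry count outside the flow object,
keyed by a stable work item identifier, so that the count survives when a
new flow object is created for each retry. The sketch below is plain Java
and uses no UIMA API. The class RetryTracker, its methods and the idea of
using something like the SWF activity task ID as the key are all
assumptions made for illustration; none of them comes from the thread or
from UIMA. It also only works within a single JVM; retries that land on
a different worker would need the count stored externally.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/**
 * Hypothetical helper (not part of UIMA): counts failed attempts per work
 * item outside the flow object, so the count is still available when a new
 * flow object is created for the next attempt.
 */
final class RetryTracker {

    /** What the caller should do after a failed attempt. */
    enum Decision { RETRY, REPORT_FAILURE }

    private final int maxAttempts;
    private final ConcurrentMap<String, Integer> failures = new ConcurrentHashMap<>();

    RetryTracker(int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        this.maxAttempts = maxAttempts;
    }

    /** Records one failed attempt and says whether to retry or report. */
    Decision recordFailure(String workItemId) {
        // merge() increments the per-item counter atomically.
        int count = failures.merge(workItemId, 1, Integer::sum);
        if (count >= maxAttempts) {
            // Final attempt: forget the item so the failure is reported once.
            failures.remove(workItemId);
            return Decision.REPORT_FAILURE;
        }
        return Decision.RETRY;
    }

    /** Clears the counter after a successful run. */
    void recordSuccess(String workItemId) {
        failures.remove(workItemId);
    }

    /** Number of failed attempts currently recorded for the item. */
    int failedAttempts(String workItemId) {
        return failures.getOrDefault(workItemId, 0);
    }
}
```

A flow or collection reader sharing one RetryTracker instance could call
recordFailure(id) when the pipeline fails for an item and send the status
to the external service only when REPORT_FAILURE is returned, or after
recordSuccess(id). How the identifier travels with the CAS is left open
here.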