uima-user mailing list archives

From Erik Fäßler <erik.faess...@uni-jena.de>
Subject Re: Synchronizing Batches AE and StatusCallbackListener
Date Tue, 25 Apr 2017 12:56:19 GMT
Thanks for all the input! I have some reading to do now ;-)



> On 21 Apr 2017, at 23:22, Eddie Epstein <eaepstein@gmail.com> wrote:
> Hi Erik,
> A few words about DUCC and your application. DUCC is a cluster controller
> that includes a resource manager and three applications: batch processing,
> long-running services, and singleton processes.
> The batch processing application consists of a user's CollectionReader,
> which defines work items, and a user's aggregate for processing work items,
> which can be replicated as desired across the cluster of machines. DUCC
> manages the remote process scale-out and the distribution of work items. The
> aggregate can be vertically scaled within each process so that in-heap data
> can be shared by multiple instances of the aggregate. UIMA-AS is not
> required for this simple threading model.
> For most applications a work item is itself a collection, a CAS containing
> references to the data to be processed, where the collection size is
> designed to have small enough granularity to support scale out but big
> enough granularity to avoid bottlenecks.
> The user's aggregate normally has an initial CasMultiplier that reads the
> input data and creates the CASes to be fed to the rest of the pipeline.
> When all child CASes have finished processing, the work item CAS is routed
> to the aggregate's CasConsumer to finalize the collection. DUCC considers
> the work item complete only when the work item CAS is successfully
> processed.
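
As a rough illustration of the work-item pattern described above, a minimal
CasMultiplier might look like the sketch below. The convention of carrying a
newline-separated list of document identifiers in the work-item CAS, and the
way content is fetched, are assumptions made for the example; they are not
prescribed by DUCC or UIMA.

import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;

import org.apache.uima.analysis_component.JCasMultiplier_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.cas.AbstractCas;
import org.apache.uima.jcas.JCas;

/**
 * Minimal work-item CasMultiplier sketch. It assumes the work-item CAS carries
 * a newline-separated list of document identifiers as its document text; that
 * convention and the way content is fetched are made up for this example.
 */
public class WorkItemCasMultiplier extends JCasMultiplier_ImplBase {

  private Iterator<String> docIds = Collections.emptyIterator();

  @Override
  public void process(JCas workItemCas) throws AnalysisEngineProcessException {
    // Each line of the work-item CAS names one document of the collection.
    docIds = Arrays.asList(workItemCas.getDocumentText().split("\n")).iterator();
  }

  @Override
  public boolean hasNext() throws AnalysisEngineProcessException {
    return docIds.hasNext();
  }

  @Override
  public AbstractCas next() throws AnalysisEngineProcessException {
    String docId = docIds.next();
    JCas childCas = getEmptyJCas();
    // A real multiplier would fetch the document content here (file system,
    // database, ...); the sketch only records the identifier.
    childCas.setDocumentText("content of " + docId);
    return childCas;
  }
}
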
> The system is quite robust to errors: uncaught exceptions, analytics
> crashing, machines crashing, etc.
> Regards,
> Eddie
> On Fri, Apr 21, 2017 at 2:12 PM, Olga Patterson <olga.patterson@utah.edu>
> wrote:
>> Erik,
>> My team at the VA has developed an easy way of implementing UIMA AS
>> pipelines and scaling them to a large number of nodes, using the Leo
>> framework, which extends UIMA AS 2.8.1. We have run pipelines on over 200M
>> documents, scaled across multiple nodes with dozens of service instances,
>> and it performs great.
>> Here is some info:
>> http://department-of-veterans-affairs.github.io/Leo/
>> The documentation for Leo reflects an earlier version of the framework. We
>> have not yet released the latest version on the VA GitHub, but if you are
>> interested in using it with Java 8 and UIMA 2.8.1, we can share it with you
>> so that you can test it out and possibly provide your comments back to us.
>> Leo has simple-to-use functionality for flexible batch reading and writing,
>> and it can work with any UIMA AEs and existing descriptor files and type
>> system descriptions, so if you already have a pipeline, wrapping it with Leo
>> would take just a few lines of code.
>> Let me know if you are interested and I can help you to get started.
>> Olga Patterson
>> -----Original Message-----
>> From: Jaroslaw Cwiklik <uimaee@gmail.com>
>> Reply-To: "user@uima.apache.org" <user@uima.apache.org>
>> Date: Friday, April 21, 2017 at 8:08 AM
>> To: "user@uima.apache.org" <user@uima.apache.org>
>> Subject: Re: Synchronizing Batches AE and StatusCallbackListener
>>    Erik, thanks. It is clearer now what you are trying to accomplish.
>>    First, there are no plans to retire the CPE; it is still supported. The
>>    only open issue is ongoing development: my efforts are focused on
>>    extending and improving UIMA-AS.
>>    I don't have an answer yet for how to handle the CPE crash scenario with
>>    respect to batching and a subsequent restart from the last known good
>>    batch. It seems some coordination would be needed to avoid redoing the
>>    whole collection after a crash. It's been a while since I've looked at
>>    the CPE; I will take a look and see what is possible, if anything.
>>    There is another Apache UIMA project called DUCC, which stands for
>>    Distributed UIMA Cluster Computing. From your email it looks like you
>>    have a cluster of machines available. Here is a quick description of
>>    DUCC:
>>    DUCC is a Linux cluster controller designed to scale out any UIMA
>>    pipeline for high-throughput collection processing jobs as well as for
>>    low-latency real-time applications. Building on UIMA-AS, DUCC is
>>    particularly well suited to running large-memory Java analytics in
>>    multiple threads in order to fully utilize multicore machines. DUCC
>>    manages the life cycle of all processes deployed across the cluster,
>>    including non-UIMA processes such as Tomcat servers or VNC sessions.
>>    You can find more info on this here:
>>    https://uima.apache.org/doc-uimaducc-whatitam.html
>>    In UIMA-AS, batching is an application concern. I am a bit fuzzy on the
>>    implementation, so perhaps someone else can comment on how to implement
>>    batching and how to handle errors. You can use a CasMultiplier and a
>>    custom FlowController to manage CASes and react to errors. The UIMA-AS
>>    service can take an input CAS representing your batch, pass it on to the
>>    CasMultiplier, generate CASes for each piece of work, and deliver results
>>    to the CasConsumer, with a FlowController in the middle orchestrating the
>>    flow. I defer to application deployment experts to provide you with more
>>    detail.
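
A custom FlowController of the kind mentioned above could look roughly like
the following sketch. The delegate keys ("CasMultiplier", "Annotator",
"CasConsumer") and the fixed routing are assumptions for the example and would
have to match the keys in the aggregate descriptor.

import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.flow.FinalStep;
import org.apache.uima.flow.Flow;
import org.apache.uima.flow.JCasFlowController_ImplBase;
import org.apache.uima.flow.JCasFlow_ImplBase;
import org.apache.uima.flow.SimpleStep;
import org.apache.uima.flow.Step;
import org.apache.uima.jcas.JCas;

/**
 * Illustrative flow controller for an aggregate of the kind described above:
 * the input (work-item) CAS goes to the CasMultiplier, and each generated CAS
 * runs through the annotator and the CasConsumer. The delegate keys are
 * assumptions for this sketch.
 */
public class BatchFlowController extends JCasFlowController_ImplBase {

  @Override
  public Flow computeFlow(JCas workItemCas) throws AnalysisEngineProcessException {
    // CASes entering the aggregate are routed to the CasMultiplier only.
    return new FixedFlow("CasMultiplier");
  }

  static class FixedFlow extends JCasFlow_ImplBase {
    private final String[] sequence;
    private int position = 0;

    FixedFlow(String... sequence) {
      this.sequence = sequence;
    }

    @Override
    public Step next() {
      if (position < sequence.length) {
        return new SimpleStep(sequence[position++]);
      }
      return new FinalStep();
    }

    @Override
    protected Flow newCasProduced(JCas newCas, String producedBy) {
      // CASes generated by the CasMultiplier run through the rest of the pipeline.
      return new FixedFlow("Annotator", "CasConsumer");
    }

    @Override
    public boolean continueOnFailure(String failedAeKey, Exception failure) {
      // Returning true keeps routing this CAS to the remaining delegates even
      // though one delegate failed; returning false lets the error propagate.
      return true;
    }
  }
}
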
>>    Jerry
>>    On Fri, Apr 21, 2017 at 2:21 AM, Erik Fäßler <erik.faessler@uni-jena.de>
>>    wrote:
>>> Hi Jerry,
>>> thanks a lot for your answer! I’m sorry that I didn’t make myself clearer.
>>> I will try again! :-)
>>> Here comes a lot of text, sorry for that. The post actually has two parts:
>>> the first explaining my issue, the second responding to the pointer to
>>> UIMA-AS.
>>> First: Yes, I use a CPE. I process text documents. Tens of millions of
>>> them.
>>> So, I have the following components to my issue, running with the CPE.
>>> 1. A CAS-Consumer (just an AnalysisEngine internally, of course). This
>>> consumer is responsible for serialising the document CAS into XMI and
>>> sending the XMI to a database. It is an XMI-to-database consumer. For
>>> performance reasons, the XMI of multiple CASes is buffered and then sent as
>>> a batch, let’s say 50 CAS XMIs at a time (a rough sketch of such a consumer
>>> follows below this list).
>>> 2. A CPE StatusCallbackListener which also writes to the same database,
>>> but in another table. It logs into the database which documents have been
>>> successfully processed by the CPE. It also works on a batch basis.
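
A rough sketch of a batching XMI-to-database consumer as described in item 1
might look like this. The sendBatchToDatabase method is a hypothetical
placeholder for the actual database code; the batch size of 50 mirrors the
number in the mail.

import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.cas.impl.XmiCasSerializer;
import org.apache.uima.jcas.JCas;
import org.xml.sax.SAXException;

/**
 * Illustrative batching XMI consumer. The database call is a placeholder, not
 * the actual implementation from the mail.
 */
public class XmiDatabaseConsumer extends JCasAnnotator_ImplBase {

  private static final int BATCH_SIZE = 50;
  private final List<byte[]> xmiBuffer = new ArrayList<>();

  @Override
  public void process(JCas jcas) throws AnalysisEngineProcessException {
    try {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      XmiCasSerializer.serialize(jcas.getCas(), out);
      xmiBuffer.add(out.toByteArray());
    } catch (SAXException e) {
      throw new AnalysisEngineProcessException(e);
    }
    if (xmiBuffer.size() >= BATCH_SIZE) {
      flush();
    }
  }

  @Override
  public void batchProcessComplete() throws AnalysisEngineProcessException {
    flush();
  }

  @Override
  public void collectionProcessComplete() throws AnalysisEngineProcessException {
    flush();
  }

  private void flush() {
    if (!xmiBuffer.isEmpty()) {
      sendBatchToDatabase(xmiBuffer); // hypothetical database call
      xmiBuffer.clear();
    }
  }

  private void sendBatchToDatabase(List<byte[]> batch) {
    // Placeholder: a real implementation would write the XMI blobs to the
    // document table in a single transaction.
  }
}
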
>>> The goal: The CallbackListener should only mark those documents as
>>> successfully processed (i.e. as “finished”) for which the CAS-Consumer has
>>> actually sent the XMI data to the database.
>>> Reason: I don’t want documents marked as “finished” whose XMI data is not
>>> in the database but still in the CAS buffer, because if the pipeline
>>> crashes at that point, the XMI data never gets sent to the database. Then
>>> the processing state is inconsistent: documents that have not been written
>>> into the database are marked as successfully processed, but their data is
>>> missing.
>>> Also, not every XMI gets stored: there is a condition in the consumer that
>>> decides whether the XMI is to be stored or not. Thus, I cannot “create
>>> consistency” by checking which XMIs made it into the database.
>>> Is this easier to understand?
>>> Regarding UIMA-AS:
>>> I tried it out a few years back when it was rather new, UIMA 2.3.1 or
>>> something. Back then, it was like the following:
>>> 1. Install a broker (or something - ActiveMQ was it called?)
>>> 2. Configure it.
>>> 3. Start it.
>>> 4. For each AE you want to use, deploy the AE on some server in your
>>> cluster (multiple AEs can be bundled into an AAE).
>>> 5. Start a reader process that will then fill the broker queue.
>>> 6. Wait until processing is finished.
>>> 7. Stop all the AE services deployed to the cluster, if you want to free
>>> up the resources.
>>> 8. Stop the broker.
>>> This was quite a while back, so perhaps this is not exactly how it was.
>>> But it seemed overly complex to me. I had to log in to each server where I
>>> wanted work to be done. We have like 20 nodes or something. Perhaps I could
>>> write a script for that, but then I would have to keep track of the servers
>>> that are free to use at any given time, because I am not the only one using
>>> the cluster.
>>> And then I have to stop all AE “services”. Until then, they will use
>>> memory because they just idle when there is nothing more to do.
>>> In contrast, CPEs are self-contained projects in my case which I can
>>> distribute easily through our job system (SLURM).
>>> I thought all the setup for UIMA-AS would pay off in better performance.
>>> But in my - admittedly limited - tests there was not much of a performance
>>> difference. CPEs seemed to be a bit faster due to the lack of CAS
>>> serialization between reader and AEs.
>>> Of course, this was years ago. Is the process a bit simpler today? Or
>>> perhaps I got it wrong to begin with, that’s possible. But I read the
>>> documentation back then and couldn’t see how to do things much more simply.
>>> BUT if CPEs can’t solve my issue and UIMA-AS can, then perhaps I will try
>>> it again.
>>> Another question: You said “CPE was replaced by UIMA-AS”. Does that mean
>>> that CPEs will eventually be removed from UIMA? Are they still a part of
>>> UIMA 3?
>>> Sorry for all the text!
>>> Best regards and thanks!
>>> Erik
>>>> On 20 Apr 2017, at 20:31, Jaroslaw Cwiklik <uimaee@gmail.com> wrote:
>>>> Hi Erik, sorry for the delay in responding to your question. This seems
>>>> like a CPE question, is that right? I am not quite following what issue
>>>> you are running into. Could you explain it in more detail? With a clearer
>>>> problem description perhaps others will jump in with an answer :)
>>>> Just FYI, the CPE was replaced by UIMA-AS quite a long time ago. Perhaps
>>>> UIMA-AS can work better for you. You can read about it here:
>>>> https://uima.apache.org/d/uima-as-2.9.0/uima_async_scaleout.html
>>>> Jerry
>>>> UIMA Team
>>>> On Tue, Apr 18, 2017 at 5:56 AM, Erik Fäßler <erik.faessler@uni-jena.de>
>>>> wrote:
>>>>> Hi all,
>>>>> I have a use case where a consumer of mine sends CAS XMI data into a
>>>>> database in batchProcessComplete(). I also use a StatusCallbackListener
>>>>> that logs into the database whether a document has completed processing;
>>>>> this is also done batch-wise.
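
For illustration only, the setup described here could be wired up roughly as
in the following sketch; the descriptor path, the document-id derivation, and
the database calls are hypothetical placeholders, not the actual code.

import java.util.ArrayList;
import java.util.List;

import org.apache.uima.UIMAFramework;
import org.apache.uima.cas.CAS;
import org.apache.uima.collection.CollectionProcessingEngine;
import org.apache.uima.collection.EntityProcessStatus;
import org.apache.uima.collection.StatusCallbackListener;
import org.apache.uima.collection.metadata.CpeDescription;
import org.apache.uima.util.XMLInputSource;

/**
 * Illustrative CPE setup with a batching StatusCallbackListener. Descriptor
 * path, document-id derivation, and database calls are placeholders.
 */
public class CpeWithCompletionLogging {

  public static void main(String[] args) throws Exception {
    CpeDescription cpeDesc = UIMAFramework.getXMLParser()
        .parseCpeDescription(new XMLInputSource("cpe-descriptor.xml")); // assumed path
    CollectionProcessingEngine cpe = UIMAFramework.produceCollectionProcessingEngine(cpeDesc);

    cpe.addStatusCallbackListener(new StatusCallbackListener() {
      private final List<String> finishedDocs = new ArrayList<>();

      @Override
      public void entityProcessComplete(CAS cas, EntityProcessStatus status) {
        if (!status.isException()) {
          finishedDocs.add(deriveDocumentId(cas));
        }
      }

      @Override
      public void batchProcessComplete() {
        // Placeholder: mark the buffered documents as "finished" in the database.
        finishedDocs.clear();
      }

      @Override public void collectionProcessComplete() { }
      @Override public void initializationComplete() { }
      @Override public void aborted() { }
      @Override public void paused() { }
      @Override public void resumed() { }
    });

    cpe.process();
  }

  private static String deriveDocumentId(CAS cas) {
    // Placeholder: a real listener would read the id from a metadata annotation.
    return Integer.toHexString(System.identityHashCode(cas));
  }
}
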
>>>>> Now the issue is: if the pipeline crashes for any reason, I must start
>>>>> over, because the “completion” flag from the CallbackListener and the
>>>>> data actually sent by the XMI consumer are not synchronised, i.e. I don’t
>>>>> know if the data has actually been sent for a document that has completed
>>>>> processing, since everything is done batch-wise and not immediately, for
>>>>> performance reasons. I also cannot just look into the database to see
>>>>> which XMI data is there, because it only gets sent when a condition is
>>>>> met.
>>>>> I would like the consumer and the CallbackListener to somehow communicate
>>>>> so that they send their data for the same documents in agreement. Is
>>>>> there anything I can do to achieve this?
>>>>> Best,
>>>>> Erik
