uima-user mailing list archives

From Eddie Epstein <eaepst...@gmail.com>
Subject Re: UIMA analysis from a database
Date Fri, 15 Sep 2017 20:12:15 GMT
DUCC does have hooks to allow entire machines to be dynamically added or
removed from a running DUCC cluster. So in principle DUCC could be run as
an application under a different resource manager as long as resources were
available at the machine level. It should also be possible to run other
infrastructures under DUCC, for example where a Hadoop/Spark subcluster is
turned on and off as required.

One aspect of DUCC that is not cloud friendly has been its dependency on a
shared filesystem. Recent work has removed this requirement, and the latest
release can run without a shared FS, but some useful functionality is then
unavailable: specifically, facilitating the distribution of user application
code to worker machines, and automatically retrieving user logfiles written
to local disk for display in the DUCC web console.

regards,
Eddie

On Fri, Sep 15, 2017 at 2:54 PM, Fox, David <david.fox@optum.com> wrote:

> Another thanks to all contributing to this thread.
>
> We're looking to transition a large NLP application processing ~30TB/month
> from a custom NLP framework to UIMA-AS, and from parallel processing on a
> dedicated cluster with custom Python scripts that call GNU parallel, to
> something with better support for managing resources on a shared cluster.
>
> Both our internal IT/engineering group and our cluster vendor
> (HortonWorks) use and support Hadoop/Spark/YARN on a new shared cluster.
> DUCC's capabilities seem to overlap with these more general purpose tools.
> Although it may be more closely aligned with UIMA for a dedicated
> cluster, I think the big question for us would be how/whether it would
> play nicely with other Hadoop/Spark/YARN jobs on the shared cluster.
> We're also likely to move at least some of our workload to a cloud
> computing host, and it seems like Hadoop/Spark are much more likely to be
> supported there.
>
> David Fox
>
> On 9/15/17, 1:57 PM, "Eddie Epstein" <eaepstein@gmail.com> wrote:
>
> >There are a few DUCC features that might be of particular interest for
> >scaling out UIMA analytics.
> >
> > - all user code for batch processing continues to use the existing UIMA
> >component model: collection readers, CAS multipliers, analysis engines, and
> >CAS consumers.**
> >
> > - DUCC supports assembling and debugging a single threaded process with
> >these components, and then, with no code change, launching a highly
> >scaled-out deployment.
> >
> > - for applications that use too much RAM to be able to utilize all the
> >cores on worker machines, DUCC can do the vertical (thread) scaleout
> >needed to share memory.
> >
> > - DUCC automatically captures the performance breakdown of the UIMA-based
> >processes, as well as capturing process statistics including CPU, RAM,
> >swap, pagefaults and GC. Performance breakdown info for individual tasks
> >(DUCC work items) can optionally be captured.
> >
> > - DUCC has extensive error handling, automatically resubmitting work
> >associated with uncaught exceptions, process crashes, machine failures,
> >network failures, etc.
> >
> > - Exceptions are convenient to get to, and an attempt is made to make
> >obvious things that might be tricky to find, such as all the reasons a
> >process might fail to start, without having to dig through DUCC framework
> >logs.
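As a rough illustration of the vertical (thread) scaleout point in the list above: when each worker needs a large in-memory resource, N separate processes hold N copies of it, whereas N threads inside one process can share a single copy. The sketch below is plain Python, not DUCC or UIMA code; `load_model`, `annotate`, `process_documents`, and the thread count are hypothetical stand-ins chosen only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def load_model():
    """Stand-in for an expensive, memory-hungry resource (e.g. a large dictionary)."""
    return {"aspirin": "DRUG", "fever": "SYMPTOM"}


def annotate(model, text):
    """Tag every token that appears in the shared, read-only model."""
    return [(tok, model[tok]) for tok in text.lower().split() if tok in model]


def process_documents(documents, threads=4):
    model = load_model()  # loaded once per process, shared by all threads
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # Executor.map() returns results in input order
        return list(pool.map(lambda doc: annotate(model, doc), documents))


results = process_documents(["Aspirin reduces fever", "No findings"])
```

The sketch only illustrates the memory-sharing idea. In CPython the global interpreter lock limits CPU parallelism for pure-Python code, whereas Java threads, as used by UIMA, can run on separate cores.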
> >
> >** DUCC services introduce a new user programmable component, a service
> >pinger, that is responsible for validating that a service is operating
> >correctly. The service pinger can also dynamically change the number of
> >instances of a service, and it can restart individual instances that are
> >determined to be acting badly.
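The three responsibilities of a service pinger described above (validating the service, adjusting the number of instances, and restarting misbehaving instances) could be pictured roughly as follows. This is a conceptual Python sketch under stated assumptions, not the DUCC pinger API: `PingDecision`, `evaluate_service`, the latency threshold, and the sizing rule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PingDecision:
    healthy: bool            # is the service as a whole usable?
    desired_instances: int   # how many instances should be running
    restart: list            # ids of instances judged to be acting badly


def evaluate_service(instance_latencies_ms, queue_depth,
                     max_latency_ms=1000, items_per_instance=100):
    """Decide health, target instance count, and which instances to restart.

    instance_latencies_ms maps an instance id to its last response time in
    milliseconds, or None if the instance did not answer the ping.
    """
    restart = sorted(i for i, ms in instance_latencies_ms.items()
                     if ms is None or ms > max_latency_ms)
    # The service counts as healthy while at least one instance responds in time.
    healthy = len(restart) < len(instance_latencies_ms)
    # Ceiling division on the backlog; always keep at least one instance.
    desired = max(1, -(-queue_depth // items_per_instance))
    return PingDecision(healthy, desired, restart)


decision = evaluate_service({"a": 120, "b": None, "c": 2500}, queue_depth=250)
```

Here instances `b` (no answer) and `c` (too slow) are flagged for restart, and the assumed sizing rule asks for three instances to cover a backlog of 250 items.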
> >
> >Eddie
> >
> >On Fri, Sep 15, 2017 at 10:32 AM, Osborne, John D <josborne@uabmc.edu>
> >wrote:
> >
> >> Thanks Richard and Nicholas,
> >>
> >> Nicholas - have you looked at SUIM (https://github.com/oaqa/suim) ?
> >>
> >> It's also doing UIMA on Spark - I'm wondering if you are aware of it and
> >> how it is different from your own project?
> >>
> >> Thanks for any info,
> >>
> >>  -John
> >>
> >>
> >> ________________________________________
> >> From: Richard Eckart de Castilho [rec@apache.org]
> >> Sent: Friday, September 15, 2017 5:29 AM
> >> To: user@uima.apache.org
> >> Subject: Re: UIMA analysis from a database
> >>
> >> On 15.09.2017, at 09:28, Nicolas Paris <niparisco@gmail.com> wrote:
> >> >
> >> > - UIMA-AS is another way to program UIMA
> >>
> >> Here you probably meant uimaFIT.
> >>
> >> > - UIMA-FIT is complicated
> >> > - UIMA-FIT only work with UIMA
> >>
> >> ... and I suppose you mean UIMA-AS here.
> >>
> >> > - UIMA only focuses on text Annotation
> >>
> >> Yep. Although it has also been used for other media, e.g. video and
> >> audio. But the core UIMA framework doesn't specifically consider these
> >> media. People who apply UIMA in the context of other media do so with
> >> custom type systems.
> >>
> >> > - UIMA is not good at:
> >> >       - text transformation
> >>
> >> It is not straightforward but possible. E.g. the text normalizers in
> >> DKPro Core make use of either different views for different states of
> >> normalization or drop the original text and forward the normalized
> >> text within a pipeline by means of a CAS multiplier.
> >>
> >> >       - read data from source in parallel
> >> >       - write data to folder in parallel
> >>
> >> Not sure if these two are limitations of the framework
> >> rather than of the way that you use readers and writers
> >> in the particular scale-out mode you are working with.
> >>
> >> >       - machine learning interface
> >>
> >> UIMA doesn't offer ML as part of the core framework because
> >> that is simply not within the scope of what the UIMA framework
> >> aims to achieve.
> >>
> >> There are various people who have built ML around UIMA, e.g.
> >> ClearTK (http://cleartk.github.io/cleartk/) or DKPro TC
> >> (https://dkpro.github.io/dkpro-tc/) - and as you did, it
> >> can be combined in various ways with frameworks that
> >> specialize in ML.
> >>
> >>
> >> Cheers,
> >>
> >> -- Richard
> >>
> >>
> >>
>
