uima-user mailing list archives

From "Fox, David" <david....@optum.com>
Subject Re: UIMA analysis from a database
Date Fri, 15 Sep 2017 18:54:36 GMT
Another thanks to all contributing to this thread.

We're looking to transition a large NLP application processing ~30TB/month
from a custom NLP framework to UIMA-AS, and from parallel processing on a
dedicated cluster with custom Python scripts that call GNU parallel, to
something with better support for managing resources on a shared cluster.

Both our internal IT/engineering group and our cluster vendor
(HortonWorks) use and support Hadoop/Spark/YARN on a new shared cluster.
DUCC's capabilities seem to overlap with these more general-purpose tools.
Although it may be more closely aligned with UIMA for a dedicated
cluster, I think the big question for us would be how/whether it would
play nicely with other Hadoop/Spark/YARN jobs on the shared cluster.
We're also likely to move at least some of our workload to a cloud
computing host, and it seems like Hadoop/Spark are much more likely to be
supported there.

David Fox

On 9/15/17, 1:57 PM, "Eddie Epstein" <eaepstein@gmail.com> wrote:

>There are a few DUCC features that might be of particular interest for
>scaling out UIMA analytics.
> - all user code for batch processing continues to use the existing UIMA
>component model: collection readers, CAS multipliers, analysis engines, and
>CAS consumers.**
> - DUCC supports assembling and debugging a single threaded process with
>these components, and then with no code change launch a highly scaled out
> - for applications that use too much RAM to be able to utilize all the
>cores on worker machines, DUCC can do the vertical (thread) scaleout
>to share memory.
> - DUCC automatically captures the performance breakdown of the UIMA-based
>processes, as well as capturing process statistics including CPU, RAM,
>swap, pagefaults and GC. Performance breakdown info for individual tasks
>(DUCC work items) can optionally be captured.
> - DUCC has extensive error handling, automatically resubmitting work
>associated with uncaught exceptions, process crashes, machine failures,
>network failures, etc.
> - Exceptions are convenient to get to, and an attempt is made to make
>obvious things that might be tricky to find, such as all the reasons a
>might fail to start, without having to dig through DUCC framework logs.
>** DUCC services introduce a new user programmable component, a service
>pinger, that is responsible for validating that a service is operating
>correctly. The service pinger can also dynamically change the number of
>instances of a service, and it can restart individual instances that are
>determined to be acting badly.
>On Fri, Sep 15, 2017 at 10:32 AM, Osborne, John D <josborne@uabmc.edu>
>> Thanks Richard and Nicholas,
>> Nicholas - have you looked at SUIM (https://github.com/oaqa/suim) ?
>> It's also doing UIMA on Spark - I'm wondering if you are aware of it and
>> how it is different from your own project?
>> Thanks for any info,
>>  -John
>> ________________________________________
>> From: Richard Eckart de Castilho [rec@apache.org]
>> Sent: Friday, September 15, 2017 5:29 AM
>> To: user@uima.apache.org
>> Subject: Re: UIMA analysis from a database
>> On 15.09.2017, at 09:28, Nicolas Paris <niparisco@gmail.com> wrote:
>> >
>> > - UIMA-AS is another way to program UIMA
>> Here you probably meant uimaFIT.
>> > - UIMA-FIT is complicated
>> > - UIMA-FIT only work with UIMA
>> ... and I suppose you mean UIMA-AS here.
>> > - UIMA only focuses on text Annotation
>> Yep. Although it has also been used for other media, e.g. video and
>> But the core UIMA framework doesn't specifically consider these media.
>> People who apply UIMA in the context of other media do so with custom
>> type systems.
>> > - UIMA is not good at:
>> >       - text transformation
>> It is not straightforward but possible. E.g. the text normalizers in
>> DKPro Core make use of either different views for different states of
>> normalization or drop the original text and forward the normalized
>> text within a pipeline by means of a CAS multiplier.
>> >       - read data from source in parallel
>> >       - write data to folder in parallel
>> Not sure if these two are limitations of the framework
>> rather than of the way that you use readers and writers
>> in the particular scale-out mode you are working with.
>> >       - machine learning interface
>> UIMA doesn't offer ML as part of the core framework because
>> that is simply not within the scope of what the UIMA framework
>> aims to achieve.
>> There are various people who have built ML around UIMA, e.g.
>> ClearTK (http://cleartk.github.io/cleartk/) or DKPro TC
>> (https://urldefense.proofpoint.com/v2/url?u=https-
>> L1P4GfuME4SQleRf9q_7Ll9siim5W0c&s=kye5D2izwKE_9V2QQW8leiKp0p-91U-
>> CFwXJMFmCd3w&e= ) - and as you did, it
>> can be combined in various ways with ML frameworks that
>> specialize specifically on ML.
>> Cheers,
>> -- Richard

