From Eddie Epstein <eaepst...@gmail.com>
Subject Re: UIMA DUCC slow processing
Date Fri, 12 Jun 2020 21:18:09 GMT

In this simple scenario there is a CollectionReader running in a JobDriver
process, delivering 100K workitems to multiple remote JobProcesses. The
processing time is essentially zero, so the elapsed time is almost entirely
overhead: (30 * 60 seconds) / 100,000 workitems = 18 milliseconds per
workitem. This is roughly the expected overhead of a DUCC JobDriver
delivering workitems to remote JobProcesses and recording the results. DUCC
jobs are much more efficient if the overhead per workitem is much smaller
than the processing time.
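
As a rough sketch of the arithmetic above, the snippet below recomputes the
per-workitem overhead from the figures in this thread (about 30 minutes for
100,000 workitems) and shows how the useful fraction of time changes with
processing time. The efficiency() helper and the sample processing times are
hypothetical, chosen only for illustration; they are not DUCC measurements.

```python
# Figures taken from this thread: ~30 minutes elapsed for 100,000 workitems.
elapsed_seconds = 30 * 60
workitems = 100_000

# Elapsed time per workitem, in milliseconds. With near-zero processing
# time, this is essentially all overhead.
overhead_ms = elapsed_seconds / workitems * 1000
print(f"per-workitem overhead: {overhead_ms:.0f} ms")


def efficiency(processing_ms, overhead_ms):
    """Fraction of per-workitem time spent on real processing.

    Hypothetical helper: assumes overhead and processing simply add up.
    """
    return processing_ms / (processing_ms + overhead_ms)


# Hypothetical processing times per workitem, for illustration only.
for processing_ms in (0, 18, 180, 1800):
    share = efficiency(processing_ms, overhead_ms)
    print(f"processing {processing_ms:>4} ms -> {share:.1%} useful work")
```

Under this simple additive assumption, the overhead stops dominating only
once the processing time per workitem is many times larger than it.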

Typically DUCC jobs process much larger blocks of content per workitem.
For example, if a workitem were a document, and the document were parsed
into small CASes by the CasMultiplier, the throughput would be much better.
However, even in that case, as the number of working JobProcess threads is
scaled up, the CR (JobDriver) would become a bottleneck. That's why a
typical DUCC job does not send the document content as a workitem, but
rather sends a reference to the content and has the CasMultipliers in the
JobProcesses read the content directly from the source.
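
To illustrate why the JobDriver eventually becomes the limit, here is a
minimal model, not a description of DUCC internals. It assumes the JobDriver
handles workitems one at a time at a fixed cost (the 18 ms estimated above)
and that each JobProcess thread needs a fixed, hypothetical processing time
per workitem. The function name and all numbers other than the 18 ms are
made up for illustration.

```python
def job_throughput(threads, processing_s, driver_overhead_s):
    """Workitems per second under a simple two-stage model.

    Assumption: throughput is the lesser of what the worker threads can
    process and what a single JobDriver can deliver and record.
    """
    worker_capacity = threads / processing_s
    driver_capacity = 1.0 / driver_overhead_s
    return min(worker_capacity, driver_capacity)


DRIVER_OVERHEAD_S = 0.018  # per-workitem overhead estimated in this thread
PROCESSING_S = 1.0         # hypothetical processing time per workitem

for threads in (8, 16, 32, 64, 128):
    rate = job_throughput(threads, PROCESSING_S, DRIVER_OVERHEAD_S)
    print(f"{threads:>3} threads -> {rate:6.1f} workitems/s")
```

In this model throughput grows with the thread count only until the
JobDriver limit of 1 / 0.018 workitems per second is reached. Sending full
document content through the JobDriver would presumably raise the
per-workitem cost and lower that limit, which is the reason for sending
references instead.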

Even though having the JobProcesses read the content is much more
efficient, as scaleout continues to increase for this non-computation
scenario the bottleneck would eventually move to the underlying filesystem,
or to whatever the document source and JobProcess output are. The main
motivation for DUCC was jobs similar to those in the DUCC examples, which
use OpenNLP to process large documents. That is, jobs where CPU processing
is the bottleneck rather than I/O.

Hopefully this helps. If not, happy to continue the discussion.

On Fri, Jun 12, 2020 at 1:16 PM Dr. Raja M. Suleman <
raja.m.sulaiman@gmail.com> wrote:

> Hi,
> Thank you for your reply and I'm sorry I couldn't get back to this
> earlier.
> To get a better picture of the processing speed of DUCC, I made a dummy
> pipeline where the CollectionReader runs a for loop to generate 100k
> workitems (so no disk reads). Each workitem only has a simple string in it.
> These are then passed on to the CasMultiplier where for each workitem I'm
> creating a new CAS with DocumentInfo (again only having a simple string
> value) and pass it as a newcas to the CasConsumer. The CasConsumer doesn't
> do anything except add the Document received in the CAS to the logger. So
> basically this pipeline isn't doing anything, no Input reads and the only
> output is the information added to the logger. Running this on the cluster
> with 2 slave nodes with 8-CPUs and 32GB RAM each is still taking more than
> 30 minutes. I don't understand how this is possible since there's no heavy
> I/O processing happening in the code.
> Any ideas please?
> Thank you.
> On 2020/05/18 12:47:41, Eddie Epstein <eaepstein@gmail.com> wrote:
> > Hi,
> >
> > Removing the AE from the pipeline was a good idea to help isolate the
> > bottleneck. The other two most likely possibilities are the collection
> > reader pulling from Elasticsearch or the CAS consumer writing the
> > processing output.
> >
> > DUCC Jobs are a simple way to scale out compute bottlenecks across a
> > cluster. Scaleout may be of limited or no value for I/O bound jobs.
> > Please give a more complete picture of the processing scenario on DUCC.
> >
> > Regards,
> > Eddie
> >
> >
> > On Sat, May 16, 2020 at 1:29 AM Raja Muhammad Suleman <
> > Sulemanr@edgehill.ac.uk> wrote:
> >
> > > Hi,
> > > I've been trying to run a very small UIMA DUCC cluster with 2 slave nodes
> > > having 32GB of RAM each. I wrote a custom Collection Reader to read data
> > > from an Elasticsearch index and dump it into a new index after certain
> > > analysis engine processing. The Analysis Engine is a simple sentiment
> > > analysis code. The performance I'm getting is very slow as it is only able
> > > to process ~150 documents/minute.
> > > To test the performance without the analysis engine, I removed the AE from
> > > the pipeline but still I did not get any improvement in the processing
> > > speeds. Can you please guide me as to where I might be going wrong or what
> > > I can do to improve the processing speeds?
> > >
> > > Thank you.
> >

