uima-user mailing list archives

From Marshall Schor <...@schor.com>
Subject Re: Best approach for analyzing a set of documents
Date Thu, 03 Oct 2013 13:57:28 GMT

On 10/3/2013 1:14 AM, ThanhDK wrote:
> Hi all,
> I am new to UIMA and from what I see, the concept of an AE is very
> single-document centric. My question is, from the UIMA point of view, what is
> the standard way to write an analysis component whose input is a set of
> documents? For instance, a clustering engine that groups similar documents
> into the same basket, or a trending-topic detector that detects new topics
> from a set of documents.
> I had a look at the CPE before, but it looks to me like just an iterator that
> collects documents one by one, sends them through the AEs and collects the output.

A bit of history may be helpful.

In the beginning, UIMA had Collection Readers and Cas Consumers.  These were
conceptually intended to go at the beginning and end of pipelines.  The
Collection Readers would read "work-items" (e.g., documents - but UIMA can
process things other than documents, for instance, video clips, etc.) and push
those through the pipeline.  And Cas Consumers would do something with the
results of the analysis (e.g., write them to a file, a database, etc.).

Later, UIMA introduced the concept of a CAS Multiplier.  This generalized the
Collection Reader a bit, allowing it to be anywhere in a pipeline, not just at
the beginning.

Later, it became clear that the Collection Reader and Cas Consumer were just
parameterizations of normal Analysis Engines, so they were replaced by those. 
The older classes still work, though.

So the current way to do what you're asking is to use an Analysis Engine specified
as a Cas Multiplier to generate the CASes flowing in the pipeline, and to use an
Analysis Engine set up like a Cas Consumer.  For instance, specify in the
<operationalProperties> element of its descriptor that multipleDeploymentAllowed
is false, so that all the CASes flow into this one instance, if that's what's
needed.
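
To make the shape of that flow concrete, here is a minimal sketch in plain Java.
It deliberately does not use the UIMA API: WorkItem, CollectionSplitter,
FirstWordClusterer and SetOfDocumentsSketch are hypothetical names invented for
this illustration, and grouping by first word is only a toy stand-in for real
clustering.  The one borrowed name is collectionProcessComplete, which mirrors
UIMA's end-of-collection callback.  The point is just that one component emits
the work items and a single downstream instance sees all of them, keeps state
across them, and produces its result once the collection is done.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a CAS: one work item flowing through the pipeline.
final class WorkItem {
    final String id;
    final String text;

    WorkItem(String id, String text) {
        this.id = id;
        this.text = text;
    }
}

// Plays the role of the CAS Multiplier: one input (the collection) becomes many work items.
final class CollectionSplitter {
    Iterator<WorkItem> split(Map<String, String> collection) {
        List<WorkItem> items = new ArrayList<>();
        for (Map.Entry<String, String> entry : collection.entrySet()) {
            items.add(new WorkItem(entry.getKey(), entry.getValue()));
        }
        return items.iterator();
    }
}

// Plays the role of the single consumer-style engine: it keeps state across work items.
final class FirstWordClusterer {
    private final Map<String, List<String>> baskets = new TreeMap<>();

    // Called once per work item; only accumulates, produces no output yet.
    void process(WorkItem item) {
        String trimmed = item.text.trim().toLowerCase(Locale.ROOT);
        String key = trimmed.isEmpty() ? "" : trimmed.split("\\s+")[0];
        baskets.computeIfAbsent(key, k -> new ArrayList<>()).add(item.id);
    }

    // Called once after the last work item; this is where the set-level result appears.
    Map<String, List<String>> collectionProcessComplete() {
        return baskets;
    }
}

final class SetOfDocumentsSketch {
    static Map<String, List<String>> run(Map<String, String> collection) {
        CollectionSplitter splitter = new CollectionSplitter();
        // Exactly one instance, so every work item reaches the same accumulated state.
        FirstWordClusterer clusterer = new FirstWordClusterer();
        Iterator<WorkItem> items = splitter.split(collection);
        while (items.hasNext()) {
            clusterer.process(items.next());
        }
        return clusterer.collectionProcessComplete();
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("d1", "UIMA pipelines scale out");
        docs.put("d2", "Clustering needs every document");
        docs.put("d3", "uima descriptors are XML");
        System.out.println(run(docs));
    }
}
```

If a second FirstWordClusterer were created and the work items split between the
two, neither would hold the complete set of baskets.  That is the situation that
setting multipleDeploymentAllowed to false is meant to avoid.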

This approach enables the same pipeline to be run on a laptop for testing, and
then scaled up (e.g., using UIMA-AS) to a big cluster of machines (for processing
very large document collections).  The CPE was a first implementation of
scaleout; the current, more flexible and powerful version is UIMA-AS.

> Regards
