creadur-dev mailing list archives

From Marshall Schor <>
Subject Re: [RAT] Pipelines...
Date Mon, 05 Aug 2013 14:47:49 GMT

On 8/5/2013 10:11 AM, Robert Burrell Donkin wrote:
> Essentially, Rat is simple.
> A source (perhaps a file system or a compressed archive) is walked, producing
> documents. Each document (perhaps a file in a file system, or a resource in
> an archive) flows through a pipeline - a series of processing steps, each
> enriching it with various metadata. An end point collates the data.
> It seems to me that the current code fails to express this.
> ...
> At the moment, IDocumentAnalyser[1] is implemented by most steps in the
> pipeline (and other stuff too), wired together in a potentially flexible
> fashion. This now seems over-engineered to me.
> I think a concrete Pipeline would be more obvious, with controlled extension
> points at each step of the processing.
> Opinions...?
> Objections...?


It may be overkill ( :-) ), but the Apache UIMA project is built around this
very idea: components are assembled into a pipeline, and a shared object (the
CAS, the Common Analysis Structure) is passed to each "annotator" component,
which may add arbitrary metadata to it.
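
To make the shape of the idea concrete, here is a minimal sketch in Python.
It is not the UIMA API and not Rat's current code; Document, guess_type,
detect_license and run_pipeline are hypothetical names invented for this
illustration. Each step receives the same document object and may add
metadata to it, and the end point collates the results.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class Document:
    """One unit of work (e.g. a file). `meta` is the shared structure
    that every step may enrich. Hypothetical type, for illustration only."""
    name: str
    text: str
    meta: Dict[str, str] = field(default_factory=dict)


# A step takes the document and adds metadata to it in place.
Step = Callable[[Document], None]


def guess_type(doc: Document) -> None:
    # Assumed rule for the example: classify by file extension.
    doc.meta["type"] = "java" if doc.name.endswith(".java") else "other"


def detect_license(doc: Document) -> None:
    # Assumed rule for the example: a plain substring match on the text.
    found = "Apache License" in doc.text
    doc.meta["license"] = "AL2" if found else "unknown"


def run_pipeline(docs: Iterable[Document], steps: List[Step]) -> Dict[str, List[str]]:
    """Pass each document through every step in order, then collate
    document names by the license recorded in their metadata."""
    report: Dict[str, List[str]] = {}
    for doc in docs:
        for step in steps:
            step(doc)
        key = doc.meta.get("license", "unknown")
        report.setdefault(key, []).append(doc.name)
    return report


docs = [
    Document("A.java", "Licensed under the Apache License, Version 2.0"),
    Document("notes.txt", "no header here"),
]
report = run_pipeline(docs, [guess_type, detect_license])
print(report)  # {'AL2': ['A.java'], 'unknown': ['notes.txt']}
```

The list of steps is the only extension point in this sketch; the walking of
the source and the collation at the end stay fixed.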

For intro, see the getting started parts of the documentation at

-Marshall Schor

> Robert
> [1]
