uima-user mailing list archives

From Richard Eckart de Castilho <richard.eck...@gmail.com>
Subject Re: Using uima pipeline as an API
Date Thu, 18 Jul 2013 07:54:54 GMT
> I have this particular requirement for an API that we wrap over a UIMA 
> pipeline.
> public List<String> analyse(String inputFolderPath, String modelName);
> This method is supposed to accept a collection of files (residing in the 
> inputFolderPath), run the files (as CAS) through a pipeline of UIMA AEs, and 
> return the results (one String per CAS).
> To return the strings, I will need to somehow access the CAS after the AEs 
> have finished their job and transform/extract whatever is inside the CAS into 
> the string that I will return to the caller of this method.
> But if I run the AEs using SimplePipeline.runPipeline(), 
> how can I get hold of the CASes that are coming out of the AEs?
> Do I attach a CAS Consumer at the tail of the pipeline and read the CAS 
> contents at that point? Then I append each result to the List<String> that I 
> constructed at the beginning.

You should take a look at JCasIterable (cf. [1]; the example there is in
Groovy, but JCasIterable is a Java class and works just as well from Java,
I just don't have a Java example at hand).

JCasIterable basically allows you to iterate over the CASes produced by your
pipeline. In such a loop, you can extract and collect the data you need from
the CASes, e.g. putting it into a List<String> and returning it. Mind that you
should *not* try to keep hold of full CASes or FeatureStructures (including
Annotations and the like) beyond the current iteration. You need to copy the
data out of the CAS, otherwise it will be corrupted, because the CAS is
typically reset and reused for the next document.
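
To make the "copy, don't keep references" point concrete, here is a minimal
sketch in plain Java. It deliberately does not use any UIMA or uimaFIT
classes: ReusedDoc, ReusingIterable and CopyOutDemo are hypothetical
stand-ins I made up for illustration, assuming an iterable that hands out one
shared, mutable container per iteration (which is how a reused CAS behaves).

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a CAS: a single mutable container that the
// pipeline resets and refills for every document. Not a UIMA class.
final class ReusedDoc {
    private final StringBuilder text = new StringBuilder();

    void reset(String newText) {
        text.setLength(0);
        text.append(newText);
    }

    // toString() on a StringBuilder yields an independent, immutable copy.
    String getText() {
        return text.toString();
    }
}

// Hypothetical stand-in for the iterable: every call to next() hands out the
// SAME ReusedDoc instance after overwriting its content.
final class ReusingIterable implements Iterable<ReusedDoc> {
    private final List<String> inputs;
    private final ReusedDoc shared = new ReusedDoc();

    ReusingIterable(List<String> inputs) {
        this.inputs = inputs;
    }

    @Override
    public Iterator<ReusedDoc> iterator() {
        final Iterator<String> it = inputs.iterator();
        return new Iterator<ReusedDoc>() {
            @Override
            public boolean hasNext() {
                return it.hasNext();
            }

            @Override
            public ReusedDoc next() {
                shared.reset(it.next());
                return shared;
            }
        };
    }
}

final class CopyOutDemo {
    // Safe: copy plain data out of the container inside the loop.
    static List<String> collectCopies(Iterable<ReusedDoc> docs) {
        List<String> results = new ArrayList<>();
        for (ReusedDoc doc : docs) {
            results.add(doc.getText());
        }
        return results;
    }

    // Unsafe: every list entry aliases the one shared container, so after the
    // loop they all show the content of the last document.
    static List<ReusedDoc> collectReferences(Iterable<ReusedDoc> docs) {
        List<ReusedDoc> results = new ArrayList<>();
        for (ReusedDoc doc : docs) {
            results.add(doc);
        }
        return results;
    }
}
```

In your analyse() method the loop body would play the role of
collectCopies(): build the result String from the current CAS and add only
that String to the list.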

> If so, is this scalable? 

Well… up to a point, but not in general.

> If I have thousands of files in the inputFolderPath, and if the strings are 
> very large, would I run out of memory soon?
> Is there a more scalable way to do this?

You could write your strings to a file and then return an implementation of 
List<String> which directly accesses the file. Depending on how much you want
to scale, you'll have to look into different solutions. The easiest would be
to buy more memory, the most complex would probably be porting your stuff to
some kind of cluster. The latter will most likely require a change of API,
possibly even of the whole processing paradigm. List<String> most probably
won't do then ;)
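
A rough sketch of what such a file-backed List<String> could look like, using
only the JDK. The class name FileBackedStringList and the record layout (a
4-byte length followed by UTF-8 bytes) are my own assumptions, not something
UIMA or uimaFIT provides. Only the offsets are kept in memory, so the strings
themselves can be large.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.AbstractList;
import java.util.ArrayList;
import java.util.List;

// Hypothetical append-only List<String> that stores its elements in a file.
// Not thread-safe; the caller has to close() it when done.
final class FileBackedStringList extends AbstractList<String> implements AutoCloseable {
    private final RandomAccessFile file;
    // One file offset per element; this is the only per-element data in memory.
    private final List<Long> offsets = new ArrayList<>();

    FileBackedStringList(Path path) throws IOException {
        this.file = new RandomAccessFile(path.toFile(), "rw");
        this.file.setLength(0); // start from an empty store
    }

    @Override
    public boolean add(String value) {
        try {
            byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
            long position = file.length();
            file.seek(position);          // always append at the end
            file.writeInt(bytes.length);  // length prefix, so strings may contain newlines
            file.write(bytes);
            offsets.add(position);
            modCount++;                   // keep iterators fail-fast
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public String get(int index) {
        long position = offsets.get(index); // throws IndexOutOfBoundsException for bad indexes
        try {
            file.seek(position);
            byte[] bytes = new byte[file.readInt()];
            file.readFully(bytes);
            return new String(bytes, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public int size() {
        return offsets.size();
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}
```

With something like this, analyse() could add each result as soon as it has
been extracted from the current CAS and return the list at the end. Each
get() then costs a disk read, and somebody has to close the list and delete
the file afterwards.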


-- Richard

[1] http://code.google.com/p/dkpro-core-asl/wiki/GroovyRecipies#OpenNLP_Part-of-speech_tagging_pipeline_using_JCasIterable_and_c