uima-user mailing list archives

From Benedict Holland <benedict.m.holl...@gmail.com>
Subject UIMA analysis from a database
Date Thu, 14 Sep 2017 19:24:31 GMT
Hello everyone,

I am trying to get my project off the ground and hit a small problem.

I want to read text from a large database (let's say 100,000+ rows). Each
row will have a text article. I want to connect to the database, request a
single row, and process that document through an NLP engine, and I want to
do this in parallel. Each document will be, say, split up into sentences
and each sentence will be POS tagged.
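
To make the per-document step concrete, here is a minimal plain-Java sketch of
the sentence-splitting half only. It uses the JDK's java.text.BreakIterator as
a stand-in for an NLP engine; the class name SentenceSplitDemo and the method
splitSentences are hypothetical, and POS tagging is left out because it needs
an external library.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceSplitDemo {

    // Splits one document into trimmed sentences using the JDK's
    // locale-aware sentence boundary rules (not an NLP engine).
    public static List<String> splitSentences(String document) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(document);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = document.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Each element would then be handed to a POS tagger.
        for (String s : splitSentences("Hello world. How are you?")) {
            System.out.println(s);
        }
    }
}
```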

After reading the documentation, I am more confused than when I started. I
think I want to start from something like the FileSystemCollectionReader
example and build a CPE. Instead of reading from the file system, it will
read from the database.

There are two problems with this approach:

1. I am not sure it is multi-threaded: CAS initializers are deprecated and
it appears that the getNext() method will only run in a single thread.
2. The FileSystemCollectionReader loads references to the file location
into memory but not the text itself.

For problem 1, the line I find very troubling is

File file = (File) mFiles.get(mCurrentIndex++);

I have to assume from this line that the CollectionReader_ImplBase is not
multi-threaded but is intended to rapidly iterate over a set of documents
in a single thread.
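
For illustration, a reader that hands out items from a single thread does not
by itself force the processing to be single-threaded. The following is a
plain-Java sketch of that pattern, not the UIMA API: RowIdReader, process, and
runAll are hypothetical names, and the "processing" is a placeholder.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SingleReaderManyWorkers {

    // Hypothetical reader: hands out one row id per call, using the same
    // index++ pattern as the example. Only ever called from one thread.
    static final class RowIdReader {
        private final List<Integer> rowIds;
        private int currentIndex = 0;

        RowIdReader(List<Integer> rowIds) { this.rowIds = rowIds; }

        boolean hasNext() { return currentIndex < rowIds.size(); }

        int getNext() { return rowIds.get(currentIndex++); }
    }

    // Placeholder for fetching the row and running the analysis on it.
    static String process(int rowId) {
        return "processed-" + rowId;
    }

    // One thread iterates the reader; the work runs on a pool of threads.
    public static List<String> runAll(List<Integer> rowIds, int threads) {
        RowIdReader reader = new RowIdReader(rowIds);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            while (reader.hasNext()) {
                final int rowId = reader.getNext();
                Callable<String> task = () -> process(rowId);
                futures.add(pool.submit(task));
            }
            // Collect results in submission order.
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                results.add(f.get());
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(runAll(List.of(1, 2, 3, 4), 2));
    }
}
```

Because the reader is touched by one thread only, the unsynchronized index
increment is safe here; only the per-row work is spread across threads.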

Problem 2 is easily solved, as I can create a massive array of integers if
I feel like it.

Anyway, after deciding that this is not likely the solution, I looked into
Multi-view Sofa annotators. I don't think these do what I want either. In
this context, I would treat the database table as a single object with many
"views" being chunks of rows. I don't think this works, based on the
SofaExampleAnnotator code provided. It also appears to run in a single
thread.

This leaves me with CAS pools. I know that this is going to be
multi-threaded. I believe I can create as many CAS objects as I want from
the annotator, probably an aggregate annotator. Is this correct, and am I
on the right track with CAS pools?
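
As a rough picture of the pooling idea in general, here is a plain-Java sketch
of a fixed-size pool of reusable per-document containers that worker threads
borrow and return. It is a stand-in for the concept, not UIMA's CAS pool API:
ReusablePool, Workspace, borrow, release, and countWords are all hypothetical
names.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ReusablePool {

    // Hypothetical stand-in for a CAS: holds one document at a time.
    static final class Workspace {
        private final StringBuilder text = new StringBuilder();

        void setText(String s) { text.setLength(0); text.append(s); }

        String getText() { return text.toString(); }

        void reset() { text.setLength(0); }
    }

    private final BlockingQueue<Workspace> free;

    public ReusablePool(int size) {
        free = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            free.add(new Workspace());
        }
    }

    // Blocks until a workspace is free, which caps concurrent documents.
    Workspace borrow() {
        try {
            return free.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    // Clears the workspace and makes it available to other threads.
    void release(Workspace w) {
        w.reset();
        free.add(w);
    }

    public int available() { return free.size(); }

    // Placeholder analysis: counts whitespace-separated tokens.
    public static int countWords(ReusablePool pool, String document) {
        Workspace w = pool.borrow();
        try {
            w.setText(document);
            String t = w.getText().trim();
            return t.isEmpty() ? 0 : t.split("\\s+").length;
        } finally {
            pool.release(w); // always return the workspace, even on failure
        }
    }

    public static void main(String[] args) {
        ReusablePool pool = new ReusablePool(2);
        System.out.println(countWords(pool, "one two three"));
    }
}
```

The pool size bounds how many documents are held in memory at once, which is
the main reason to reuse containers rather than allocate one per row.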

Thank you so much,
