uima-user mailing list archives

From Nicolas Paris <nipari...@gmail.com>
Subject Re: UIMA analysis from a database
Date Thu, 14 Sep 2017 22:32:56 GMT
Hi Benedict

Not sure this will help, but here is some advice:
I recommend using UIMA for what it was primarily intended for: NLP pipelines.

For multithreaded applications, I would go for dedicated
technologies.

I have been successfully using UIMA together with Apache Spark. While this
design works well on a single computer, I am now also able to distribute a
UIMA pipeline over dozens of them with no extra work.

So I focus on making each UIMA pipeline do its job well and, after testing,
industrialize it on Spark.

Advantages of this design are:
- benefit from Spark's expertise in distribution (node failure, memory
  consumption, data partitioning...)
- simplify UIMA programming (no multithreading inside, only NLP code)
- scale when needed (add more cheap computers, get better performance)
- gain expertise with Spark, and use it with any Java code you like
- Spark has JDBC connectors and may be able to fetch the data in
  parallel easily.
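
To make the division of labour concrete, here is a minimal sketch in plain
Java. It is not the code from my repo and it uses neither Spark nor UIMA:
an ExecutorService stands in for Spark's workers, and SentenceSplitter is a
hypothetical stand-in for a single-threaded UIMA pipeline. All names in it
are made up for illustration. The only point is that the NLP code contains
no threading at all, and that each partition gets its own pipeline instance.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PartitionedPipelineSketch {

    /** Hypothetical stand-in for a single-threaded NLP pipeline (not a UIMA class). */
    static final class SentenceSplitter {
        List<String> process(String document) {
            List<String> sentences = new ArrayList<>();
            // Naive split after '.', '!' or '?' followed by whitespace.
            for (String s : document.split("(?<=[.!?])\\s+")) {
                if (!s.isEmpty()) {
                    sentences.add(s);
                }
            }
            return sentences;
        }
    }

    /** Splits rows into at most `parts` contiguous chunks (parts must be > 0). */
    static <T> List<List<T>> partition(List<T> rows, int parts) {
        List<List<T>> chunks = new ArrayList<>();
        int size = (rows.size() + parts - 1) / parts; // ceiling division
        for (int i = 0; i < rows.size(); i += size) {
            chunks.add(rows.subList(i, Math.min(i + size, rows.size())));
        }
        return chunks;
    }

    /** Processes every document; results come back in input order. */
    static List<List<String>> processAll(List<String> documents, int workers)
            throws Exception {
        // Assumption: a local thread pool plays the role of the Spark cluster.
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<List<List<String>>>> futures = new ArrayList<>();
            for (List<String> chunk : partition(documents, workers)) {
                futures.add(pool.submit(() -> {
                    // One pipeline instance per partition, never shared across threads.
                    SentenceSplitter pipeline = new SentenceSplitter();
                    List<List<String>> results = new ArrayList<>();
                    for (String doc : chunk) {
                        results.add(pipeline.process(doc));
                    }
                    return results;
                }));
            }
            List<List<String>> all = new ArrayList<>();
            for (Future<List<List<String>>> f : futures) {
                all.addAll(f.get()); // futures are in chunk order
            }
            return all;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> rows = Arrays.asList(
                "First article. It has two sentences.",
                "Second article!",
                "Third one. Is it short? Yes.");
        for (List<String> sentences : processAll(rows, 2)) {
            System.out.println(sentences);
        }
    }
}
```

With Spark, the loop over partitions and the thread pool go away: the
framework partitions the rows and calls your per-partition code, which is
where the pipeline would be created.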

You can find a working example in my repo https://github.com/parisni/UimaOnSpark
It was not simple to get working, but I can now say that this
method is robust and optimized.


On 14 Sept. 2017 at 21:24, Benedict Holland wrote:
> Hello everyone,
> 
> I am trying to get my project off the ground and hit a small problem.
> 
> I want to read text from a large database (let's say, 100,000+ rows). Each
> row will have a text article. I want to connect to the database, request a
> single row from the database, and process this document through an NLP
> engine, and I want to do this in parallel. Each document will be, say, split
> up into sentences and each sentence will be POS tagged.
> 
> After reading the documentation, I am more confused than when I started. I
> think I want something like the FileSystemCollectionReader example and
> build a CPE. Instead of reading from the file system, it will read from the
> database.
> 
> There are two problems with this approach:
> 
> 1. I am not sure it is multi threaded: CAS initializers are deprecated and
> it appears that the getNext() method will only run in a single thread.
> 2. The FileSystemCollectionReader loads references to the file location
> into memory but not the text itself.
> 
> For problem 1, the line I find very troubling is
> 
> File file = (File) mFiles.get(mCurrentIndex++);
> 
> I have to assume from this line that the CollectionReader_ImplBase is not
> multi-threaded but is intended to rapidly iterate over a set of documents
> in a single thread.
> 
> Problem 2 is easily solved as I can create a massive array of integers if I
> feel like it.
> 
> Anyway, after deciding that this is not likely the solution, I looked into
> Multi-view Sofa annotators. I don't think these do what I want either. In
> this context, I would treat the database table as a single object with many
> "views" being chunks of rows. I don't think this works, based on the
> SofaExampleAnnotator code provided. It also appears to run in a single
> thread.
> 
> This leaves me with CAS pools. I know that this is going to be
> multi-threaded. I believe I create however many CAS objects from the
> annotator I want, probably an aggregate annotator. Is this correct and am I
> on the right track with CAS Pools?
> 
> Thank you so much,
> ~Ben
