uima-user mailing list archives

From Mario Gazzo <mario.ga...@gmail.com>
Subject Re: Using UIMA to build an NLP system
Date Sun, 26 Apr 2015 11:44:41 GMT
Hi Martin,

I agree with Petr. We are in the process of migrating our existing text analysis components
to UIMA, coming from an approach that more closely resembles what you would call just “gluing
things together”. This works well when you are initially just experimenting with rapid prototypes;
in that phase UIMA could even get in the way if you don’t already understand it very well.
However, once you need to scale the dev team and move to production, these ad-hoc approaches
become a problem. A framework like UIMA gives the whole team a systematic development approach,
and once you have climbed the steep learning curve I believe it can also be a faster prototyping
tool, because it makes it easy to quickly combine different components into a new pipeline. An
important factor for us was therefore also the diverse ecosystem of quality analysis components
such as DKPro, cTAKES, ClearTK etc. You can even integrate GATE components into UIMA and vice
versa (see https://gate.ac.uk/sale/tao/splitch22.html#chap:uima), although I haven’t played
with this myself yet.
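To illustrate the “quickly combine different components” point, here is a minimal uimaFIT
sketch of a pipeline assembled from off-the-shelf parts. It assumes DKPro Core’s TextReader
and OpenNLP components are on the classpath; the input path is just a placeholder:

    import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;
    import static org.apache.uima.fit.factory.CollectionReaderFactory.createReaderDescription;

    import org.apache.uima.fit.pipeline.SimplePipeline;

    import de.tudarmstadt.ukp.dkpro.core.io.text.TextReader;
    import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpPosTagger;
    import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpSegmenter;

    public class MiniPipeline {
        public static void main(String[] args) throws Exception {
            SimplePipeline.runPipeline(
                // Read plain-text documents from a directory (placeholder path).
                createReaderDescription(TextReader.class,
                    TextReader.PARAM_SOURCE_LOCATION, "data/input/*.txt",
                    TextReader.PARAM_LANGUAGE, "en"),
                // Swapping these two lines for other compatible components
                // is all it takes to get a different pipeline.
                createEngineDescription(OpenNlpSegmenter.class),
                createEngineDescription(OpenNlpPosTagger.class));
        }
    }

Because every component reads and writes the same CAS, recombining them stays cheap.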

We are not using the distributed scale-out features of UIMA but rely on various AWS services
instead; it takes a bit of tinkering to figure out how to do this, but we are gradually getting
there. Generally we do the unstructured NLP processing on a document-by-document basis in UIMA,
and then do corpus-wide structured analysis outside UIMA using MapReduce-style approaches. That
said, we are now also moving towards stream-based approaches, since we have to ingest large
amounts of data continuously, and running very large MapReduce batch jobs on a daily basis is
in our case wasteful and impractical.
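To make the per-document part concrete, here is a rough sketch of the pattern, under the
assumption that each worker (e.g. a stream consumer) holds one analysis engine and reuses
one CAS across documents; the class itself is hypothetical:

    import org.apache.uima.analysis_engine.AnalysisEngine;
    import org.apache.uima.jcas.JCas;

    public class DocumentWorker {
        private final AnalysisEngine engine;
        private final JCas jcas;

        public DocumentWorker(AnalysisEngine engine) throws Exception {
            this.engine = engine;
            this.jcas = engine.newJCas(); // allocate the CAS once, reuse it per document
        }

        public void process(String documentText) throws Exception {
            jcas.reset();                       // drop annotations from the previous document
            jcas.setDocumentText(documentText);
            engine.process(jcas);
            // ...walk the annotation indexes here and emit whatever structured
            // record the downstream MapReduce/stream job expects...
        }
    }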

I think UIMA feels a bit “old school” with all these XML descriptors, but there is purpose
behind them once you start to understand the architecture. Luckily this is where uimaFIT comes
to the rescue. We don’t use the Eclipse tools at all but integrate JCasGen with Gradle using
this nice plugin: https://github.com/Dictanova/gradle-jcasgen-plugin. I do wish UIMA had
direct Gradle support out of the box as well. We don’t want to rely on IDE-specific tools
ourselves, since we use both Eclipse and IntelliJ IDEA in development and need the code
generation integrated with the automated build process. The main difference is that we only
need to write the type definitions directly in XML; for the analysis engine and pipeline
descriptions we can just use uimaFIT. However, be prepared to do some digging, since not every
detail is covered as well in the uimaFIT documentation as it is for the general UIMA framework.
Community responses on this mailing list are a big plus, though.
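As an example of what “just use uimaFIT” means in practice, below is a hypothetical annotator
whose configuration is declared in code, so no XML descriptor is needed for the engine (the
class and parameter names are made up for illustration):

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    import org.apache.uima.UimaContext;
    import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
    import org.apache.uima.fit.component.JCasAnnotator_ImplBase;
    import org.apache.uima.fit.descriptor.ConfigurationParameter;
    import org.apache.uima.jcas.JCas;
    import org.apache.uima.resource.ResourceInitializationException;

    // Hypothetical annotator: uimaFIT reads the @ConfigurationParameter
    // annotations and generates the engine description for you.
    public class StopwordFlagger extends JCasAnnotator_ImplBase {

        public static final String PARAM_STOPWORDS = "stopwords";
        @ConfigurationParameter(name = PARAM_STOPWORDS, mandatory = true)
        private String[] stopwords;

        private Set<String> stopwordSet;

        @Override
        public void initialize(UimaContext context) throws ResourceInitializationException {
            super.initialize(context);
            stopwordSet = new HashSet<>(Arrays.asList(stopwords));
        }

        @Override
        public void process(JCas jcas) throws AnalysisEngineProcessException {
            // ...iterate token annotations here and flag those in stopwordSet...
        }
    }

Wiring it into a pipeline is then a one-liner, e.g.
createEngineDescription(StopwordFlagger.class, StopwordFlagger.PARAM_STOPWORDS,
new String[] { "the", "a", "of" }).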

Cheers
Mario


> On 26 Apr 2015, at 11:05, Petr Baudis <pasky@ucw.cz> wrote:
> 
>  Hi!
> 
> On Sun, Apr 26, 2015 at 10:12:05AM +0200, Martin Wunderlich wrote:
>> To provide a concrete scenario, would UIMA be useful in modeling the following processing
>> pipeline, given a corpus consisting of a number of text documents: 
>> 
>> - annotate each doc with meta-data extracted from it, such as publication date
>> - preprocess the corpus, e.g. by stopword removal and lemmatization
>> - save intermediate pre-processed and annotated versions of corpus (so that pre-processing
>> has to be done only once)
>> - run LDA (e.g. using Mallet) on the entire training corpus to model topics, with
>> number of topics ranging, for instance, from 50 to 100
>> - convert each doc to a feature vector as per the LDA model
> +
>> - extract paragraphs from relevant documents and use for unsupervised pre-training
>> in a deep learning architecture (built using e.g. Deeplearning4J)
> 
>  I think up to here, UIMA would be a good choice for you.
> 
>> - train and test an SVM for supervised text classification (binary classification
>> into „relevant“ vs. „non-relevant“) using cross-validation
>> - store each trained SVM
>> - report results of CV into CSV file for further processing
> 
>  The moment you stop dealing with *unstructured* data and just do feature
> vectors and classifier objects, it's imho easier to get out of UIMA,
> but that may not be a big deal.
> 
>> Would UIMA be a good choice to build and manage a project like this? 
>> What would be the advantages of UIMA compared to using simple shell scripts for „gluing
>> together“ the individual components? 
> 
>  Well, UIMA provides the gluing so you don't have to do it yourself,
> and that's not a small amount of work:
> 
>  (i) a common container (CAS) for annotated data
>  (ii) pipeline flow control that also supports scale out
>  (iii) the DKPro project, which lets you effortlessly perform NLP
> annotations, interface resources etc. using off-the-shelf NLP components
> 
>  For me, UIMA had a rather steep learning curve.  But that was largely
> because my pipeline is highly non-linear and I didn't use the Eclipse
> GUI tools; I would hope things should go pretty easily in a simpler
> scenario with a completely linear pipeline like yours.
> 
>  P.S.: Also, use uimaFIT to build your pipeline and ignore the annotator
> XML descriptors you see in the UIMA User Guide.  I recommend that you
> just look at the DKPro example suite to get started quickly.
> 
> -- 
> 				Petr Baudis
> 	If you do not work on an important problem, it's unlikely
> 	you'll do important work.  -- R. Hamming
> 	http://www.cs.virginia.edu/~robins/YouAndYourResearch.html

