uima-user mailing list archives

From Petr Baudis <pa...@ucw.cz>
Subject Re: Using UIMA to build an NLP system
Date Sun, 26 Apr 2015 09:05:00 GMT

On Sun, Apr 26, 2015 at 10:12:05AM +0200, Martin Wunderlich wrote:
> To provide a concrete scenario, would UIMA be useful in modeling the following processing
> pipeline, given a corpus consisting of a number of text documents:
> - annotate each doc with meta-data extracted from it, such as publication date
> - preprocess the corpus, e.g. by stopword removal and lemmatization
> - save intermediate pre-processed and annotated versions of corpus (so that pre-processing
>   has to be done only once)
> - run LDA (e.g. using Mallet) on the entire training corpus to model topics, with number
>   of topics ranging, for instance, from 50 to 100
> - convert each doc to a feature vector as per the LDA model
> - extract paragraphs from relevant documents and use for unsupervised pre-training in
>   a deep learning architecture (built using e.g. Deeplearning4J)

  I think up to here, UIMA would be a good choice for you.

> - train and test an SVM for supervised text classification (binary classification into
>   „relevant“ vs. „non-relevant“) using cross-validation
> - store each trained SVM
> - report results of CV into CSV file for further processing

  The moment you stop dealing with *unstructured* data and just handle
feature vectors and classifier objects, it's IMHO easier to step outside
UIMA, but that may not be a big deal.
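
  To illustrate why this stage needs little framework support, here is a
minimal sketch of the cross-validation and CSV-reporting steps in plain
Python (standard library only). Everything in it is an assumption made for
illustration: the function names, the toy vectors and labels, and above all
the "model", which is a majority-class baseline standing in for the SVM
(it ignores the feature vectors entirely). It is not part of UIMA, Mallet
or any SVM library.

```python
import csv
import io
from collections import Counter


def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds over n samples."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size


def train_majority(train_labels):
    """Placeholder for SVM training: returns the most common training label."""
    return Counter(train_labels).most_common(1)[0][0]


def cross_validate(vectors, labels, k):
    """Return one result row per fold. `vectors` is unused by the baseline."""
    rows = []
    for fold, (train, test) in enumerate(k_fold_indices(len(labels), k)):
        predicted = train_majority([labels[i] for i in train])
        correct = sum(1 for i in test if labels[i] == predicted)
        rows.append({"fold": fold, "n_test": len(test),
                     "accuracy": correct / len(test)})
    return rows


def write_csv(rows, out):
    """Write the per-fold rows to any file-like object."""
    writer = csv.DictWriter(out, fieldnames=["fold", "n_test", "accuracy"])
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical per-document feature vectors (e.g. LDA topic weights) and labels.
vectors = [[0.1, 0.9], [0.8, 0.2], [0.7, 0.3],
           [0.2, 0.8], [0.9, 0.1], [0.6, 0.4]]
labels = ["non-relevant", "relevant", "relevant",
          "non-relevant", "relevant", "relevant"]

rows = cross_validate(vectors, labels, k=3)
buffer = io.StringIO()  # swap for open("cv_results.csv", "w", newline="")
write_csv(rows, buffer)
csv_text = buffer.getvalue()
```

  Replacing `train_majority` with a real classifier changes only that one
function; the fold bookkeeping and the CSV output stay the same.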

> Would UIMA be a good choice to build and manage a project like this? 
> What would be the advantages of UIMA compared to using simple shell scripts for „gluing
> together“ the individual components?

  Well, UIMA provides the glue so you don't have to write it yourself,
and that's not a small amount of work:

  (i) a common container (CAS) for annotated data
  (ii) pipeline flow control that also supports scale out
  (iii) the DKPro project, which lets you effortlessly perform NLP
annotations, interface with resources, etc. using off-the-shelf NLP
components
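
  Points (i) and (ii) can be pictured with a toy model. The sketch below is
plain Python, not the UIMA API: `Document`, `Annotation`, `run_pipeline`
and the three annotator functions are hypothetical names invented for
illustration. It only shows the general idea of a shared container that
holds the text plus standoff annotations (type, begin offset, end offset),
with each pipeline step reading earlier annotations and adding its own.
A real CAS additionally has a declared type system, indexes and
serialization, none of which are modeled here.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Annotation:
    type: str        # e.g. "Token", "Stopword", "PublicationDate"
    begin: int       # character offset into the text, inclusive
    end: int         # character offset into the text, exclusive
    value: str = ""  # optional feature value


@dataclass
class Document:
    """Toy stand-in for a CAS: the text plus standoff annotations."""
    text: str
    annotations: List[Annotation] = field(default_factory=list)

    def select(self, type_name):
        return [a for a in self.annotations if a.type == type_name]

    def covered_text(self, annotation):
        return self.text[annotation.begin:annotation.end]


def date_extractor(doc):
    """Annotate the first ISO-style date as the publication date."""
    m = re.search(r"\b\d{4}-\d{2}-\d{2}\b", doc.text)
    if m:
        doc.annotations.append(
            Annotation("PublicationDate", m.start(), m.end(), m.group()))


def tokenizer(doc):
    """Add one Token annotation per run of word characters."""
    for m in re.finditer(r"\w+", doc.text):
        doc.annotations.append(Annotation("Token", m.start(), m.end()))


STOPWORDS = {"the", "a", "of", "on"}  # tiny illustrative list


def stopword_tagger(doc):
    """Mark stopwords instead of deleting them; the text is never modified."""
    for tok in doc.select("Token"):
        if doc.covered_text(tok).lower() in STOPWORDS:
            doc.annotations.append(Annotation("Stopword", tok.begin, tok.end))


def run_pipeline(doc, annotators):
    """Linear flow: every annotator sees what the previous ones added."""
    for annotate in annotators:
        annotate(doc)
    return doc


doc = run_pipeline(
    Document("Published 2015-04-26. The topic of the day."),
    [date_extractor, tokenizer, stopword_tagger],
)
stop_spans = {(s.begin, s.end) for s in doc.select("Stopword")}
content_words = [doc.covered_text(t) for t in doc.select("Token")
                 if (t.begin, t.end) not in stop_spans]
```

  Because annotators only add annotations and leave the text alone, a step
such as stopword removal becomes a filter applied when reading the
container, and intermediate results can be saved and reused by later steps.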

  For me, UIMA had a rather steep learning curve.  But that was largely
because my pipeline is highly non-linear and I didn't use the Eclipse
GUI tools; I would hope things go much more smoothly in a simpler
scenario with a completely linear pipeline like yours.

  P.S.: Also, use uimaFIT to build your pipeline and ignore the annotator
XML descriptors you see in the UIMA User Guide.  I recommend that you
just look at the DKPro example suite to get started quickly.

				Petr Baudis
	If you do not work on an important problem, it's unlikely
	you'll do important work.  -- R. Hamming
