mahout-user mailing list archives

From Burcu B <>
Subject Re: UIMA
Date Wed, 15 Jan 2014 13:52:04 GMT

Thank you, Jens. I was planning to use OpenNLP directly for named entity
recognition (the kind of analysis you've mentioned) and Lucene for
tokenization. However, UIMA has an OpenNLP component, too. What is the reason
to use UIMA instead of using OpenNLP and SOLR together?

I am planning to use Mahout and R together in the application, but other
libraries or algorithms could be added later. The program should therefore
be extensible through plugins, much like Atlassian's JIRA. Does UIMA's
component architecture make this easier than the other options?

Where does UIMA fit in a system that reads documents from different
sources, removes stop words, identifies named entities, indexes them, and
then classifies and clusters the text and indexes topics/labels? I am
confused about whether and why UIMA should be used.
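
To make the question concrete, here is a rough sketch of how I currently
picture the split you describe: per-document enrichment first, then feature
vectors handed on to the indexing/classification side. It is plain Python
and does not use the UIMA, OpenNLP, Lucene or Mahout APIs at all. The
Document class, the annotator functions, the stop-word list and the
gazetteer are all hypothetical stand-ins I made up for illustration.

```python
import re
from collections import Counter
from dataclasses import dataclass, field

# Toy data, assumed for illustration only (not from any real library).
STOP_WORDS = {"the", "a", "an", "in", "of", "and", "is", "at"}
GAZETTEER = {"Apache": "ORGANIZATION", "Berlin": "LOCATION"}


@dataclass
class Document:
    """Hypothetical container: raw text plus annotations added by each step."""
    text: str
    annotations: dict = field(default_factory=dict)


def tokenize(doc):
    # Stand-in for a real tokenizer; keeps alphabetic runs only.
    doc.annotations["tokens"] = re.findall(r"[A-Za-z]+", doc.text)


def remove_stop_words(doc):
    doc.annotations["content_tokens"] = [
        t for t in doc.annotations["tokens"] if t.lower() not in STOP_WORDS
    ]


def tag_entities(doc):
    # Stand-in for a real named entity recognizer: a simple dictionary lookup.
    doc.annotations["entities"] = [
        (t, GAZETTEER[t]) for t in doc.annotations["tokens"] if t in GAZETTEER
    ]


# Enrichment stage: each annotator sees one document at a time.
# A new analysis component would be added by appending to this list.
PIPELINE = [tokenize, remove_stop_words, tag_entities]


def enrich(text):
    doc = Document(text)
    for annotator in PIPELINE:
        annotator(doc)
    return doc


def to_features(doc):
    """Turn the enriched document into a sparse feature vector (term counts
    plus one counter per entity type) for a downstream indexer or classifier."""
    features = Counter(t.lower() for t in doc.annotations["content_tokens"])
    for _, label in doc.annotations["entities"]:
        features["ENTITY_" + label] += 1
    return features


if __name__ == "__main__":
    enriched = enrich("The Apache office is in Berlin")
    print(to_features(enriched))
```

If this picture is right, the enrichment list is the part UIMA would
replace, and to_features() is where Solr or Mahout would take over.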


On Wed, Jan 15, 2014 at 1:15 PM, Jens Grivolla <> wrote:

> Hello Burcu,
> UIMA has an entirely different purpose actually, and doesn't do
> classification or clustering.  You would rather use UIMA to enrich
> documents (individually) through text analysis and then use the result to
> create better feature vectors to use with Solr, Mahout, etc.
> We typically use UIMA to do named entity recognition, sentiment analysis,
> chunking, etc. and then index the result in Solr. From there you can either
> use it for retrieval (i.e. use the enriched representation to get a better
> document similarity measure) or extract the vectors to use with
> Mahout/Weka/Cluto/...
> HTH,
> Jens
> On 14/01/14 16:25, Burcu B wrote:
>> Hi,
>> I'd like to know why someone should prefer UIMA when developing an
>> application for end users to classify and cluster general purpose
>> documents?
>> I have two options:
>> 1- integrating Mahout, SOLR, R, Hadoop and other file sources such as
>>   document management systems or the file system.
>> 2- or doing these using UIMA.
>> Intuitively, I think that UIMA should be preferred, but I could not
>> justify my feeling. I need a list of pros and cons.
>> If you could suggest some resources, it would be great.
>> Thank you.
