uima-user mailing list archives

From Niels Ott <n...@sfs.uni-tuebingen.de>
Subject Re: Lucene cas consumer
Date Thu, 04 Dec 2008 18:36:59 GMT
Hi all,

I'm using both Lucene and UIMA in one project.

Lucene is primarily an information retrieval API. It provides a
framework and default implementations for analyzing several languages.
Analyzing means tokenization, stop words, etc. Furthermore, it brings
the key functionality to build an inverted index and to search it.

Lucene can be extended easily. For example, one can implement an analyzer
that does lemmatization, or one that looks up synonyms in WordNet and adds
them to the index.
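
A minimal sketch of the synonym idea, in plain Python rather than against the Lucene API. The `SYNONYMS` table and the `expand_synonyms` function are invented for illustration; a real analyzer would query WordNet and plug into Lucene's analysis chain instead.

```python
# Hypothetical synonym table standing in for a WordNet lookup.
SYNONYMS = {"car": ["automobile"], "quick": ["fast", "rapid"]}


def expand_synonyms(tokens, synonyms=SYNONYMS):
    """Return the tokens with each known synonym inserted after its source token."""
    expanded = []
    for token in tokens:
        expanded.append(token)
        # Tokens without an entry in the table are passed through unchanged.
        expanded.extend(synonyms.get(token, []))
    return expanded


if __name__ == "__main__":
    print(expand_synonyms(["quick", "car", "ride"]))
```

Indexing the expanded list means a query for "automobile" would also find documents that only mention "car".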

What Lucene cannot do - or at least not without a lot of hacking - is
aggregate analyses the way UIMA does with the CAS. Usually your knowledge
grows during a UIMA-based NLP pipeline: you add a token annotation,
a lemma annotation, a POS annotation and so on... In Lucene, you have
the classical pipeline: the output replaces the input. (Yes, by
subclassing Lucene's "Token" class, one can work around the issue, but
it is not elegant at all.)
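
To make the contrast concrete, here is a rough sketch in plain Python. It uses neither the UIMA nor the Lucene API: `Annotation`, `Cas`, `tokenize`, `lemmatize` and `lowercase_filter` are names made up for this illustration, and the "lemmatizer" is a crude placeholder.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Annotation:
    """Stand-off annotation: a typed span over the unchanged document text."""
    type: str
    begin: int
    end: int
    value: str = ""


@dataclass
class Cas:
    """Toy stand-in for a CAS: the text plus every annotation added so far."""
    text: str
    annotations: list = field(default_factory=list)

    def add(self, annotation):
        self.annotations.append(annotation)

    def select(self, type_name):
        return [a for a in self.annotations if a.type == type_name]

    def covered_text(self, annotation):
        return self.text[annotation.begin:annotation.end]


def tokenize(cas):
    """First step: add one Token annotation per whitespace-separated word."""
    pos = 0
    for word in cas.text.split():
        begin = cas.text.index(word, pos)
        pos = begin + len(word)
        cas.add(Annotation("Token", begin, pos))


def lemmatize(cas):
    """Second step: add a Lemma per Token; the Tokens stay in the CAS."""
    for token in cas.select("Token"):
        # Crude placeholder for a real lemmatizer: lower-case, strip trailing "s".
        lemma = cas.covered_text(token).lower().rstrip("s")
        cas.add(Annotation("Lemma", token.begin, token.end, lemma))


def lowercase_filter(tokens):
    """Classical pipeline stage: the output stream replaces the input stream."""
    return [t.lower() for t in tokens]


if __name__ == "__main__":
    cas = Cas("Dogs chase Cats")
    tokenize(cas)
    lemmatize(cas)
    # Both layers are still available side by side after the second step.
    print([cas.covered_text(t) for t in cas.select("Token")])
    print([a.value for a in cas.select("Lemma")])
    # In the pipeline variant, the original casing is gone downstream.
    print(lowercase_filter("Dogs chase Cats".split()))
```

In the CAS-style variant, a later step can still ask for the original tokens; in the pipeline variant, each stage only sees what the previous stage emitted.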

What makes Lucene + UIMA interesting for me is a simple fact: I can do
all the NLP I want and be as flexible as I need in UIMA. Then I can feed
the outcome (or rather: a small part of it) into a Lucene index.
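
As an illustration only, the hand-over could look roughly like the following plain-Python sketch. It assumes the analysis results are available as (type, value) pairs and keeps just one annotation type for indexing; `index_terms` and `InvertedIndex` are hypothetical names, and in a real setup the indexing and searching would be done by Lucene itself.

```python
from collections import defaultdict


def index_terms(annotations, wanted_type):
    """Keep only the values of one annotation type; everything else is dropped."""
    return [value for (type_name, value) in annotations if type_name == wanted_type]


class InvertedIndex:
    """Toy inverted index: term -> set of ids of documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add_document(self, doc_id, terms):
        for term in terms:
            self.postings[term].add(doc_id)

    def search(self, term):
        # Use .get so that a lookup does not create an empty posting list.
        return sorted(self.postings.get(term, set()))


if __name__ == "__main__":
    # Made-up analysis results for two documents.
    analyses = {
        "doc1": [("Token", "Dogs"), ("Lemma", "dog"), ("POS", "NNS")],
        "doc2": [("Token", "dog"), ("Lemma", "dog"), ("POS", "NN"),
                 ("Token", "barks"), ("Lemma", "bark"), ("POS", "VBZ")],
    }
    index = InvertedIndex()
    for doc_id, annotations in analyses.items():
        index.add_document(doc_id, index_terms(annotations, "Lemma"))
    print(index.search("dog"))
    print(index.search("bark"))
```

Only the lemmas reach the index here; tokens and POS tags stay behind in the analysis results.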

In my special case, I'm not using a CAS Consumer, but I can imagine
other people would appreciate it in their application scenarios.

To conclude: Lucene and UIMA aren't competitors, but in some cases 
having one feeding the other is what you want.

Best,

    Niels


Greg Holmberg wrote:
> Roberto--
> 
> It does seem like there should be a close relationship between the
> two groups.
> 
> I don't know much about Lucene--can you educate me?  For example,
> have you given any thought to what to do with UIMA annotations?  From
> what little I've read about Lucene, they seem to have a thing called
> a document analyzer, but they don't mean the same thing we mean by
> analysis in the NLP community.  They appear to mean something more
> like "tokenizer".  So I haven't yet found a place to put UIMA
> annotations, say for example, named entities or parts of speech.  I'm
> wondering if Lucene needs a major feature enhancement before it's
> truly useful with UIMA?
> 
> What are your thoughts on how to integrate the two?  What
> functionality is possible?
> 
> Greg Holmberg
> 
> 
> -------------- Original message ---------------------- From: "Roberto
> Franchini" <ro.franchini@gmail.com>
>> Hi, I'm going to write a Lucene CAS consumer. The purpose is to
>> create a Lucene document, or more than one, for each CAS. Last year
>> (2007)  the JENA university lab (JULIE lab? is it right?) delivered
>> such a component, named LUCAS. Then it disappeared. LUCAS seems a
>> good piece of software. The Technische Universität Darmstadt
>> developed one too: http://www.ukp.tu-darmstadt.de/projects/dkpro/.
>> (I will write to them).
>> 
>> Is anybody interested in sharing knowledge and/or code for
>> that component? I think that Lucene and UIMA can be very good
>> friends :)
>> 
>> Roberto
>> 
>> PS: I apologize for my bad English.
>> 
>> -- Roberto Franchini http://www.celi.it http://www.blogmeter.it 
>> http://www.memesphere.it Tel +39-011-6600814 
>> jabber:ro.franchini@gmail.com skype:ro.franchini


-- 
Niels Ott - Computational Linguist (B.A.) - http://www.drni.de/niels/
           - My PGP key is available from your favorite key server.

Those who live in glass houses should always have Sidolin at hand!
