uima-user mailing list archives

From "Olivier Terrier" <olivier.terr...@temis.com>
Subject RE: Lucene cas consumer
Date Fri, 05 Dec 2008 08:44:56 GMT
Hi all

We at Temis have also built a prototype integration of Lucene and UIMA as a proof of concept.
More precisely, we have written a Solr CAS consumer.
Solr (http://lucene.apache.org/solr/) is a Lucene subproject that provides an indexing
server layer on top of Lucene.
The idea was to index documents through a UIMA processing chain, with both
full text and entities derived from UIMA annotations.
Moreover, Solr supports 'faceted search', which can be based on annotations.
Suppose you have a UIMA type system that defines annotations such as Person, Company, Location,
etc. You can easily index these entities into a Lucene index using the Solr Java API.
In the prototype we also used a Solr contribution (not yet integrated into the trunk) named
solr-ui, available here
It provides a simple UI to search your indexed documents using a combination of full
text and facets (see the attached screenshot).
Of course, our Solr consumer is for now a very basic piece of code: for example, it is tightly
coupled to our own type system. We would be more than happy to collaborate with the community
on this subject if there is interest.
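
As a rough illustration of the idea described above, here is a minimal sketch in plain Java.
It deliberately uses no Solr, Lucene, or UIMA classes. The Annotation class, the toFields and
facetCounts methods, the lower-cased field names, and the sample sentences are all assumptions
made up for illustration; this is not the prototype's code and not any Solr API. It only shows
how annotations could become extra fields next to the full text, and how facet counts could be
derived from those fields.

```java
import java.util.*;

final class AnnotationFacetSketch {

    /** Hypothetical stand-in for a UIMA annotation: a type name plus the covered text. */
    static final class Annotation {
        final String type;
        final String coveredText;

        Annotation(String type, String coveredText) {
            this.type = type;
            this.coveredText = coveredText;
        }
    }

    /** Flattens one document into field name -> values: the full text plus one field per annotation type. */
    static Map<String, List<String>> toFields(String text, List<Annotation> annotations) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        fields.put("text", new ArrayList<>(List.of(text)));
        for (Annotation a : annotations) {
            // Assumed naming convention: the field name is the lower-cased type name.
            String field = a.type.toLowerCase(Locale.ROOT);
            fields.computeIfAbsent(field, k -> new ArrayList<>()).add(a.coveredText);
        }
        return fields;
    }

    /** Counts, for one facet field, how many documents contain each value. */
    static Map<String, Integer> facetCounts(List<Map<String, List<String>>> docs, String field) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map<String, List<String>> doc : docs) {
            // De-duplicate so a value repeated within one document is counted once.
            for (String value : new TreeSet<>(doc.getOrDefault(field, List.of()))) {
                counts.merge(value, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample documents and annotations.
        List<Map<String, List<String>>> docs = List.of(
            toFields("Alice joined Acme in Berlin.", List.of(
                new Annotation("Person", "Alice"),
                new Annotation("Company", "Acme"),
                new Annotation("Location", "Berlin"))),
            toFields("Acme opened a second office.", List.of(
                new Annotation("Company", "Acme"))));

        System.out.println(facetCounts(docs, "company"));
        System.out.println(facetCounts(docs, "location"));
    }
}
```

In a real consumer, the field map would be handed to the indexing server instead of being kept
in memory, and the facet counts would come from the server's own faceting support.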


Olivier Terrier

> -----Original Message-----
> From: Niels Ott [mailto:nott@sfs.uni-tuebingen.de] 
> Sent: Thursday, December 4, 2008 19:37
> To: uima-user@incubator.apache.org
> Cc: Roberto Franchini
> Subject: Re: Lucene cas consumer
> Hi all,
> I'm using both Lucene and UIMA in one project.
> Lucene is primarily an information retrieval API. It provides 
> a framework and default implementations for analyzing several 
> languages.
> Analyzing means tokenization, stop words, etc. Furthermore, 
> it brings the key functionality to build an inverted index 
> and to search it.
> Lucene can be extended easily. For example, one can implement an 
> analyzer that does lemmatization or that looks up synonyms in 
> WordNet and adds them to the index.
> What Lucene cannot do - or at least not without a lot of 
> hacking - is aggregate analyses the way UIMA can using the CAS. 
> Usually your knowledge grows during a UIMA-based 
> NLP pipeline: you add a token annotation, a lemma 
> annotation, a POS annotation and so on... In Lucene, you 
> have the classical pipeline: the output replaces the input. 
> (Yes, by subclassing Lucene's "Token" class, one can work 
> around the issue, but it is not elegant at all.)
> What makes Lucene + UIMA interesting for me is a simple fact: 
> I can do all the NLP I want and be as flexible as I need in 
> UIMA. Then I can feed the outcome (or rather: a small part of 
> it) into a Lucene index.
> In my special case, I'm not using a CAS Consumer, but I can 
> imagine other people would appreciate it in their application 
> scenarios.
> To conclude: Lucene and UIMA aren't competitors, but in some 
> cases having one feeding the other is what you want.
> Best,
>     Niels
> Greg Holmberg schrieb:
> > Roberto--
> > 
> > It does seem like there should be a close relationship between the two 
> > groups.
> > 
> > I don't know much about Lucene--can you educate me?  For example, have 
> > you given any thought to what to do with UIMA annotations?  From what 
> > little I've read about Lucene, they seem to have a thing called a 
> > document analyzer, but they don't mean the same thing we mean by 
> > analysis in the NLP community.  They appear to mean something more 
> > like "tokenizer".  So I haven't yet found a place to put UIMA 
> > annotations, say for example, named entities or parts of speech.  I'm 
> > wondering if Lucene needs a major feature enhancement before it's truly 
> > useful with UIMA?
> > 
> > What are your thoughts on how to integrate the two?  What 
> > functionality is possible?
> > 
> > Greg Holmberg
> > 
> > 
> > -------------- Original message ----------------------
> > From: "Roberto Franchini" <ro.franchini@gmail.com>
> >> Hi, I'm going to write a Lucene CAS consumer. The purpose is to 
> >> create a Lucene document, or more than one, for each CAS. Last year
> >> (2007) the Jena university lab (JULIE Lab? Is that right?) delivered 
> >> such a component, named LUCAS. Then it disappeared. LUCAS seems to be a 
> >> good piece of software. The Technische Universität Darmstadt 
> >> developed one too: http://www.ukp.tu-darmstadt.de/projects/dkpro/.
> >> (I will write to them).
> >> 
> >> Is anybody interested in sharing knowledge and/or code to build that 
> >> component? I think that Lucene and UIMA can be very good friends :)
> >> 
> >> Roberto
> >> 
> >> PS: I apologize for my bad English.
> >> 
> >> -- Roberto Franchini http://www.celi.it http://www.blogmeter.it 
> >> http://www.memesphere.it Tel +39-011-6600814 
> >> jabber:ro.franchini@gmail.com skype:ro.franchini
> --
> Niels Ott - Computational Linguist (B.A.) - http://www.drni.de/niels/
>            - My PGP key is available from your favorite key server.
> Those who sit in a glass house should always have Sidolin at hand!
