Message-ID: <4938234B.1050408@sfs.uni-tuebingen.de>
Date: Thu, 04 Dec 2008 19:36:59 +0100
From: Niels Ott
To: uima-user@incubator.apache.org
CC: Roberto Franchini
Subject: Re: Lucene cas consumer
In-Reply-To: <120420081812.22059.49381D82000A70670000562B2200751090C0C0CFCD099D0A0D03040108@comcast.net>

Hi all,

I'm using both Lucene and UIMA in one project.

Lucene is primarily an information retrieval API. It provides a framework and default implementations for analyzing several languages. Here, analyzing means tokenization, stop word removal, etc. Furthermore, it provides the key functionality to build an inverted index and to search it.

Lucene can be extended easily. For example, one can implement an analyzer that does lemmatization or that looks up synonyms in WordNet and adds them to the index.

What Lucene cannot do, or at least not without a lot of hacking, is aggregate analyses the way UIMA can using the CAS. Usually your knowledge grows during a UIMA-based NLP pipeline: you add a token annotation, a lemma annotation, a POS annotation and so on. In Lucene, you have the classical pipeline: the output replaces the input. (Yes, by subclassing Lucene's "Token" class, one can work around the issue, but it is not elegant at all.)

What makes Lucene + UIMA interesting for me is a simple fact: I can do all the NLP I want and be as flexible as I need in UIMA. Then I can feed the outcome (or rather, a small part of it) into a Lucene index.

In my special case, I'm not using a CAS Consumer, but I can imagine other people would appreciate one in their application scenarios.

To conclude: Lucene and UIMA aren't competitors, but in some cases having one feed the other is what you want.

Best,

Niels

Greg Holmberg wrote:
> Roberto--
>
> It does seem like there should be a close relationship between the
> two groups.
>
> I don't know much about Lucene--can you educate me? For example,
> have you given any thought to what to do with UIMA annotations? From
> what little I've read about Lucene, they seem to have a thing called
> a document analyzer, but they don't mean the same thing we mean by
> analysis in the NLP community.
> They appear to mean something more like "tokenizer". So I haven't
> yet found a place to put UIMA annotations, say for example, named
> entities or parts of speech. I'm wondering if Lucene needs a major
> feature enhancement before it's truly useful with UIMA?
>
> What are your thoughts on how to integrate the two? What
> functionality is possible?
>
> Greg Holmberg
>
>
> -------------- Original message ----------------------
> From: "Roberto Franchini"
>> Hi, I'm going to write a Lucene CAS consumer. The purpose is to
>> create a Lucene document, or more than one, for each CAS. Last year
>> (2007) the JENA university lab (JULIE lab? is that right?) delivered
>> such a component, named LUCAS. Then it disappeared. LUCAS seems to
>> be a good piece of software. The Technische Universität Darmstadt
>> developed one too: http://www.ukp.tu-darmstadt.de/projects/dkpro/.
>> (I will write to them).
>>
>> Is anybody interested in sharing knowledge and/or code for that
>> component? I think that Lucene and UIMA can be very good
>> friends :)
>>
>> Roberto
>>
>> PS: I apologize for my bad English.
>>
>> --
>> Roberto Franchini
>> http://www.celi.it http://www.blogmeter.it
>> http://www.memesphere.it Tel +39-011-6600814
>> jabber:ro.franchini@gmail.com skype:ro.franchini

--
Niels Ott - Computational Linguist (B.A.) - http://www.drni.de/niels/ -
My PGP key is available from your favorite key server.
Those who live in glass houses should always have Sidolin at hand!
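
The idea described in the message above (annotations accumulate in a shared structure, and only a small part of the result is fed into an inverted index) can be illustrated with a minimal sketch. It uses neither the UIMA nor the Lucene API: the dictionary standing in for a CAS, the function names, and the toy lemmatization rule are all assumptions made up for illustration.

```python
import re
from collections import defaultdict


def new_cas(text):
    # Hypothetical stand-in for a CAS: the text plus a growing list of annotations.
    return {"text": text, "annotations": []}


def tokenize(cas):
    # Adds one Token annotation per word; nothing already in the CAS is removed.
    for match in re.finditer(r"\w+", cas["text"]):
        cas["annotations"].append(
            {"type": "Token", "begin": match.start(), "end": match.end()}
        )


def lemmatize(cas):
    # Adds a Lemma annotation on top of each Token.
    # Toy rule (an assumption, not real lemmatization): lowercase, drop a final "s".
    tokens = [a for a in cas["annotations"] if a["type"] == "Token"]
    for token in tokens:
        surface = cas["text"][token["begin"]:token["end"]].lower()
        lemma = surface[:-1] if len(surface) > 3 and surface.endswith("s") else surface
        cas["annotations"].append(
            {"type": "Lemma", "begin": token["begin"], "end": token["end"], "value": lemma}
        )


def index_lemmas(index, doc_id, cas):
    # Feeds only the Lemma layer into the inverted index (lemma -> set of document ids).
    for annotation in cas["annotations"]:
        if annotation["type"] == "Lemma":
            index[annotation["value"]].add(doc_id)


index = defaultdict(set)
for doc_id, text in [("doc1", "Dogs chase cats."), ("doc2", "A cat sleeps.")]:
    cas = new_cas(text)
    tokenize(cas)
    lemmatize(cas)
    index_lemmas(index, doc_id, cas)
```

After the loop, the Token annotations are still present in each stand-in CAS alongside the Lemma annotations, while the index only ever sees the lemma values. A replace-style pipeline would instead hand the lemmas downstream and lose the original tokens.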