From: "Olivier Terrier"
Cc: "Roberto Franchini"
Reply-To: uima-user@incubator.apache.org
Subject: RE: Lucene cas consumer
Date: Fri, 5 Dec 2008 09:44:56 +0100
Message-ID: <205239E4006D14469A2D6CEC72AE925401B73481@EXV01001.GlobalSP.local>

Hi all,

At Temis we have also built a prototype integration of Lucene and UIMA
as a proof of concept. More precisely, we have written a Solr CAS
consumer.

Solr (http://lucene.apache.org/solr/) is a Lucene subproject that
provides an indexing server layer on top of Lucene.

The idea was to index documents through a UIMA processing chain, with
both full text and entities derived from UIMA annotations. Moreover,
Solr supports 'faceted search', which can be based on annotations.

Suppose you have a UIMA type system that defines annotations such as
Person, Company, Location, etc. You can easily index these entities
into a Lucene index using the Solr Java API.

In the prototype we also used a Solr contribution named solr-ui (not
yet integrated into the trunk), available here:
https://issues.apache.org/jira/browse/SOLR-634
It provides a simple UI for searching your indexed documents using a
combination of full text and facets (see the attached screenshot).

Of course, our Solr consumer is for now a very basic piece of code:
for example, it is tightly coupled to our own type system. But we
would be more than happy to collaborate with the community on this
subject if there is interest.

Regards,

Olivier Terrier
Temis

> -----Original Message-----
> From: Niels Ott [mailto:nott@sfs.uni-tuebingen.de]
> Sent: Thursday, December 4, 2008 19:37
> To: uima-user@incubator.apache.org
> Cc: Roberto Franchini
> Subject: Re: Lucene cas consumer
>
> Hi all,
>
> I'm using both Lucene and UIMA in one project.
>
> Lucene is primarily an information retrieval API.
> It provides a framework and default implementations for analyzing
> several languages. Analyzing means tokenization, stop words, etc.
> Furthermore, it brings the key functionality to build an inverted
> index and to search it.
>
> Lucene can be extended easily. E.g. one can implement an analyzer
> that does lemmatization or that looks up synonyms in WordNet and
> adds them to the index.
>
> What Lucene cannot do - or at least not without a lot of hacking -
> is aggregate analyses as UIMA can using the CAS. Usually your
> knowledge grows during a UIMA-based NLP pipeline: you add a token
> annotation, a lemma annotation, a POS annotation and so on... In
> Lucene, you have the classical pipeline: the output replaces the
> input. (Yes, by subclassing Lucene's "Token" class, one can fiddle
> around the issue, but it is not elegant at all.)
>
> What makes Lucene + UIMA interesting for me is a simple fact: I can
> do all the NLP I want and be as flexible as I need in UIMA. Then I
> can feed the outcome (or rather: a small part of it) into a Lucene
> index.
>
> In my special case, I'm not using a CAS Consumer, but I can imagine
> other people would appreciate it in their application scenarios.
>
> To conclude: Lucene and UIMA aren't competitors, but in some cases
> having one feeding the other is what you want.
>
> Best,
>
> Niels
>
>
> Greg Holmberg wrote:
> > Roberto--
> >
> > It does seem like there should be a close relationship between
> > the two groups.
> >
> > I don't know much about Lucene--can you educate me? For example,
> > have you given any thought to what to do with UIMA annotations?
> > From what little I've read about Lucene, they seem to have a
> > thing called a document analyzer, but they don't mean the same
> > thing we mean by analysis in the NLP community.
> > They appear to mean something more like "tokenizer". So I haven't
> > yet found a place to put UIMA annotations, say for example, named
> > entities or parts of speech. I'm wondering if Lucene needs a
> > major feature enhancement before it's truly useful with UIMA?
> >
> > What are your thoughts on how to integrate the two? What
> > functionality is possible?
> >
> > Greg Holmberg
> >
> >
> > -------------- Original message ----------------------
> > From: "Roberto Franchini"
> >> Hi, I'm going to write a Lucene CAS consumer. The purpose is to
> >> create a Lucene document, or more than one, for each CAS. Last
> >> year (2007) the Jena University lab (JULIE lab? is it right?)
> >> delivered such a component, named LUCAS. Then it disappeared.
> >> LUCAS seems a good piece of software. The Technische Universität
> >> Darmstadt developed one too:
> >> http://www.ukp.tu-darmstadt.de/projects/dkpro/.
> >> (I will write to them).
> >>
> >> Is anybody interested in sharing knowledge and/or code to build
> >> that component? I think that Lucene and UIMA can be very good
> >> friends :)
> >>
> >> Roberto
> >>
> >> PS: I apologize for my bad English.
> >>
> >> --
> >> Roberto Franchini
> >> http://www.celi.it http://www.blogmeter.it
> >> http://www.memesphere.it Tel +39-011-6600814
> >> jabber:ro.franchini@gmail.com skype:ro.franchini
>
>
> --
> Niels Ott - Computational Linguist (B.A.) - http://www.drni.de/niels/
> - My PGP key is available from your favorite key server.
>
> Anyone who lives in a glass house should always keep Sidolin handy!
>
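
The indexing idea in Olivier's message at the top of this thread (full
text plus entity fields that can later serve as facets) can be sketched
roughly as follows. This is only an illustration in plain Python. It is
not the Temis consumer and it does not use the Solr or UIMA APIs: the
Annotation class, the to_index_document function, and the field names
are all hypothetical, and a real consumer would read annotations from
the CAS and send the resulting document to Solr.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Hypothetical stand-in for a UIMA annotation: a typed span over the text."""
    type_name: str  # e.g. "Person", "Company", "Location"
    begin: int      # start offset into the document text
    end: int        # end offset (exclusive)


def to_index_document(doc_id, text, annotations, facet_types):
    """Build a Solr-style document: full text plus one multi-valued
    field per entity type listed in facet_types (assumed field layout)."""
    facets = defaultdict(list)
    for ann in annotations:
        if ann.type_name in facet_types:
            covered = text[ann.begin:ann.end]
            field = ann.type_name.lower()
            if covered not in facets[field]:  # keep each entity once per document
                facets[field].append(covered)
    doc = {"id": doc_id, "text": text}
    doc.update(facets)
    return doc


text = "Olivier Terrier works at Temis."
annotations = [
    Annotation("Person", 0, 15),
    Annotation("Company", 25, 30),
    Annotation("Token", 0, 7),  # not a facet type, so it is ignored
]
doc = to_index_document("doc-1", text, annotations, {"Person", "Company", "Location"})
print(doc)
```

The full text stays searchable through the "text" field, while the
per-type fields ("person", "company") are the kind of values a faceted
search could group on. Annotation types outside facet_types, such as
the token annotation here, are simply left out of the document.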