From: "Dr. Armin Wegner"
To: user@uima.apache.org
Subject: Re: AW: AW: Lucas
Date: Thu, 28 Aug 2014 09:21:24 +0200

Hello Erik,

in Lucene 4.9 (maybe earlier), you can replace the Lucene analyzer with a
UIMA pipeline. At least the docs say so. I don't know how good it is because
I've never used it.

Cheers,
Armin

On 8/26/14, Erik Fäßler wrote:
> Hi all,
>
> actually, I don't use LuCas anymore to write a Lucene index but rather to
> send the created documents to Solr or ElasticSearch. There are two reasons I
> continue to use LuCas: its field merging capabilities and the term cover
> mechanics.
> Regarding the field merging: I have a lot of machine learning components in
> my pipeline, which is nothing I could do within a Lucene analyzer. Suppose I
> recognize entities in the text with an ML component and each entity has an
> ID. Consider this example:
>
> Barack Obama entered the White House.
>
> Let's pretend we would require an ML system to recognize "White House" as
> THE one White House and let's say we gave it the ID "entity1".
> My goal is to be able to search for the ID in the same way I would when using
> a synonym filter, thus finding a document by terms that originally were not
> included in this document's text, AND be able to correctly highlight the
> corresponding text snippet. So, when I search for "entity1" (e.g. because
> the user wants to see documents dealing with the White House), I want to
> find the above example document with the string "White House" highlighted.
> LuCas can do this for me by aligning or merging the text TokenStream with
> the entity TokenStream, just as it is done within the CAS itself.
>
> If this functionality can be achieved without using LuCas, please tell me,
> I'd be happy to switch to up-to-date, maintained default components. Until
> now I am under the impression this cannot be done by another component.
>
> The term cover mechanics allow me to easily distribute terms across document
> fields in a predefined, possibly overlapping, set division, the set cover. I
> use it to automatically deal with a lot of faceting fields. Here, I can
> model n:n mappings from CAS indexes to Lucene fields, e.g. mapping terms
> originating from one CAS index to 10 Lucene fields, or the other way round.
> Again, if this is easily possible with another existing, maintained
> component, please point me to it.
>
> In short: I, too, ultimately don't use Lucene but Solr/ES. However, LuCas
> has some (Lucene) document fine-tuning capabilities I need/work with.
> This means: I don't necessarily need LuCas in a Lucene-updated version. I
> use it more as a fine-tuned TokenStream-smith. I might need it to be
> updated in the future if LuCas is not able to express a specific feature
> of a newer Lucene version.
>
> I hope this wall of text was understandable, thanks for reading through it
> ;-)
>
> Best,
>
> Erik
>
>
>
>> On 26 Aug 2014, at 09:43, wrote:
>>
>> Hi Erik and Jörn,
>>
>> I've used Solr in the meantime. It is so easy to quickly write a CAS
>> consumer that sends documents to a Solr web service. Writing to a Lucene
>> index is only slightly more work. Could this be the reason why nobody cares
>> about the outdated version? Is there really a need for Lucas and Solrcas
>> anymore? What do you think? It would be nice to have some opinions on
>> this.
>>
>> Of all people reading this list, who wants to have a Lucas or Solrcas for
>> the current version of Lucene?
>>
>> Cheers,
>> Armin
>>
>> -----Original Message-----
>> From: Erik Fäßler [mailto:erik.faessler@uni-jena.de]
>> Sent: Friday, 22 August 2014 16:34
>> To: user@uima.apache.org
>> Subject: Re: AW: Lucas
>>
>> I am using LuCas in production, in the last SNAPSHOT version that can be
>> found in the SVN but not in the Maven repository. I was also not aware a
>> patch would be required to get it to work; I am using it in its current
>> SVN state, including the splitter filter.
>> I would be willing to help with a migration and contribute to
>> discussions/plans. However, I won't have time to do it all on my own,
>> especially since I use it as a bridge to Solr/ElasticSearch, which kind of
>> remedies the version difference. So far I have used it with newer Solr/ES
>> versions without problems.
>>
>> I will be on vacation for two weeks; after that I'd be available for
>> contributions.
>>
>> Best,
>>
>> Erik
>>
>>> On 22 Aug 2014, at 15:36, Jörn Kottmann wrote:
>>>
>>> It would probably be nice to migrate those to the current versions of
>>> Lucene/Solr.
>>>
>>> Jörn
>>>
>>>> On 08/13/2014 08:44 AM, Armin.Wegner@bka.bund.de wrote:
>>>> Hi Renaud,
>>>>
>>>> that's nice, thank you. Are you using Lucene 4.x or an older version?
>>>>
>>>> It has been a while since I asked that question, and I didn't get much
>>>> response. Is the project dead? Is it just too easy to code a simple
>>>> annotator for Lucene or Solr to justify the effort of maintaining Lucas
>>>> and Solrcas?
>>>>
>>>> Cheers,
>>>> Armin
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Renaud Richardet [mailto:renaud.richardet@epfl.ch]
>>>> Sent: Monday, 11 August 2014 23:12
>>>> To: user@uima.apache.org
>>>> Subject: Re: Lucas
>>>>
>>>> Hi Armin,
>>>>
>>>> I used it a while ago.
>>>> I had to apply the following patch to make it work:
>>>> https://gist.github.com/renaud/bc34a48ca22f787f6c11
>>>>
>>>> HTH, Renaud
>>>>
>>>>
>>>>> On Mon, Jul 28, 2014 at 2:55 PM, wrote:
>>>>>
>>>>> Hi!
>>>>>
>>>>> Is someone using Lucas? It seems to be slightly outdated. It depends
>>>>> on Lucene 2.9.3. Lucene is at version 4.9.0 right now. Is there an
>>>>> alternative?
>>>>>
>>>>> Regards,
>>>>> Armin
>>>>
>>>> --
>>>> Renaud Richardet
>>>> Blue Brain Project PhD candidate
>>>> EPFL Station 15
>>>> CH-1015 Lausanne
>>>> phone: +41-78-675-9501
>>>> http://people.epfl.ch/renaud.richardet
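
The field merging described in Erik's quoted message (stacking an entity ID on
the text it was recognized in, so that a search for the ID highlights the
original words) can be illustrated with a small sketch. This is not LuCas or
Lucene code. It is a plain-Python illustration, and the names Token, tokenize,
merge and highlight are made up for this example. It assumes the entity token
is placed at the same position as the first word it covers (position
increment 0) and carries the character offsets of the whole entity span.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    term: str
    start: int        # character offset, inclusive
    end: int          # character offset, exclusive
    pos_inc: int = 1  # 0 means "same position as the previous token"


def tokenize(text):
    """Split on word characters, lowercase, and record character offsets."""
    return [Token(m.group().lower(), m.start(), m.end())
            for m in re.finditer(r"\w+", text)]


def merge(text_tokens, entity_tokens):
    """Stack each entity token on the text token starting at the same offset."""
    by_start = {}
    for entity in entity_tokens:
        by_start.setdefault(entity.start, []).append(entity)
    merged = []
    for token in text_tokens:
        merged.append(token)
        for entity in by_start.get(token.start, []):
            # Same position as the text token, but offsets of the whole entity.
            merged.append(Token(entity.term, entity.start, entity.end, pos_inc=0))
    return merged


def highlight(text, tokens, query):
    """Bracket the span of the first token whose term equals the query."""
    for token in tokens:
        if token.term == query:
            return (text[:token.start] + "[" + text[token.start:token.end]
                    + "]" + text[token.end:])
    return None


text = "Barack Obama entered the White House."
span_start = text.index("White House")
# Hypothetical output of an ML entity recognizer: one ID for the whole span.
entities = [Token("entity1", span_start, span_start + len("White House"))]
merged = merge(tokenize(text), entities)
print(highlight(text, merged, "entity1"))
```

Searching the merged stream for "entity1" brackets "White House" even though
the ID never occurs in the text, which is the behaviour the quoted message
asks for. A real index would also need the position information for phrase
queries; the sketch only tracks it in pos_inc.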