Date: Thu, 3 Oct 2013 15:24:07 +0200
From: Stephane Gamard
To: Tommaso Teofili
Reply-To: dev@clerezza.apache.org
Subject: Re: Search in rdf.cris

Thank you Tommaso, 

I might need help, or at the very least some simple pointers and a discussion of certain principles and guidelines.

The first one is the choice between abstracting everything related to search (sort fields, queries, filters and facets) and using the native Lucene objects. Here is a small overview of the pros and cons (for the rdf.cris package, not the implementation packages).

Native Lucene
+ Objects already exist and are well implemented (SortField, Facet, …)
- Bound to Lucene semantics. They are fairly easy to use, but some implementation providers would have to translate from the Lucene objects, for example if someone wants to write a "Fast" or GSA implementation for Clerezza. Note that Lucene, Solr and Elasticsearch can fairly easily work with native Lucene objects.
+/- All search-ability logic should go into helper classes, so that external packages are not forced to talk "Lucene".

Abstracted Classes
- A LOT of re-coding of concepts that are straightforward in Lucene
+ No Lucene dependencies and no need for helper classes
+ Not bound to any implementation: a rewrite for Solr, GSA, Fast, … would not require basic knowledge of Lucene.

I'd be interested in your POV on this. My main goal is for people outside of the rdf.cris package never to have to learn any specialised API, while still taking advantage of all the IR features of any search engine.
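
To make the "Abstracted Classes" option a bit more concrete, here is a minimal sketch of what I have in mind, assuming plain Java with no Lucene dependency. All of the names (SortSpec, FacetSpec, SearchRequest, SearchBackend, ParamStringBackend) are hypothetical and exist neither in rdf.cris nor in Lucene. The toy backend renders a string that is only loosely modelled on Solr's q, sort and facet.field request parameters; it is not URL-encoded and is not a working Solr client.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Engine-neutral sort instruction (hypothetical, not an rdf.cris class). */
final class SortSpec {
    final String field;
    final boolean descending;

    SortSpec(String field, boolean descending) {
        this.field = field;
        this.descending = descending;
    }
}

/** Engine-neutral facet request on a single field (hypothetical). */
final class FacetSpec {
    final String field;

    FacetSpec(String field) {
        this.field = field;
    }
}

/** What callers outside the search package would build (hypothetical). */
final class SearchRequest {
    final String queryText;
    final List<SortSpec> sorts;
    final List<FacetSpec> facets;

    SearchRequest(String queryText, List<SortSpec> sorts, List<FacetSpec> facets) {
        this.queryText = queryText;
        // Defensive copies keep the request immutable once built.
        this.sorts = Collections.unmodifiableList(new ArrayList<>(sorts));
        this.facets = Collections.unmodifiableList(new ArrayList<>(facets));
    }
}

/** Each engine implementation translates the neutral request into its own type Q. */
interface SearchBackend<Q> {
    Q translate(SearchRequest request);
}

/** Toy backend for illustration only: renders a Solr-like parameter string. */
final class ParamStringBackend implements SearchBackend<String> {
    @Override
    public String translate(SearchRequest request) {
        StringBuilder sb = new StringBuilder("q=").append(request.queryText);
        if (!request.sorts.isEmpty()) {
            List<String> parts = new ArrayList<>();
            for (SortSpec s : request.sorts) {
                parts.add(s.field + (s.descending ? " desc" : " asc"));
            }
            sb.append("&sort=").append(String.join(",", parts));
        }
        for (FacetSpec f : request.facets) {
            sb.append("&facet.field=").append(f.field);
        }
        return sb.toString();
    }
}

class SearchSketch {
    public static void main(String[] args) {
        SearchRequest request = new SearchRequest(
                "clerezza",
                Arrays.asList(new SortSpec("title", false)),
                Arrays.asList(new FacetSpec("author")));
        System.out.println(new ParamStringBackend().translate(request));
    }
}
```

A Lucene-based backend would implement the same interface and build the native objects internally, so the cost of this option is exactly the re-coding listed above: every concept (sort, facet, filter, …) needs its own neutral class plus one translation per engine.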

_Stephane


On October 3, 2013 at 1:59:07 PM, Tommaso Teofili (tommaso.teofili@gmail.com) wrote:

Hi Stephane,

I don't have much time right now, but I just wanted to let you know that IMHO your
list of goals / tasks sounds completely reasonable. In case you need it, I
may be able to give some help over the coming weeks.

Regards,
Tommaso


2013/10/2 Stephane Gamard <stephane@gamard.net>

> Hi Team,
>
> My name's Stephane and I am currently participating in the Fusepool FP7
> project. Within this project we are using Stanbol & Clerezza as key
> architectural components. Coming from a pure full-text search and
> Information Retrieval background, I find myself in somewhat new
> territory.
>
> But within that new territory there is a link to my area of expertise:
> Lucene/Solr in the rdf.cris package. This package turns out to be crucial
> for our project, and I would gladly participate and contribute my knowledge
> as a Lucene and Solr developer. So here, in a nutshell, is a list of "small
> contributions" to start with:
>
> - Abstraction Refactoring
> Currently CRIS uses Lucene as its FT engine, but we might want to
> eventually move to Solr (or Elasticsearch, for XYZ reasons). The first step would
> be to remove all Lucene dependencies from the rdf.cris package and push the
> implementation into an rdf.cris.lucene package
>
> - Lucene 4.x Branch
> There have been a large number of changes since the 2.x and 3.x branches of
> Lucene. I'd propose a small refactoring and overhaul of the rdf.cris.lucene
> package to take advantage of Lucene's new features (Facets, SearcherManager,
> …)
>
> - Solr Implementation
> In line with being "in production", I strongly believe Clerezza's CRIS component
> should be able to leverage established services without having to manage
> their scalability. That goes most obviously for full-text search. The idea
> is to be able to use a remote Solr server (Solr, since it comes with a
> whole pseudo-REST service layer on top of Lucene).
>
> - Fine-Grained Search
> As a logical evolution of the points above, it would then be perfect if
> Clerezza's full-text search capabilities could benefit from all the features
> of Lucene/Solr. I am especially thinking about:
> -- Field/Analyzer specialisation (we don't compare authors, dates and text
> in the same way in Lucene/Solr)
> -- Boosting (for IR, the title of a document usually carries more important
> information than its footnotes)
> -- Advanced facets (things like date-range facets and pivot facets (called 2nd
> level facets in Fusepool))
> -- Geolocalised searches (a big thing in the Lucene/Solr 4.x branch… would
> eventually be a nice-to-have)
>
> I will carry out this work over the next few weeks/months as part of the
> Fusepool project, but most of all I would be pleased and interested to
> finally get a top-notch implementation of a cross RDF-text solution. Very
> much looking forward to your feedback and, hopefully, support ;)
>
> PS: whoever initiated the GraphIndexer implementation did an excellent
> job! Will hopefully follow in his footsteps!
>
> Cheers,
>
> _Stephane
>
>