From: Stephane Gamard
Date: Wed, 2 Oct 2013 16:04:27 +0200
Subject: Search in rdf.cris
Delivered-To: mailing list dev@clerezza.apache.org

Hi Team,

My name's Stephane and I am currently participating in the Fusepool FP7 project. Within this project we are using Stanbol and Clerezza as key architectural components. Coming from a pure full-text search and Information Retrieval background, I find myself in somewhat new territory.

But within that new territory there is a link to my area of expertise: Lucene/Solr in the rdf.cris package. This package turns out to be crucial for our project, and I would gladly participate and contribute my knowledge as a Lucene and Solr developer. So here, in a nutshell, is a list of "small contributions" to start with:

- Abstraction Refactoring
Currently CRIS is using Lucene as its full-text (FT) engine, but we might eventually want to move to Solr (or Elasticsearch, for XYZ reasons). The first step would be to remove all Lucene dependencies from the rdf.cris package and push the implementation into an rdf.cris.lucene package.

- Lucene 4.x Branch
There have been a large number of changes since the 2.x and 3.x branches of Lucene. I'd propose a small refactor and overhaul of the rdf.cris.lucene package to take advantage of Lucene's new features (Facets, SearcherManager, …).

- Solr Implementation
With production use in mind, I strongly believe Clerezza's CRIS component should be able to leverage established services without having to manage their scalability itself. That goes most obviously for full-text search. The idea is to be able to use a remote Solr server (Solr, since it comes with the whole pseudo-REST service layer on top of Lucene).

- Fine-Grained Search
As a logical evolution of the points above, it would then be perfect if Clerezza's full-text search capabilities could benefit from all the features of Lucene/Solr. I am especially thinking about:
-- Field/Analyzer specialisation (we don't compare authors, dates and text in the same way in Lucene/Solr)
-- Boosting (for IR, the title of a document usually yields more important information than its footnotes)
-- Advanced facets (things like date-range facets and pivot facets (called 2nd-level facets in Fusepool))
-- Geolocalised searches (a big thing in the Lucene/Solr 4.x branch… would eventually be a nice-to-have)
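
To make the abstraction and boosting points a bit more concrete, here is a minimal sketch of what an engine-neutral contract could look like. Everything in it is an assumption for illustration: the names CrisSearchSketch, FullTextEngine and InMemoryEngine, the method signatures and the boost value are hypothetical and are not part of the current rdf.cris API. The in-memory substring matcher only stands in for a real Lucene- or Solr-backed implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: none of these types exist in rdf.cris today.
final class CrisSearchSketch {

    /** Engine-neutral contract the core rdf.cris package could depend on. */
    interface FullTextEngine {
        /** Indexes one field value of the resource identified by resourceUri. */
        void index(String resourceUri, String field, String text);

        /** Returns the URIs of resources matching the term, best score first. */
        List<String> search(String term);
    }

    /** Toy stand-in for an implementation that would live in e.g. rdf.cris.lucene. */
    static final class InMemoryEngine implements FullTextEngine {

        /** One indexed field value. */
        private static final class Entry {
            final String resourceUri;
            final String field;
            final String text;

            Entry(String resourceUri, String field, String text) {
                this.resourceUri = resourceUri;
                this.field = field;
                this.text = text;
            }
        }

        private final Map<String, Double> fieldBoosts;
        private final List<Entry> entries = new ArrayList<>();

        /** fieldBoosts maps a field name to its weight; unlisted fields weigh 1.0. */
        InMemoryEngine(Map<String, Double> fieldBoosts) {
            this.fieldBoosts = fieldBoosts;
        }

        @Override
        public void index(String resourceUri, String field, String text) {
            entries.add(new Entry(resourceUri, field, text.toLowerCase(Locale.ROOT)));
        }

        @Override
        public List<String> search(String term) {
            String needle = term.toLowerCase(Locale.ROOT);
            Map<String, Double> scores = new LinkedHashMap<>();
            for (Entry e : entries) {
                if (e.text.contains(needle)) {
                    // Each matching field adds its boost to the resource's score.
                    scores.merge(e.resourceUri, fieldBoosts.getOrDefault(e.field, 1.0), Double::sum);
                }
            }
            List<String> result = new ArrayList<>(scores.keySet());
            // Highest score first.
            result.sort((a, b) -> Double.compare(scores.get(b), scores.get(a)));
            return result;
        }
    }

    public static void main(String[] args) {
        // A title match counts three times as much as a match elsewhere.
        FullTextEngine engine = new InMemoryEngine(Map.of("title", 3.0));
        engine.index("urn:doc:1", "footnote", "mentions lucene once");
        engine.index("urn:doc:2", "title", "Lucene in production");
        System.out.println(engine.search("lucene")); // prints [urn:doc:2, urn:doc:1]
    }
}
```

With a contract along these lines, callers would only ever see the interface, so a Lucene 4.x, remote Solr or Elasticsearch backend could be swapped in without touching the core package.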

I will carry out this work over the next few weeks/months as part of the Fusepool project, but most of all I would be pleased and interested to finally get a top-notch implementation of a cross RDF-text solution. Very much looking forward to your feedback and, hopefully, support ;)

PS: whoever initiated the GraphIndexer implementation did an excellent job! I will hopefully follow in his footsteps!

Cheers,

_Stephane
