From: Tommaso Teofili
Date: Thu, 3 Oct 2013 13:57:48 +0200
Subject: Re: Search in rdf.cris
To: dev@clerezza.apache.org

Hi Stephane,

I don't have much time right now, but I wanted to let you know that, IMHO, your list of goals and tasks sounds completely reasonable. If you need it, I may be able to help over the coming weeks.

Regards,
Tommaso

2013/10/2 Stephane Gamard

> Hi Team,
>
> My name's Stephane and I am currently participating in the Fusepool FP7
> project. Within this project we are using Stanbol and Clerezza as key
> architectural components. Coming from a pure full-text search and
> Information Retrieval background, I find myself in somewhat new
> territory.
>
> But within that new territory there is a link to my area of expertise:
> Lucene/Solr in the rdf.cris package. This package turns out to be crucial
> for our project, and I would gladly participate and contribute my
> knowledge as a Lucene and Solr developer. So here, in a nutshell, is a
> list of "small contributions" to start with:
>
> - Abstraction Refactoring
> Currently CRIS is using Lucene as its FT engine, but we might want to
> eventually go to Solr (or elasticsearch for XYZ reasons). The first step
> would be to remove all Lucene dependencies from the rdf.cris package and
> push the implementation into an rdf.cris.lucene package.
>
> - Lucene 4.x Branch
> There have been a large number of changes since the 2.x and 3.x branches
> of Lucene. I'd propose a small refactor and overhaul of the
> rdf.cris.lucene package to take advantage of Lucene's new features
> (Facets, SearchManager, …).
>
> - Solr Implementation
> In line with "in production", I strongly believe Clerezza's CRIS component
> should be able to leverage established services without having to manage
> their scalability. That goes for full-text search most obviously. The
> idea is to be able to use a remote Solr server (Solr, since it comes with
> the whole pseudo-REST servicing on top of Lucene).
>
> - Fine Grained Search
> As a logical evolution from the points above, it would then be perfect if
> Clerezza's full-text search capabilities could benefit from all the
> features of Lucene/Solr. I am especially thinking about:
> -- Field/Analyzer specialisation (we don't compare authors, dates and
> text in the same way in Lucene/Solr)
> -- Boosting (for IR, the title of a document usually yields more
> important information than its footnotes)
> -- Advanced facets (things like date-range facets and pivot facets,
> called 2nd level facets in Fusepool)
> -- Geolocalised searches (a big thing in the Lucene/Solr 4.x branch…
> would eventually be a nice-to-have)
>
> I will carry out this work over the next few weeks/months as part of the
> Fusepool project, but most of all I would be pleased and interested to
> finally get a top-notch implementation of a cross RDF-text solution. Very
> much looking forward to your feedback and, hopefully, support ;)
>
> PS: whoever initiated the GraphIndexer implementation did an excellent
> job! I will hopefully follow in his footsteps!
>
> Cheers,
>
> _Stephane
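
For the "Abstraction Refactoring" item in the quoted proposal, a minimal sketch of what an engine-neutral contract might look like is given here. Every name in it (FullTextEngineSketch, FullTextEngine, InMemoryEngine) is hypothetical and not taken from rdf.cris, and the in-memory backend is only a stand-in for illustration; a Lucene- or Solr-backed class would implement the same interface from its own package. It uses no Lucene or Solr API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: none of these types exist in rdf.cris.
final class FullTextEngineSketch {

    // Engine-neutral contract that indexing code would depend on,
    // instead of depending on Lucene classes directly.
    interface FullTextEngine {
        void index(String resourceUri, String field, String text);

        List<String> search(String field, String term);
    }

    // Stand-in backend: case-insensitive substring match held in memory.
    static final class InMemoryEngine implements FullTextEngine {
        // field name -> (resource URI -> lower-cased text), in insertion order
        private final Map<String, Map<String, String>> fields = new HashMap<>();

        @Override
        public void index(String resourceUri, String field, String text) {
            fields.computeIfAbsent(field, f -> new LinkedHashMap<>())
                  .put(resourceUri, text.toLowerCase(Locale.ROOT));
        }

        @Override
        public List<String> search(String field, String term) {
            String needle = term.toLowerCase(Locale.ROOT);
            Map<String, String> docs =
                fields.getOrDefault(field, Collections.<String, String>emptyMap());
            List<String> hits = new ArrayList<>();
            for (Map.Entry<String, String> doc : docs.entrySet()) {
                if (doc.getValue().contains(needle)) {
                    hits.add(doc.getKey());
                }
            }
            return hits;
        }
    }

    public static void main(String[] args) {
        FullTextEngine engine = new InMemoryEngine();
        engine.index("urn:doc:1", "title", "Search in rdf.cris");
        engine.index("urn:doc:2", "title", "Release notes");
        System.out.println(engine.search("title", "SEARCH")); // prints [urn:doc:1]
    }
}
```

Keeping the field name in the contract leaves room for the per-field analysis and boosting mentioned under "Fine Grained Search", though how those would be expressed is left open here.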