Mailing-List: contact dev-help@clerezza.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@clerezza.apache.org
Received-SPF: pass (nike.apache.org: domain of tommaso.teofili@gmail.com
 designates 209.85.220.41 as permitted sender)
MIME-Version: 1.0
In-Reply-To: <etPan.524c27eb.189a769b.1c43@dhcp-10-0-1-108.searchbox.lan>
References: <etPan.524c27eb.189a769b.1c43@dhcp-10-0-1-108.searchbox.lan>
From: Tommaso Teofili <tommaso.teofili@gmail.com>
Date: Thu, 3 Oct 2013 13:57:48 +0200
Message-ID: 
 <CAGnSx07uMJASKDVYLYaioPWjssj6tB4nwbxCsiwXCm6DODfRmQ@mail.gmail.com>
Subject: Re: Search in rdf.cris
To: dev@clerezza.apache.org
Content-Type: multipart/alternative; boundary=047d7b10cf4d76c86b04e7d4e743

--047d7b10cf4d76c86b04e7d4e743
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: quoted-printable

Hi Stephane,

I don't have much time right now, but I just wanted to let you know that,
IMHO, your list of goals / tasks sounds completely reasonable. In case you
need it, I may be able to help out over the next few weeks.

Regards,
Tommaso


2013/10/2 Stephane Gamard <stephane@gamard.net>

> Hi Team,
>
> My name's Stephane and I am currently participating in the Fusepool FP7
> project. Within this project we are using Stanbol & Clerezza as key
> architectural components. Coming from a pure FullText search and
> Information Retrieval background, I find myself in somewhat new
> territory.
>
> But within that new territory there is a link to my area of expertise:
> Lucene/Solr in the rdf.cris package. This package turns out to be crucial
> for our project and I would gladly participate and contribute my knowledge
> as a Lucene and Solr developer. So here, in a nutshell, is a list of
> "small contributions" to start with:
>
> - Abstraction Refactoring
> Currently CRIS is using Lucene as its FT engine, but we might want to
> eventually go to Solr (or elasticsearch for XYZ reasons). The first step
> would be to remove all Lucene dependencies from the rdf.cris package and
> push the implementation into the rdf.cris.lucene package.
>
> - Lucene 4.x Branch
> There have been a large number of changes since the 2.x and 3.x branches
> of Lucene. I'd propose a small refactor and overhaul of the
> rdf.cris.lucene package to take advantage of Lucene's new features
> (Facets, SearcherManager, ...)
>
> - Solr Implementation
> In line with "in production", I strongly believe Clerezza's CRIS
> component should be able to leverage established services without having
> to manage their scalability. That goes for FullText Search most obviously.
> The idea is to be able to use a remote Solr server (Solr since it comes
> with the whole pseudo-REST servicing on top of Lucene).
>
> - Fine Grained Search
> As a logical evolution of the points above, it would then be perfect if
> Clerezza's fulltext search capabilities could benefit from all the
> features of Lucene/Solr. I am especially thinking about:
> -- Field/Analyzer specialisation (we don't compare authors, dates and
> text in the same way in Lucene/Solr)
> -- Boosting (for IR, the title of a document usually yields more
> important information than its footnotes)
> -- Advanced facets (things like date-range facets and pivot facets,
> called 2nd level facets in Fusepool)
> -- Geolocalised searches (big thing in the Lucene/Solr 4.x branch...
> would eventually be a nice-to-have)
>
> I will execute this work over the next few weeks/months as part of the
> Fusepool project, but most of all I would be pleased and interested to
> finally get a top-notch implementation of a cross RDF-text solution. Very
> much looking forward to your feedback and, hopefully, support ;)
>
> PS: whoever initiated the GraphIndexer implementation did an excellent
> job! Will hopefully follow in his footsteps!
>
> Cheers,
>
> _Stephane
>
>

--047d7b10cf4d76c86b04e7d4e743--