clerezza-dev mailing list archives

From Stephane Gamard <steph...@gamard.net>
Subject Re: Search in rdf.cris
Date Thu, 03 Oct 2013 13:24:07 GMT
Thank you Tommaso, 

I might need help, or at the very least some pointers and discussion on certain principles
and guidelines.

The first one is the choice between abstracting everything related to search (such as sort
fields, queries, filters and facets) and using the native Lucene objects. Here is a small overview
of the pros and cons (for the rdf.cris package, not the implementation packages).

Native Lucene
+ Objects already exist and are well implemented (SortField, Facet, …)
- Bound to Lucene semantics (fairly easy to use, but certain impl providers will have to
translate from the Lucene objects, e.g. if someone wants to make a "Fast" or GSA impl for clerezza).
Note that Lucene, Solr and Elastic can fairly easily work with native Lucene objects
+/- All search-ability logic should go into helper classes, so as not to force external packages
to talk "Lucene"

Abstracted Classes
- A LOT of re-coding of concepts that are straightforward in Lucene
+ No Lucene dependencies and no need for helper classes
+ Not bound to any implementation; a rewrite for Solr, GSA, Fast, … will not require basic
knowledge of Lucene.
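
To make the "Abstracted Classes" option a bit more concrete, here is a minimal sketch of what an engine-neutral API might look like. Every name in it (SortSpec, SearchRequest, SearchResult, SearchBackend, InMemoryBackend) is hypothetical and not part of rdf.cris; the toy in-memory backend only stands in for an implementation package such as rdf.cris.lucene, and its substring matching is a placeholder for real analysis and scoring.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: none of these types exist in rdf.cris today.
public class CrisSearchSketch {

    /** Engine-neutral sort instruction (plays the role of Lucene's SortField). */
    static final class SortSpec {
        final String field;
        final boolean descending;

        SortSpec(String field, boolean descending) {
            this.field = field;
            this.descending = descending;
        }
    }

    /** Engine-neutral request: free text plus an optional sort and facet field. */
    static final class SearchRequest {
        final String text;
        final SortSpec sort;      // null means "keep the backend's own order"
        final String facetField;  // null means "no facet counts requested"

        SearchRequest(String text, SortSpec sort, String facetField) {
            this.text = text;
            this.sort = sort;
            this.facetField = facetField;
        }
    }

    /** Matching documents plus value counts for the requested facet field. */
    static final class SearchResult {
        final List<Map<String, String>> hits;
        final Map<String, Integer> facetCounts;

        SearchResult(List<Map<String, String>> hits, Map<String, Integer> facetCounts) {
            this.hits = hits;
            this.facetCounts = facetCounts;
        }
    }

    /** The only contract callers outside the implementation packages would see. */
    interface SearchBackend {
        SearchResult search(SearchRequest request);
    }

    /** Toy backend: case-insensitive substring match over all field values. */
    static final class InMemoryBackend implements SearchBackend {
        private final List<Map<String, String>> docs;

        InMemoryBackend(List<Map<String, String>> docs) {
            this.docs = docs;
        }

        @Override
        public SearchResult search(SearchRequest request) {
            String needle = request.text.toLowerCase();
            List<Map<String, String>> hits = new ArrayList<>();
            Map<String, Integer> facets = new TreeMap<>();
            for (Map<String, String> doc : docs) {
                boolean matches = false;
                for (String value : doc.values()) {
                    if (value.toLowerCase().contains(needle)) {
                        matches = true;
                        break;
                    }
                }
                if (!matches) {
                    continue;
                }
                hits.add(doc);
                String facetValue = request.facetField == null ? null : doc.get(request.facetField);
                if (facetValue != null) {
                    facets.merge(facetValue, 1, Integer::sum);
                }
            }
            if (request.sort != null) {
                final String field = request.sort.field;
                Comparator<Map<String, String>> cmp =
                        (a, b) -> a.getOrDefault(field, "").compareTo(b.getOrDefault(field, ""));
                hits.sort(request.sort.descending ? cmp.reversed() : cmp);
            }
            return new SearchResult(hits, facets);
        }
    }

    /** Convenience builder for a two-field sample document. */
    static Map<String, String> doc(String title, String author) {
        Map<String, String> fields = new HashMap<>();
        fields.put("title", title);
        fields.put("author", author);
        return fields;
    }

    public static void main(String[] args) {
        SearchBackend backend = new InMemoryBackend(Arrays.asList(
                doc("Lucene in Action", "Hatcher"),
                doc("Solr in Action", "Grainger"),
                doc("RDF Primer", "Hatcher")));
        SearchResult result = backend.search(
                new SearchRequest("action", new SortSpec("title", false), "author"));
        for (Map<String, String> hit : result.hits) {
            System.out.println(hit.get("title"));
        }
        System.out.println(result.facetCounts);
    }
}
```

The trade-off from the list above shows up directly: callers never import a Lucene class, but even this small surface already re-implements sorting and facet counting that Lucene provides out of the box.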

I'd be interested in your POV on this. My main goal is that people outside of the rdf.cris package
never have to learn any specialised API, while still taking advantage of all the IR features
of any search engine.

_Stephane


On October 3, 2013 at 1:59:07 PM, Tommaso Teofili (tommaso.teofili@gmail.com) wrote:

Hi Stephane,  

I don't have much time now, but I just wanted to let you know that IMHO your  
list of goals / tasks sounds completely reasonable. In case you need it, I  
may be able to give some help over the next weeks.  

Regards,  
Tommaso  


2013/10/2 Stephane Gamard <stephane@gamard.net>  

> Hi Team,  
>  
> My name's Stephane and I am currently participating in the Fusepool FP7  
> project. Within this project we are using stanbol & clerezza as key  
> architectural components. Coming from a pure FullText search and  
> Information Retrieval background I find myself in somewhat of a new  
> territory.  
>  
> But within that new territory there is a link to my area of expertise:  
> Lucene/Solr in the rdf.cris package. This package turns out to be crucial  
> for our project and I would gladly participate and contribute my knowledge  
> as a Lucene and Solr developer. So here, in a nutshell, is a list of "small  
> contributions" to start with:  
>  
> - Abstraction Refactoring  
> Currently CRIS is using Lucene as its FT engine, but we might want to  
> eventually go to Solr (or elasticsearch for XYZ reasons). The first step would  
> be to remove all Lucene dependencies from the rdf.cris package and push the  
> implementation into the rdf.cris.lucene package  
>  
> - Lucene 4.x Branch  
> There have been a large number of changes since the 2.x and 3.x branches of  
> Lucene. I'd propose a small refactor and overhaul of the rdf.cris.lucene  
> package to take advantage of Lucene's new features (Facets, SearcherManager,  
> …)  
>  
> - Solr Implementation  
> In line with "in production" I strongly believe clerezza's CRIS component  
> should be able to leverage established services without having to manage  
> their scalability. That goes for FullText Search most obviously. The idea  
> is to be able to use a remote Solr Server (Solr since it comes with the  
> whole pseudo-rest servicing on top of Lucene).  
>  
> - Fine Grained Search  
> As a logical evolution from the points above, it would be then perfect if  
> clerezza's fulltext search capabilities could benefit from all the features  
> of Lucene/Solr. I am especially thinking about:  
> -- Field/Analyzer specialisation (we don't compare authors, dates and text  
> in the same way in Lucene/Solr)  
> -- Boosting (For IR, the title of a document usually yields more important  
> information than its footnotes)  
> -- Advanced facets (things like date-range facets and pivot facets (called 2nd  
> level facets in fusepool))  
> -- Geolocalised searches (big thing in Lucene/Solr 4.x branch… would  
> eventually be a nice to have)  
>  
> I will execute this work over the next few weeks/months as part of the  
> fusepool project, but most of all I would be pleased and interested to  
> finally get a top-notch implementation of a cross rdf-text solution. Very  
> much looking forward to your feedback and hopefully support ;)  
>  
> PS: whoever initiated the GraphIndexer implementation did an excellent  
> job! Will hopefully follow in his footsteps!  
>  
> Cheers,  
>  
> _Stephane  
>  
>