clerezza-dev mailing list archives

From Stephane Gamard <steph...@gamard.net>
Subject Search in rdf.cris
Date Wed, 02 Oct 2013 14:04:27 GMT
Hi Team, 

My name's Stephane and I am currently participating in the Fusepool FP7 project. Within this
project we are using Stanbol and Clerezza as key architectural components. Coming from a
pure full-text search and Information Retrieval background, I find myself in somewhat new
territory.

But within that new territory there is a link to my area of expertise: Lucene/Solr in the
rdf.cris package. This package turns out to be crucial for our project, and I would gladly
participate and contribute my knowledge as a Lucene and Solr developer. So here, in a nutshell,
is a list of "small contributions" to start with:

- Abstraction Refactoring
Currently CRIS uses Lucene as its full-text engine, but we might eventually want to move to Solr
(or Elasticsearch, for XYZ reasons). The first step would be to remove all Lucene dependencies
from the rdf.cris package and push the implementation into an rdf.cris.lucene package.
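
To make the idea concrete, here is a minimal sketch of what an engine-neutral contract could
look like. Every name in it (FullTextEngine, InMemoryEngine, index, search) is hypothetical and
not part of the current rdf.cris API; the in-memory backend only stands in for a future
rdf.cris.lucene or Solr implementation so the sketch runs without any external library.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical engine-neutral contract; none of these names exist in rdf.cris today.
interface FullTextEngine {
    // Store the textual fields of one resource under its URI.
    void index(String resourceUri, Map<String, String> fields);

    // Return the URIs of resources whose given field contains the term.
    List<String> search(String field, String term);
}

// Stand-in backend (assumption) so the sketch runs without Lucene or Solr.
final class InMemoryEngine implements FullTextEngine {
    private final Map<String, Map<String, String>> docs = new LinkedHashMap<>();

    @Override
    public void index(String resourceUri, Map<String, String> fields) {
        docs.put(resourceUri, new LinkedHashMap<>(fields));
    }

    @Override
    public List<String> search(String field, String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> entry : docs.entrySet()) {
            String value = entry.getValue().get(field);
            // Case-insensitive substring match; a real engine would analyze and tokenize.
            if (value != null && value.toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }
}
```

With something like this in place, the code in rdf.cris would depend only on the interface, and
the choice of Lucene, Solr or Elasticsearch would become a matter of which implementation is
wired in.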

- Lucene 4.x Branch
There have been a large number of changes since the 2.x and 3.x branches of Lucene. I'd propose a
small refactor and overhaul of the rdf.cris.lucene package to take advantage of Lucene's new
features (Facets, SearcherManager, …)

- Solr Implementation
In line with "in production" use, I strongly believe Clerezza's CRIS component should be able to
leverage established services without having to manage their scalability. That goes most
obviously for full-text search. The idea is to be able to use a remote Solr server (Solr, since
it comes with a whole pseudo-REST service layer on top of Lucene).
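
As a rough illustration of what talking to a remote Solr server over HTTP involves, the sketch
below builds a request URL for Solr's standard select handler with one field facet. The class
name SolrQueryUrl is hypothetical, and the base URL, core name and field names passed to it are
assumptions; the parameters q, facet, facet.field and wt are regular Solr request parameters.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper: builds a select URL for a remote Solr core.
final class SolrQueryUrl {
    private SolrQueryUrl() {
    }

    static String build(String baseUrl, String core, String query, String facetField) {
        return baseUrl + "/" + core + "/select"
                + "?q=" + encode(query)
                + "&facet=true"
                + "&facet.field=" + encode(facetField)
                + "&wt=json";
    }

    // Percent-encode a parameter value so characters such as ':' survive the query string.
    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }
}
```

For example, SolrQueryUrl.build("http://localhost:8983/solr", "cris", "title:search", "author")
produces a URL that any HTTP client can fetch; the host and the "cris" core are placeholders.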

- Fine Grained Search
As a logical evolution of the points above, it would then be perfect if Clerezza's full-text
search capabilities could benefit from all the features of Lucene/Solr. I am especially thinking
about:
-- Field/Analyzer specialisation (we don't compare authors, dates and text in the same way
in Lucene/Solr)
-- Boosting (for IR, the title of a document usually yields more important information than
its footnotes)
-- Advanced facets (things like date-range facets and pivot facets, called 2nd level facets in
Fusepool)
-- Geolocalised searches (a big thing in the Lucene/Solr 4.x branch… would eventually be a
nice-to-have)
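
The boosting point above can be sketched with a toy field-weighted scorer. This is not Lucene's
scoring formula, only an illustration of why a hit in the title should count for more than a
hit in a footnote; the class BoostedScorer and the weights used with it are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Toy field-weighted scorer (assumption for illustration; NOT Lucene's scoring).
final class BoostedScorer {
    private final Map<String, Double> boosts = new LinkedHashMap<>();

    // Register a weight for a field; fields without a weight default to 1.0.
    BoostedScorer boost(String field, double weight) {
        boosts.put(field, weight);
        return this;
    }

    // Score = sum over fields of (field boost * occurrences of the term in that field).
    double score(Map<String, String> doc, String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        double total = 0.0;
        for (Map.Entry<String, String> entry : doc.entrySet()) {
            double weight = boosts.getOrDefault(entry.getKey(), 1.0);
            int count = 0;
            // Naive tokenization on non-word characters; a real analyzer is per-field.
            for (String token : entry.getValue().toLowerCase(Locale.ROOT).split("\\W+")) {
                if (token.equals(needle)) {
                    count++;
                }
            }
            total += weight * count;
        }
        return total;
    }
}
```

With a title boost of 3.0 and a footnote boost of 0.5, a document matching once in its title
outranks one matching once in a footnote, which is the behaviour the boosting item asks for.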

I will carry out this work over the next few weeks/months as part of the Fusepool project, but
most of all I would be pleased and interested to finally get a top-notch implementation of
a cross RDF/text solution. Very much looking forward to your feedback and, hopefully, support
;)

PS: whoever initiated the GraphIndexer implementation did an excellent job! I will hopefully
follow in his footsteps!

Cheers, 

_Stephane