clerezza-dev mailing list archives

From Tommaso Teofili <tommaso.teof...@gmail.com>
Subject Re: Search in rdf.cris
Date Mon, 07 Oct 2013 06:59:30 GMT
Hi Stephane,

sorry for the late response.

2013/10/3 Stephane Gamard <stephane@gamard.net>

> Thank you Tommaso,
>
> I might need help or at the very least simple pointers and debates over
> certain principles and guidelines.
>
> First one being: the choice to either abstract everything related to
> search (such as sort fields, queries, filters and facets) or to use the
> native Lucene objects. Small overview of pros and cons (for the rdf.cris
> package, not the implementation packages).
>

yes, that's usually one of the biggest challenges when search is not part
of the core architecture. I tend to prefer the more abstract way of doing
things, aiming for APIs that are as generic yet flexible as possible. At
the same time, having a number of use cases and implementation features
that one wants to leverage can be a good driver for designing such APIs.
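
To illustrate what "generic yet flexible" could mean here, below is a
minimal sketch of engine-neutral sort and facet descriptors plus one
possible backend adapter. All class and method names (SortSpec,
SearchRequest, SolrParamTranslator) are hypothetical and not part of
rdf.cris. The adapter only renders the standard Solr request parameters
q, sort, facet and facet.field as plain strings, without URL encoding; a
Lucene adapter would instead map the same descriptors onto Lucene objects.

```java
import java.util.ArrayList;
import java.util.List;

/** Engine-neutral sort instruction (hypothetical name, not part of rdf.cris). */
final class SortSpec {
    final String field;
    final boolean ascending;

    SortSpec(String field, boolean ascending) {
        this.field = field;
        this.ascending = ascending;
    }
}

/** Engine-neutral search request: query text, optional sorts, facet fields. */
final class SearchRequest {
    final String queryText;
    final List<SortSpec> sorts = new ArrayList<>();
    final List<String> facetFields = new ArrayList<>();

    SearchRequest(String queryText) {
        this.queryText = queryText;
    }

    SearchRequest sortBy(String field, boolean ascending) {
        sorts.add(new SortSpec(field, ascending));
        return this;
    }

    SearchRequest facetOn(String field) {
        facetFields.add(field);
        return this;
    }
}

/** One possible adapter: renders a request as Solr request parameters. */
final class SolrParamTranslator {
    // Assumption: values are emitted raw; a real client would URL-encode them.
    static List<String> toParams(SearchRequest request) {
        List<String> params = new ArrayList<>();
        params.add("q=" + request.queryText);
        if (!request.sorts.isEmpty()) {
            StringBuilder sort = new StringBuilder();
            for (SortSpec s : request.sorts) {
                if (sort.length() > 0) {
                    sort.append(',');
                }
                sort.append(s.field).append(' ').append(s.ascending ? "asc" : "desc");
            }
            params.add("sort=" + sort);
        }
        if (!request.facetFields.isEmpty()) {
            params.add("facet=true");
            for (String f : request.facetFields) {
                params.add("facet.field=" + f);
            }
        }
        return params;
    }
}
```

Callers outside the search package would only ever build a SearchRequest;
which adapter consumes it stays an implementation detail.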


>
> *Native Lucene*
> + Objects already exist and are well implemented (SortField, Facet, …)
> - Bound to Lucene semantics (fairly easy to use, but certain impl
> providers will have to rewrite using Lucene translation… in case someone
> wants to make a "Fast" or GSA impl for clerezza). Note that Lucene, Solr
> and Elastic can fairly easily work with native Lucene objects
> +/- Should put all searchability logic into helper classes so as not to
> force external packages to talk "Lucene"
>
> *Abstracted Classes*
> - A LOT of re-coding of concepts that are straightforward in Lucene
> + No Lucene dependencies and no need for helper classes
> + Not bound to any impl; a rewrite for a possible Solr, GSA, Fast, … impl
> will not require basic knowledge of Lucene.
>
> I'd be interested in your POV on this. My main goal is for people outside
> of the rdf.cris package never to have to learn any specialised API while
> still taking advantage of all the IR features of any search engine.
>
>
I think this last requirement points towards a more abstract design.
Maybe a good compromise to start with would be to sketch an API, implement
a couple of use cases with Lucene, enhance the API, and iterate a few
times until we're satisfied with it.
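
As a starting point for that kind of iteration, a first pass could be as
small as the sketch below: one engine-neutral interface and a throwaway
in-memory implementation to exercise it before a Lucene-backed one is
written. The names ResourceFinder and InMemoryResourceFinder are
hypothetical and not taken from rdf.cris, and the substring matching is
only a stand-in for real analysis and scoring.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical engine-neutral contract; callers never see Lucene types. */
interface ResourceFinder {
    /** Associates free text with a resource URI. */
    void index(String resourceUri, String text);

    /** Returns the URIs of resources whose text contains the term. */
    List<String> find(String term);
}

/** Throwaway implementation used only to try out the API shape. */
final class InMemoryResourceFinder implements ResourceFinder {
    // Insertion-ordered so results come back in indexing order.
    private final Map<String, String> texts = new LinkedHashMap<>();

    @Override
    public void index(String resourceUri, String text) {
        texts.put(resourceUri, text.toLowerCase(Locale.ROOT));
    }

    @Override
    public List<String> find(String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> entry : texts.entrySet()) {
            // Case-insensitive substring match; no tokenisation or ranking.
            if (entry.getValue().contains(needle)) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }
}
```

Each later iteration (sorting, facets, boosting) would then extend the
interface first and only afterwards the Lucene implementation behind it.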

My 2 cents,
Tommaso


> _Stephane
>
>
> On October 3, 2013 at 1:59:07 PM, Tommaso Teofili (
> tommaso.teofili@gmail.com) wrote:
>
> Hi Stephane,
>
> I don't have much time now but I just wanted to let you know that IMHO
> your list of goals / tasks sounds completely reasonable; in case you need
> it, I may be able to give some help over the next few weeks.
>
> Regards,
> Tommaso
>
>
> 2013/10/2 Stephane Gamard <stephane@gamard.net>
>
> > Hi Team,
> >
> > My name's Stephane and I am currently participating in the Fusepool FP7
> > project. Within this project we are using Stanbol & Clerezza as key
> > architectural components. Coming from a pure full-text search and
> > Information Retrieval background, I find myself in somewhat new
> > territory.
> >
> > But within that new territory there is a link to my area of expertise:
> > Lucene/Solr in the rdf.cris package. This package turns out to be
> > crucial for our project and I would gladly participate and contribute
> > my knowledge as a Lucene and Solr developer. So here, in a nutshell, is
> > a list of "small contributions" to start with:
> >
> > - Abstraction Refactoring
> > Currently CRIS is using Lucene as its FT engine, but we might want to
> > eventually go to Solr (or elasticsearch for XYZ reasons). The first step
> > would be to remove all Lucene dependencies from the rdf.cris package and
> > push the implementation into the rdf.cris.lucene package.
> >
> > - Lucene 4.x Branch
> > There have been a large number of changes since the 2.x and 3.x
> > branches of Lucene. I'd propose a small refactor and overhaul of the
> > rdf.cris.lucene package to take advantage of Lucene's new features
> > (Facets, SearcherManager, …)
> >
> > - Solr Implementation
> > In line with "in production", I strongly believe clerezza's CRIS
> > component should be able to leverage established services without
> > having to manage their scalability. That goes for full-text search most
> > obviously. The idea is to be able to use a remote Solr server (Solr
> > since it comes with the whole pseudo-REST servicing on top of Lucene).
> >
> > - Fine Grained Search
> > As a logical evolution from the points above, it would then be perfect
> > if clerezza's full-text search capabilities could benefit from all the
> > features of Lucene/Solr. I am especially thinking about:
> > -- Field/Analyzer specialisation (we don't compare authors, dates and
> > text in the same way in Lucene/Solr)
> > -- Boosting (for IR, the title of a document usually yields more
> > important information than its footnotes)
> > -- Advanced facets (things like date-range facets and pivot facets,
> > called 2nd level facets in fusepool)
> > -- Geolocalised searches (big thing in the Lucene/Solr 4.x branch… would
> > eventually be a nice-to-have)
> >
> > I will carry out this work over the next few weeks/months as part of the
> > fusepool project, but most of all I would be pleased and interested to
> > finally get a top-notch implementation of a cross RDF-text solution. Very
> > much looking forward to your feedback and hopefully support ;)
> >
> > PS: whoever initiated the GraphIndexer implementation did an excellent
> > job! Will hopefully follow in his footsteps!
> >
> > Cheers,
> >
> > _Stephane
> >
> >
>
>
