jackrabbit-oak-issues mailing list archives

From "Vikas Saurabh (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (OAK-3336) Abstract a full text index implementation to be extended by Lucene and Solr
Date Thu, 05 Apr 2018 03:53:00 GMT

    [ https://issues.apache.org/jira/browse/OAK-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16426448#comment-16426448 ]

Vikas Saurabh commented on OAK-3336:

These would most likely call for separate issues/tasks, but it would be useful to record
what we ([~teofili], [~tmueller] and I) discussed off-list in a brainstorming session:
h4. Index definitions
* can likely be common for most parts
* analyzer – probably specific to lucene??
** even if some providers (say Solr or ES) allow different index definitions to use different
analyzers, the concept might not be generic enough for the oak-search module
* tika – common
* aggregates – common
* property definitions – common, except below??
** Suggestion
** Spellcheck
** Facet
** Excerpt
** Function index - probably common, as function indexes essentially just provide the value
to be indexed

h4. Editor
* When to index – most likely common as values affecting state change should be independent
of index provider in play
* What to index – most likely common except for the following??
** Spellcheck
** Suggestion
** Facet
** Excerpt
** Custom Field provider – common
*** needs to be made independent of Lucene Fields though
*** how to deprecate current SPI?
* How to index – has to be custom for each index provider
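
The split above (common "when" and "what", custom "how") could be sketched roughly as below. This is only an illustration of the idea, not existing Oak code: {{FulltextEditorSketch}}, {{IndexFieldSketch}} and {{InMemoryEditorSketch}} are hypothetical names, and node state is reduced to plain string maps.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical provider-neutral field: a plain name/value pair, no Lucene types.
final class IndexFieldSketch {
    final String name;
    final String value;

    IndexFieldSketch(String name, String value) {
        this.name = name;
        this.value = value;
    }
}

// Common part of the editor: decides when and what to index.
abstract class FulltextEditorSketch {
    private final List<String> indexedProperties;

    FulltextEditorSketch(List<String> indexedProperties) {
        this.indexedProperties = indexedProperties;
    }

    // Returns true if the node was (re)indexed, false if the change was irrelevant.
    final boolean nodeChanged(String path, Map<String, String> before, Map<String, String> after) {
        boolean relevantChange = false;
        List<IndexFieldSketch> fields = new ArrayList<>();
        for (String name : indexedProperties) {
            String newValue = after.get(name);
            // "When to index": only if an indexed property changed.
            if (!Objects.equals(before.get(name), newValue)) {
                relevantChange = true;
            }
            // "What to index": current values of the indexed properties.
            if (newValue != null) {
                fields.add(new IndexFieldSketch(name, newValue));
            }
        }
        if (!relevantChange) {
            return false;
        }
        write(path, fields); // "How to index": provider-specific.
        return true;
    }

    protected abstract void write(String path, List<IndexFieldSketch> fields);
}

// Stand-in provider that keeps documents in a map instead of a real index.
final class InMemoryEditorSketch extends FulltextEditorSketch {
    final Map<String, List<IndexFieldSketch>> documents = new LinkedHashMap<>();

    InMemoryEditorSketch(List<String> indexedProperties) {
        super(indexedProperties);
    }

    @Override
    protected void write(String path, List<IndexFieldSketch> fields) {
        documents.put(path, fields);
    }
}
```

A Lucene, Solr or ES editor would then only override {{write}}, while the change detection and field selection stay in one place.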

h4. Sync indexing
should be common, as its storage is node state based. Sync-indexed data doesn't go into the
centrally indexed async data

h4. NRT
* should be common similar to sync indexing above
* BUT would require oak-search to use lucene (which might be debatable)

h4. CoR/CoW or counterparts
* custom on need basis
* most likely relevant only for lucene indexes but utilities could be useful to support different
lucene versions (is that a goal??)

h4. Query
* Index selection – can I answer this query
** common
** this would be part of the planner, which checks the index definition to see if the index can answer
a given query
* Cost estimation
** needs to be custom as it's highly tied to "how a given indexer indexes data" AND "how costly
would it be to get a good fast estimate"
** some parts might be common like
*** how many unique values does a given constraint have
*** what's the worst case result count (maybe backed by node counter) in case the concrete implementation
can't get that information in a fast manner
* Custom query terms provider – can be common (similar to Custom field provider)
** needs to be made independent of Lucene Query
** how to deprecate current SPI?
* Low level query – needs to be custom
** But, maybe, we can utilize the current form of LucenePropertyIndex and translate the LuceneQuery
AST to the underlying engine's query
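
For the cost estimation part, the "custom fast estimate with a common worst-case fallback" idea might look something like the sketch below. All names here ({{CostEstimatorSketch}} and its subclasses) are made up for illustration, and the worst-case count is just passed in as a number where a real implementation might consult the node counter.

```java
import java.util.Map;
import java.util.OptionalLong;

// Common part: falls back to a worst-case count when the provider has no fast estimate.
abstract class CostEstimatorSketch {
    private final long worstCaseCount; // assumption: supplied by something like a node counter

    CostEstimatorSketch(long worstCaseCount) {
        this.worstCaseCount = worstCaseCount;
    }

    // Provider-specific: empty if a fast estimate is not available.
    protected abstract OptionalLong fastEstimate(String property, String value);

    final long estimate(String property, String value) {
        return fastEstimate(property, value).orElse(worstCaseCount);
    }
}

// Hypothetical provider that has cheap per-term statistics for some constraints.
final class StatsBackedEstimatorSketch extends CostEstimatorSketch {
    private final Map<String, Long> counts; // keyed by "property=value"

    StatsBackedEstimatorSketch(long worstCaseCount, Map<String, Long> counts) {
        super(worstCaseCount);
        this.counts = counts;
    }

    @Override
    protected OptionalLong fastEstimate(String property, String value) {
        Long count = counts.get(property + "=" + value);
        return count == null ? OptionalLong.empty() : OptionalLong.of(count);
    }
}

// Hypothetical provider that cannot estimate cheaply at all.
final class NoStatsEstimatorSketch extends CostEstimatorSketch {
    NoStatsEstimatorSketch(long worstCaseCount) {
        super(worstCaseCount);
    }

    @Override
    protected OptionalLong fastEstimate(String property, String value) {
        return OptionalLong.empty();
    }
}
```

The planner would only ever call {{estimate}}, so it would not need to know whether the number came from the provider or from the fallback.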

h4. Text extraction + tika configuration - common
* similar to function indexes, this is about generating data to index
* should we allow for some implementations that might be interested in doing their own extractions?
** maybe with a caution that "external text extraction is out of control - so expectation
of extraction feature parity is implicitly undefined"

h4. Tests
* Lucene tests have pretty decent coverage
* Since the idea is to abstract as much as possible, any test that queries
and verifies results should be parametrized over all index providers (all = lucene, solr, ES,
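
A rough sketch of what "the same query-and-verify test against every provider" could mean is below. The interface and both providers are hypothetical in-memory stand-ins, not Oak classes; in the actual test suite this would more likely be done with something like JUnit's Parameterized runner rather than a hand-written loop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical minimal contract every index provider would have to satisfy.
interface SearchProviderSketch {
    String name();
    void index(String path, String text);
    List<String> query(String term);
}

// Stand-in provider 1: stores the text and scans it at query time.
final class ScanningProviderSketch implements SearchProviderSketch {
    private final Map<String, String> texts = new LinkedHashMap<>();

    public String name() { return "scan"; }

    public void index(String path, String text) {
        texts.put(path, text.toLowerCase(Locale.ROOT));
    }

    public List<String> query(String term) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> e : texts.entrySet()) {
            if (Arrays.asList(e.getValue().split("\\s+")).contains(term.toLowerCase(Locale.ROOT))) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }
}

// Stand-in provider 2: builds an inverted index at index time.
final class InvertedProviderSketch implements SearchProviderSketch {
    private final Map<String, List<String>> postings = new LinkedHashMap<>();

    public String name() { return "inverted"; }

    public void index(String path, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            List<String> paths = postings.computeIfAbsent(token, k -> new ArrayList<>());
            if (!paths.contains(path)) {
                paths.add(path);
            }
        }
    }

    public List<String> query(String term) {
        return new ArrayList<>(postings.getOrDefault(term.toLowerCase(Locale.ROOT), List.of()));
    }
}

// The shared test body: runs identical index/query/verify steps on each provider.
final class ProviderContractSketch {
    static List<String> failures(List<SearchProviderSketch> providers) {
        List<String> failed = new ArrayList<>();
        for (SearchProviderSketch p : providers) {
            p.index("/a", "Hello world");
            p.index("/b", "goodbye world");
            boolean ok = p.query("hello").equals(List.of("/a"))
                    && p.query("world").equals(List.of("/a", "/b"));
            if (!ok) {
                failed.add(p.name());
            }
        }
        return failed;
    }
}
```

Calling {{ProviderContractSketch.failures}} with one instance of each provider returns the names of the providers that disagree with the expected results, so a new provider gets the shared coverage just by being added to the list.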

> Abstract a full text index implementation to be extended by Lucene and Solr
> ---------------------------------------------------------------------------
>                 Key: OAK-3336
>                 URL: https://issues.apache.org/jira/browse/OAK-3336
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: lucene, query, solr
>            Reporter: Tommaso Teofili
>            Assignee: Tommaso Teofili
>            Priority: Major
>             Fix For: 1.10
> Current Lucene and Solr indexes implement quite a no. of features according to their
specific APIs, design and implementation. However in the long run, while differences in APIs
and implementations will / can of course stay, the difference in design can make it hard to
keep those features on par.
> It'd be therefore nice to make it possible to abstract as much of design and implementation
bits as possible in an abstract full text implementation which Lucene and Solr would extend
according to their specifics.
> An example advantage of this is that index time aggregation will be implemented only
once and therefore any bugfixes and improvements in that area will be done in the abstract
implementation rather than having to do that in two places.

This message was sent by Atlassian JIRA
