pinot-commits mailing list archives

From GitBox <>
Subject [GitHub] [pinot] siddharthteotia commented on issue #7395: Support for Native Text Indexing in Pinot
Date Fri, 17 Sep 2021 00:17:46 GMT

siddharthteotia commented on issue #7395:

   I followed up with @atris in the Slack channel to clarify a few additional things. Copying
here for reference and visibility.
   Can we all confirm the following? I am sorry to have asked this a couple of times in
different threads in the doc, but the doc still indicates some sort of migration (_Note
that till completion of phase 4, we will be maintaining the existing text indices within
Pinot_), so I just want to make sure:
   - Existing Lucene text index functionality offered via TEXT_MATCH will continue to work
as is and is essentially untouched by this work
   - Both indexes can co-exist and we are not removing the Lucene dependency?
   - Upon segment reload, an existing Lucene index can potentially be converted to the new format
(if need be). However, if someone wishes to do this, how will the query syntax used in TEXT_MATCH
for the Lucene-based index remain compatible with the native FST index (which I believe will follow
SQL LIKE semantics)? I am guessing users will have to change their queries if they wish to migrate?
   - For the native FST index, the plan is to eventually support all kinds of searches:
phrase, term, regex, fuzzy, etc. For example, phrase search needs position info, which I
am not sure comes for free as part of the FST. Regardless, all of that is the end state,
and comprehensive text search functionality will be available through this native index?
     - This is important for us because eventually (and this is a big eventual for us :) )
we might want to migrate our production Li users from the Lucene text index to the native FST index
if performance is better. I can't promise that will happen, as it will certainly be a lot
of work (hence seeking confirmation that we are not removing anything). Our production users
use a lot of phrase queries.
   - General question: are you planning to make this functionality available via both LIKE
and TEXT_MATCH, or do you want to keep it separate and just use LIKE? TEXT_MATCH could also be
overloaded, as long as the user docs clearly indicate that it can be used for both the native
and Lucene text indexes.
   - Request on code: since FST is like a black box (for me, except for whatever I learned
from the paper and online presentations), can you please make sure that the code is sufficiently
documented and explains the algorithm as and when needed? Until now, we were just relying on
Lucene committers, but now we will have to maintain this ourselves. This will also make review easier.
   @atris's response:
   - Yes, Lucene Indices and TEXT_MATCH will be completely untouched and unaffected by this
   - No, we are not removing the dependency and both indices can coexist, oblivious of each
other.
   - Here is the interesting one. Native FST can support all queries that Lucene does. However,
since our indices do not store some metadata (such as positional index) that Lucene Indices
do, we will have to implement custom operators on top of native FST. However, syntactically,
native FST shall pose no challenges in that implementation. If there are specific operators
outside of the four planned currently (regexp, like, phrase and fuzzy) that will be needed
for users to migrate, I will be more than happy to support.
   - Yes, in the end state, comprehensive text search will be natively available.
   - I was actually not planning to overload TEXT_MATCH since it basically supports Lucene
syntax, but rather have custom functions for phrase, fuzzy and regexp, and let the LIKE operator
deal with the rest. However, there is no reason why we can't go down that route.
   - I completely agree. I have tried to document the code as elaborately as possible and
have also written supporting documents (e.g. on the regexp compilation process). If more is
needed on specific areas, I will gladly write more :)
   Based on the above clarifications, I am OK with proceeding.
   @amrishlal, @jackjlli please feel free to add any additional discussion notes.
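
To make the point about positional metadata concrete, here is a minimal sketch of why phrase search needs per-term positions while plain term search does not. It is not Pinot's or Lucene's implementation and does not use an FST; the function names (build_positional_index, term_search, phrase_search) and the whitespace tokenizer are assumptions made purely for illustration.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Hypothetical helper: map each term to {doc_id: [positions of the term]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in enumerate(docs):
        # Naive whitespace tokenization; a real analyzer would do much more.
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index


def term_search(index, term):
    """Term search only needs to know which documents contain the term."""
    return sorted(index.get(term.lower(), {}))


def phrase_search(index, phrase):
    """Phrase search also needs positions: the terms must appear consecutively."""
    terms = phrase.lower().split()
    if not terms:
        return []
    postings = [index.get(t, {}) for t in terms]
    # Candidate documents must contain every term of the phrase.
    candidates = set(postings[0])
    for p in postings[1:]:
        candidates &= set(p)
    hits = []
    for doc_id in sorted(candidates):
        following = [set(p[doc_id]) for p in postings[1:]]
        # A match starts at `start` if term i+1 sits at position start + i + 1.
        if any(
            all(start + i + 1 in s for i, s in enumerate(following))
            for start in postings[0][doc_id]
        ):
            hits.append(doc_id)
    return hits
```

With only document IDs per term, a query for the phrase "text index" could not tell a document containing "text index" apart from one containing "text native index"; the stored positions are what make that distinction possible.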

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.