pinot-commits mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] [pinot] atris commented on issue #7395: Support for Native Text Indexing in Pinot
Date Tue, 14 Sep 2021 09:15:05 GMT

atris commented on issue #7395:
URL: https://github.com/apache/pinot/issues/7395#issuecomment-918968144


   Thanks for reviewing the document, @siddharthteotia!
   
   Here are my thoughts:
   
   Current text search infrastructure: Today, we simply build sidecar Lucene indices
and expose a UDF that lets users specify Lucene queries. IMO, this component should
ideally live outside of Pinot, since it has no real connection to Pinot itself.
   
   So, the eventual goal is to move text search onto native Pinot indices and dictionaries,
and to follow SQL standard syntax (the LIKE operator) as much as possible.
   
   Now, coming to the FST itself. There are three reasons why a native FST makes sense:
   
   1. Flexibility and Control -- Lucene is a full-fledged search library. It is built for
generic text search use cases and includes capabilities such as ranked retrieval, norm
storage and impact filtering, to name a few. None of these are relevant to us, since we
do not perform ranking. As I mentioned before, if we are building our text search
capabilities on top of Pinot data structures, then pulling in Lucene just for the FST
is overkill, and it also prevents us from making changes we may want later. Lucene's
FST is a generic engine, not optimized for our use cases (only dictionary IDs as output symbols,
primary query load being prefix and suffix matches from the LIKE operator). Other improvements
may or may not come later, but if we do not move to our own native implementation, we rule
out any such improvements.
   
   2. Ability to perform Pinot-specific optimizations -- As stated in the point above, with
Lucene's FST it is not possible for us to make targeted changes or enhancements. For example,
it should be possible to short-circuit the evaluation of regular expressions that end with a
match-all and have only a short prefix before it, thus accelerating a common use case of the
LIKE operator.
   
   3. Realtime Capabilities -- Lucene builds the FST during segment flush, which forces us to
flush frequently. This also keeps us from doing real-time text search, which is a limitation.
With a native FST implementation, we should be able to explore this path.
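
   To illustrate the short circuit described in point 2, here is a minimal Java sketch. It
is not Pinot's or Lucene's implementation: the class `PrefixShortCircuit` and its methods are
hypothetical names, and it assumes a sorted String dictionary in which a value's dictionary ID
is simply its array index. Under those assumptions, a regex of the form `<literal>.*` (what a
`LIKE 'abc%'` predicate would typically translate to) can be answered with two binary searches
instead of evaluating an automaton.

```java
/**
 * Hypothetical sketch: answer "literal prefix followed by match-all" regexes
 * directly against a sorted dictionary. Not taken from Pinot or Lucene.
 */
final class PrefixShortCircuit {
    private static final String REGEX_META = "\\.[]{}()*+?^$|";

    /** Returns the literal prefix if regex looks like "<literal>.*", else null. */
    static String literalPrefix(String regex) {
        if (!regex.endsWith(".*")) {
            return null;
        }
        String head = regex.substring(0, regex.length() - 2);
        for (int i = 0; i < head.length(); i++) {
            // Any metacharacter means this is not a plain prefix; fall back
            // to the general regex path in that case.
            if (REGEX_META.indexOf(head.charAt(i)) >= 0) {
                return null;
            }
        }
        return head;
    }

    /**
     * Returns {start, end}: the half-open range [start, end) of dictionary IDs
     * whose values start with the prefix. Assumes sortedDict is sorted by
     * String.compareTo and that the ID of a value is its index.
     */
    static int[] prefixRange(String[] sortedDict, String prefix) {
        // First index whose value is >= prefix.
        int lo = 0;
        int hi = sortedDict.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sortedDict[mid].compareTo(prefix) < 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        int start = lo;

        // First index at or after start whose value no longer has the prefix.
        // Values sharing a prefix are contiguous in a sorted array, so this
        // predicate is monotone and binary search applies.
        hi = sortedDict.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sortedDict[mid].startsWith(prefix)) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return new int[] {start, lo};
    }
}
```

   A caller would first try `literalPrefix(regex)` and, only when it returns null, fall back
to full FST/regex evaluation. The resulting ID range could then be fed to whatever structure
maps dictionary IDs to documents.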
   
   Regarding TEXT_MATCH, while it is my dearest wish to deprecate the module, I understand
that some users may wish to keep using it. As highlighted, both indices can coexist, with no
mandate to migrate from one to the other.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


