cassandra-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jason Rutherglen <jason.rutherg...@gmail.com>
Subject Re: [DISCUSS] CEP-7 Storage Attached Index
Date Fri, 21 Aug 2020 17:00:37 GMT
> About space efficiency, one of the biggest drawback of SASI was the huge
space required for index structure when using CONTAINS logic because of the
decomposition of text columns into n-grams. Will SAI suffer from the same
issue in future iterations ?

SAI does not have specific ngram support atm, though that may be added
with tokenizers.  Ngrams do indeed grow the index, that's a user
decision for faster queries or more disk space.

On Tue, Aug 18, 2020 at 6:05 AM DuyHai Doan <doanduyhai@gmail.com> wrote:
>
> Thank you Zhao Yang for starting this topic
>
> After reading the short design doc, I have a few questions
>
> 1) SASI was pretty inefficient indexing wide partitions because the index
> structure only retains the partition token, not the clustering colums. As
> per design doc SAI has row id mapping to partition offset, can we hope that
> indexing wide partition will be more efficient with SAI ? One detail that
> worries me is that in the beggining of the design doc, it is said that the
> matching rows are post filtered while scanning the partition. Can you
> confirm or infirm that SAI is efficient with wide partitions and provides
> the partition offsets to the matching rows ?
>
> 2) About space efficiency, one of the biggest drawback of SASI was the huge
> space required for index structure when using CONTAINS logic because of the
> decomposition of text columns into n-grams. Will SAI suffer from the same
> issue in future iterations ? I'm anticipating a bit
>
> 3) If I'm querying using SAI and providing complete partition key, will it
> be more efficient than querying without partition key. In other words, does
> SAI provide any optimisation when partition key is specified ?
>
> Regards
>
> Duy Hai DOAN
>
> Le mar. 18 août 2020 à 11:39, Mick Semb Wever <mck@apache.org> a écrit :
>
> > >
> > > We are looking forward to the community's feedback and suggestions.
> > >
> >
> >
> > What comes immediately to mind is testing requirements. It has been
> > mentioned already that the project's testability and QA guidelines are
> > inadequate to successfully introduce new features and refactorings to the
> > codebase. During the 4.0 beta phase this was intended to be addressed, i.e.
> > defining more specific QA guidelines for 4.0-rc. This would be an important
> > step towards QA guidelines for all changes and CEPs post-4.0.
> >
> > Questions from me
> >  - How will this be tested, how will its QA status and lifecycle be
> > defined? (per above)
> >  - With existing C* code needing to be changed, what is the proposed plan
> > for making those changes ensuring maintained QA, e.g. is there separate QA
> > cycles planned for altering the SPI before adding a new SPI implementation?
> >  - Despite being out of scope, it would be nice to have some idea from the
> > CEP author of when users might still choose afresh 2i or SASI over SAI,
> >  - Who fills the roles involved? Who are the contributors in this DataStax
> > team? Who is the shepherd? Are there other stakeholders willing to be
> > involved?
> >  - Is there a preference to use gdoc instead of the project's wiki, and
> > why? (the CEP process suggest a wiki page, and feedback on why another
> > approach is considered better helps evolve the CEP process itself)
> >
> > cheers,
> > Mick
> >

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@cassandra.apache.org
For additional commands, e-mail: dev-help@cassandra.apache.org


Mime
View raw message