cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Benjamin Roth <benjamin.r...@jaumo.com>
Subject Re: Efficient model for a sorting
Date Tue, 04 Oct 2016 12:29:33 GMT
Thanks guys!

Good to know, that my approach is basically right, but I will check that
lucene indices by time.

2016-10-04 14:22 GMT+02:00 DuyHai Doan <doanduyhai@gmail.com>:

> "What scatter/gather? "
>
> http://www.slideshare.net/doanduyhai/sasi-cassandra-on-
> the-full-text-search-ride-voxxed-daybelgrade-2016/23
>
> "If you partition your data by user_id then you query only 1 shard to get
> sorted by time visitors for a user"
>
> Exact, but in this case, you're using a 2nd index only for sorting right ?
> For SASI it's not even possible. Maybe it can work with Statrio Lucene impl
>
> On Tue, Oct 4, 2016 at 2:15 PM, Dorian Hoxha <dorian.hoxha@gmail.com>
> wrote:
>
>> @DuyHai
>>
>> What scatter/gather? If you partition your data by user_id then you query
>> only 1 shard to get sorted by time visitors for a user.
>>
>> On Tue, Oct 4, 2016 at 2:09 PM, DuyHai Doan <doanduyhai@gmail.com> wrote:
>>
>>> MV is right now your best choice for this kind of sorting behavior.
>>>
>>> Secondary index (whatever the impl, SASI or Lucene) has a cost of
>>> scatter-gather if your cluster scale out. With MV you're at least
>>> guaranteed to hit a single node everytime
>>>
>>> On Tue, Oct 4, 2016 at 1:56 PM, Dorian Hoxha <dorian.hoxha@gmail.com>
>>> wrote:
>>>
>>>> Can you use the lucene index https://github.com/Stratio/cas
>>>> sandra-lucene-index ?
>>>>
>>>> On Tue, Oct 4, 2016 at 1:27 PM, Benjamin Roth <benjamin.roth@jaumo.com>
>>>> wrote:
>>>>
>>>>> Hi!
>>>>>
>>>>> I have a frequently used pattern which seems to be quite costly in CS.
>>>>> The pattern is always the same: I have a unique key and a sorting by
a
>>>>> different field.
>>>>>
>>>>> To give an example, here a real life example from our model:
>>>>> CREATE TABLE visits.visits_in (
>>>>>     user_id int,
>>>>>     user_id_visitor int,
>>>>>     created timestamp,
>>>>>     PRIMARY KEY (user_id, user_id_visitor)
>>>>> ) WITH CLUSTERING ORDER BY (user_id_visitor ASC)
>>>>>
>>>>> CREATE MATERIALIZED VIEW visits.visits_in_sorted_mv AS
>>>>>     SELECT user_id, created, user_id_visitor
>>>>>     FROM visits.visits_in
>>>>>     WHERE user_id IS NOT NULL AND created IS NOT NULL AND
>>>>> user_id_visitor IS NOT NULL
>>>>>     PRIMARY KEY (user_id, created, user_id_visitor)
>>>>>     WITH CLUSTERING ORDER BY (created DESC, user_id_visitor DESC)
>>>>>
>>>>> This simply represents people, that visited my profile sorted by date
>>>>> desc but only one entry per visitor.
>>>>> Other examples with the same pattern could be a whats-app-like inbox
>>>>> where the last message of each sender is shown by date desc. There are
lots
>>>>> of examples for that pattern.
>>>>>
>>>>> E.g. in redis I'd just use a sorted set, where the key could be like
>>>>> "visits_${user_id}", set key would be user_id_visitor and score
>>>>> the created timestamp.
>>>>> In MySQL I'd create the table with PK on user_id + user_id_visitor and
>>>>> create an index on user_id + created
>>>>> In C* i use an MV.
>>>>>
>>>>> Is this the most efficient approach?
>>>>> I also could have done this without an MV but then the situation in
>>>>> our app would be far more complex.
>>>>> I know that denormalization is a common pattern in C* and I don't
>>>>> hesitate to use it but in this case, it is not as simple as it's not
an
>>>>> append-only case but updates have to be handled correctly.
>>>>> If it is the first visit of a user, it's that simple, just 2 inserts
>>>>> in base table + denormalized table. But on a 2nd or 3rd visit, the 1st
or
>>>>> 2nd visit has to be deleted from the denormalized table before. Otherwise
>>>>> the visit would not be unique any more.
>>>>> Handling this case without an MV requires a lot more effort, I guess
>>>>> even more effort than just using an MV.
>>>>> 1. You need kind of app-side locking to deal with race conditions
>>>>> 2. Read before write is required to determine if an old record has to
>>>>> be deleted
>>>>> 3. At least CL_QUORUM is required to make sure that read before write
>>>>> is always consistent
>>>>> 4. Old record has to be deleted on update
>>>>>
>>>>> I guess, using an MV here is more efficient as there is less roundtrip
>>>>> between C* and the app to do all that and the MV does not require strong
>>>>> consistency as MV updates are always local and are eventual consistent
when
>>>>> the base table is. So there is also no need for distributed locks.
>>>>>
>>>>> I ask all this as we now use CS 3.x and have been advised that 3.x is
>>>>> still not considered really production ready.
>>>>>
>>>>> I guess in a perfect world, this wouldn't even require an MV if SASI
>>>>> indexes could be created over more than 1 column. E.g. in MySQL this
case
>>>>> is nothing else than a BTree. AFAIK SASI indices are also BTrees, filtering
>>>>> by Partition Key (which should to be done anyway) and sorting by a field
>>>>> would perfectly do the trick. But from the docs, this is not possible
right
>>>>> now.
>>>>>
>>>>> Does anyone see a better solution or are all my assumptions correct?
>>>>>
>>>>> --
>>>>> Benjamin Roth
>>>>> Prokurist
>>>>>
>>>>> Jaumo GmbH · www.jaumo.com
>>>>> Wehrstraße 46 · 73035 Göppingen · Germany
>>>>> Phone +49 7161 304880-6 · Fax +49 7161 304880-1
>>>>> AG Ulm · HRB 731058 · Managing Director: Jens Kammerer
>>>>>
>>>>
>>>>
>>>
>>
>


-- 
Benjamin Roth
Prokurist

Jaumo GmbH · www.jaumo.com
Wehrstraße 46 · 73035 Göppingen · Germany
Phone +49 7161 304880-6 · Fax +49 7161 304880-1
AG Ulm · HRB 731058 · Managing Director: Jens Kammerer

Mime
View raw message