cassandra-user mailing list archives

From DuyHai Doan <doanduy...@gmail.com>
Subject Re: Efficient model for a sorting
Date Tue, 04 Oct 2016 13:23:42 GMT
MV build is also async.

In the end it's MV maintenance cost vs Lucene index maintenance cost. I
don't have clear figures to judge which one is better. Maybe you should
benchmark it yourself. Anyway, I'd be interested in the results.

On Tue, Oct 4, 2016 at 3:05 PM, Dorian Hoxha <dorian.hoxha@gmail.com> wrote:

> With Lucene you can query+filter+sort on a single shard, so it should be
> better than MV/SASI. The index building is a little async though.
>
> On Tue, Oct 4, 2016 at 2:29 PM, Benjamin Roth <benjamin.roth@jaumo.com>
> wrote:
>
>> Thanks guys!
>>
>> Good to know that my approach is basically right, but I will check out
>> the Lucene indices in time.
>>
>> 2016-10-04 14:22 GMT+02:00 DuyHai Doan <doanduyhai@gmail.com>:
>>
>>> "What scatter/gather? "
>>>
>>> http://www.slideshare.net/doanduyhai/sasi-cassandra-on-the-full-text-search-ride-voxxed-daybelgrade-2016/23
>>>
>>> "If you partition your data by user_id then you query only 1 shard to
>>> get sorted by time visitors for a user"
>>>
>>> Exactly, but in this case you're using a secondary index only for
>>> sorting, right? With SASI that's not even possible. Maybe it can work
>>> with the Stratio Lucene implementation.
>>>
>>> On Tue, Oct 4, 2016 at 2:15 PM, Dorian Hoxha <dorian.hoxha@gmail.com>
>>> wrote:
>>>
>>>> @DuyHai
>>>>
>>>> What scatter/gather? If you partition your data by user_id then you
>>>> query only 1 shard to get sorted by time visitors for a user.
>>>>
>>>> On Tue, Oct 4, 2016 at 2:09 PM, DuyHai Doan <doanduyhai@gmail.com>
>>>> wrote:
>>>>
>>>>> MV is right now your best choice for this kind of sorting behavior.
>>>>>
>>>>> A secondary index (whatever the implementation, SASI or Lucene) has
>>>>> the cost of scatter-gather once your cluster scales out. With an MV
>>>>> you're at least guaranteed to hit a single node every time.
>>>>>
>>>>> On Tue, Oct 4, 2016 at 1:56 PM, Dorian Hoxha <dorian.hoxha@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Can you use the Lucene index
>>>>>> https://github.com/Stratio/cassandra-lucene-index ?
>>>>>>
>>>>>> On Tue, Oct 4, 2016 at 1:27 PM, Benjamin Roth <
>>>>>> benjamin.roth@jaumo.com> wrote:
>>>>>>
>>>>>>> Hi!
>>>>>>>
>>>>>>> I have a frequently used pattern which seems to be quite costly in
>>>>>>> CS. The pattern is always the same: I have a unique key and a
>>>>>>> sorting by a different field.
>>>>>>>
>>>>>>> To give an example, here is a real-life one from our model:
>>>>>>> -- Base table: one row per (user, visitor); a repeat visit
>>>>>>> -- overwrites created.
>>>>>>> CREATE TABLE visits.visits_in (
>>>>>>>     user_id int,
>>>>>>>     user_id_visitor int,
>>>>>>>     created timestamp,
>>>>>>>     PRIMARY KEY (user_id, user_id_visitor)
>>>>>>> ) WITH CLUSTERING ORDER BY (user_id_visitor ASC);
>>>>>>>
>>>>>>> -- View: same rows, clustered by created so they read newest first.
>>>>>>> CREATE MATERIALIZED VIEW visits.visits_in_sorted_mv AS
>>>>>>>     SELECT user_id, created, user_id_visitor
>>>>>>>     FROM visits.visits_in
>>>>>>>     WHERE user_id IS NOT NULL AND created IS NOT NULL
>>>>>>>       AND user_id_visitor IS NOT NULL
>>>>>>>     PRIMARY KEY (user_id, created, user_id_visitor)
>>>>>>>     WITH CLUSTERING ORDER BY (created DESC, user_id_visitor DESC);
>>>>>>>
>>>>>>> This simply represents the people who visited my profile, sorted by
>>>>>>> date desc, but with only one entry per visitor.
>>>>>>> Other examples of the same pattern could be a WhatsApp-like inbox
>>>>>>> where the last message of each sender is shown by date desc. There
>>>>>>> are lots of examples of that pattern.
>>>>>>>
>>>>>>> E.g. in Redis I'd just use a sorted set, where the key could be like
>>>>>>> "visits_${user_id}", the set member would be user_id_visitor and the
>>>>>>> score the created timestamp.
>>>>>>> In MySQL I'd create the table with a PK on user_id + user_id_visitor
>>>>>>> and create an index on user_id + created.
>>>>>>> In C* I use an MV.
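
To make the pattern concrete, here is a minimal in-memory sketch in Python of the sorted-set semantics described above: one entry per visitor, read back newest first. It does not talk to Redis or Cassandra; the class name `VisitorSet` and its methods are hypothetical and only model the behavior.

```python
from typing import Dict, List, Tuple


class VisitorSet:
    """Hypothetical in-memory stand-in for one sorted set such as visits_${user_id}."""

    def __init__(self) -> None:
        # member (user_id_visitor) -> score (created timestamp, e.g. epoch seconds)
        self._scores: Dict[int, int] = {}

    def add(self, user_id_visitor: int, created: int) -> None:
        # Re-adding an existing member overwrites its score, so each
        # visitor appears at most once.
        self._scores[user_id_visitor] = created

    def newest_first(self, limit: int) -> List[Tuple[int, int]]:
        # Sort by score descending, i.e. most recent visit first.
        ordered = sorted(self._scores.items(), key=lambda kv: kv[1], reverse=True)
        return ordered[:limit]


visits = VisitorSet()
visits.add(user_id_visitor=7, created=100)
visits.add(user_id_visitor=9, created=200)
visits.add(user_id_visitor=7, created=300)  # second visit replaces the first
result = visits.newest_first(limit=10)
```

The MV above is meant to give the same two properties: uniqueness comes from the base table's primary key, ordering from the view's clustering on created.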
>>>>>>>
>>>>>>> Is this the most efficient approach?
>>>>>>> I also could have done this without an MV, but then the situation in
>>>>>>> our app would be far more complex.
>>>>>>> I know that denormalization is a common pattern in C* and I don't
>>>>>>> hesitate to use it, but in this case it is not that simple, as it's
>>>>>>> not an append-only case: updates have to be handled correctly.
>>>>>>> If it is the first visit of a user, it's simple: just 2 inserts, in
>>>>>>> the base table + the denormalized table. But on a 2nd or 3rd visit,
>>>>>>> the 1st or 2nd visit has to be deleted from the denormalized table
>>>>>>> first. Otherwise the visit would not be unique any more.
>>>>>>> Handling this case without an MV requires a lot more effort, I guess
>>>>>>> even more effort than just using an MV.
>>>>>>> 1. You need some kind of app-side locking to deal with race conditions
>>>>>>> 2. Read before write is required to determine if an old record has
>>>>>>>    to be deleted
>>>>>>> 3. At least CL_QUORUM is required to make sure that read before
>>>>>>>    write is always consistent
>>>>>>> 4. Old record has to be deleted on update
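
The steps above can be sketched as follows. This is an in-memory Python model under stated assumptions, not driver code: the two containers stand in for the base table and a hand-maintained denormalized table, a `threading.Lock` stands in for the app-side locking of step 1, and the function names `record_visit` and `visitors_newest_first` are hypothetical. Step 3 (the consistency level) has no in-memory equivalent and is not modeled.

```python
import threading
from typing import Dict, List, Set, Tuple

# Stand-in for the base table: (user_id, user_id_visitor) -> created
base: Dict[Tuple[int, int], int] = {}
# Stand-in for the denormalized table: rows of (user_id, created, user_id_visitor)
by_time: Set[Tuple[int, int, int]] = set()
# Stand-in for app-side locking (step 1)
lock = threading.Lock()


def record_visit(user_id: int, user_id_visitor: int, created: int) -> None:
    with lock:
        # Step 2: read before write to find a previous visit, if any.
        previous = base.get((user_id, user_id_visitor))
        # Step 4: delete the old denormalized row so the visitor stays unique.
        if previous is not None:
            by_time.discard((user_id, previous, user_id_visitor))
        # Write the new state to both tables.
        base[(user_id, user_id_visitor)] = created
        by_time.add((user_id, created, user_id_visitor))


def visitors_newest_first(user_id: int) -> List[Tuple[int, int]]:
    # Collect this user's rows and order them by created, descending.
    rows = [(created, visitor) for (uid, created, visitor) in by_time if uid == user_id]
    return [(visitor, created) for created, visitor in sorted(rows, reverse=True)]


record_visit(1, 7, 100)
record_visit(1, 9, 200)
record_visit(1, 7, 300)  # second visit by 7: the row for created=100 is removed
latest = visitors_newest_first(1)
```

Dropping the `discard` call would leave two rows for visitor 7 in `by_time`, which is exactly the uniqueness problem described above.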
>>>>>>>
>>>>>>> I guess using an MV here is more efficient, as there are fewer
>>>>>>> roundtrips between C* and the app to do all that, and the MV does not
>>>>>>> require strong consistency, as MV updates are always local and are
>>>>>>> eventually consistent when the base table is. So there is also no
>>>>>>> need for distributed locks.
>>>>>>>
>>>>>>> I ask all this as we now use CS 3.x and have been advised that 3.x
>>>>>>> is still not considered really production ready.
>>>>>>>
>>>>>>> I guess in a perfect world this wouldn't even require an MV if SASI
>>>>>>> indexes could be created over more than 1 column. E.g. in MySQL this
>>>>>>> case is nothing else than a BTree. AFAIK SASI indices are also
>>>>>>> BTrees; filtering by partition key (which should be done anyway) and
>>>>>>> sorting by a field would perfectly do the trick. But from the docs,
>>>>>>> this is not possible right now.
>>>>>>>
>>>>>>> Does anyone see a better solution or are all my assumptions correct?
>>>>>>>
>>>>>>> --
>>>>>>> Benjamin Roth
>>>>>>> Prokurist
>>>>>>>
>>>>>>> Jaumo GmbH · www.jaumo.com
>>>>>>> Wehrstraße 46 · 73035 Göppingen · Germany
>>>>>>> Phone +49 7161 304880-6 · Fax +49 7161 304880-1
>>>>>>> AG Ulm · HRB 731058 · Managing Director: Jens Kammerer
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>>
>
>
