cassandra-user mailing list archives

From Carlos Alonso <i...@mrcalonso.com>
Subject Re: Schema questions for data structures with recently-modified access patterns
Date Wed, 22 Jul 2015 09:04:42 GMT
Ah, so your access pattern is to get all documents modified on a
particular date, right?

Then I think your approach is good. To avoid duplication, why not make
docId the first clustering column and remove the last_modified field from
the key?
That way, your primary key would be PRIMARY KEY(date, docId), so all docs
modified on the same day sit together in the same partition, and two
updates to the same document on the same date won't generate two rows,
because the primary key would be exactly the same.
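
As a rough sketch of what I mean (assumptions on my side: I keep
last_modified as a regular, non-key column so the time is still
available, and the date string and UUID below are made-up sample values):

```cql
-- Index table keyed by day, then by document.
-- last_modified is a regular column here (an assumption), not part of the key.
CREATE TABLE doc_by_last_modified (
    date TEXT,
    docId UUID,
    last_modified TIMEUUID,
    PRIMARY KEY ((date), docId)
);

-- Two updates to the same document on the same day share the primary key
-- (date, docId), so the second INSERT overwrites the first instead of
-- adding a second row.
INSERT INTO doc_by_last_modified (date, docId, last_modified)
VALUES ('2015-07-22', 123e4567-e89b-12d3-a456-426655440000, now());

INSERT INTO doc_by_last_modified (date, docId, last_modified)
VALUES ('2015-07-22', 123e4567-e89b-12d3-a456-426655440000, now());

-- All documents modified on a given date, one row per document.
SELECT docId, last_modified
FROM doc_by_last_modified
WHERE date = '2015-07-22';
```

Keep in mind that rows inside a partition are then ordered by docId rather
than by modification time, and a document modified on two different days
still shows up once in each day's partition.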

Does it make sense?

Carlos Alonso | Software Engineer | @calonso <https://twitter.com/calonso>

On 21 July 2015 at 18:37, Robert Wille <rwille@fold3.com> wrote:

>  The time series doesn’t provide the access pattern I’m looking for. No
> way to query recently-modified documents.
>
>  On Jul 21, 2015, at 9:13 AM, Carlos Alonso <info@mrcalonso.com> wrote:
>
>  Hi Robert,
>
>  What about modelling it as a time series?
>
>  CREATE TABLE document (
>   docId UUID,
>   doc TEXT,
>   last_modified TIMESTAMP,
>   PRIMARY KEY(docId, last_modified)
> ) WITH CLUSTERING ORDER BY (last_modified DESC);
>
>  This way, the latest modification will always be the first record
> in the row, so accessing it should be as easy as:
>
>  SELECT * FROM document WHERE docId = <the docId> LIMIT 1;
>
>  And if you experience disk space issues due to very long rows, you
> can always expire old ones using TTL or with a batch job. Tombstones will
> never be a problem in this case because, due to the specified clustering
> order, the latest modification will always be the first record in the row.
>
>  Hope it helps.
>
>  Carlos Alonso | Software Engineer | @calonso
> <https://twitter.com/calonso>
>
> On 21 July 2015 at 05:59, Robert Wille <rwille@fold3.com> wrote:
>
>> Data structures that have a recently-modified access pattern seem to be a
>> poor fit for Cassandra. I’m wondering if any of you smart guys can provide
>> suggestions.
>>
>> For the sake of discussion, let's assume I have the following tables:
>>
>> CREATE TABLE document (
>>         docId UUID,
>>         doc TEXT,
>>         last_modified TIMEUUID,
>>         PRIMARY KEY ((docid))
>> );
>>
>> CREATE TABLE doc_by_last_modified (
>>         date TEXT,
>>         last_modified TIMEUUID,
>>         docId UUID,
>>         PRIMARY KEY ((date), last_modified)
>> );
>>
>> When I update a document, I retrieve its last_modified time, delete the
>> current record from doc_by_last_modified, and add a new one. Unfortunately,
>> if you’d like each document to appear at most once in the
>> doc_by_last_modified table, then this doesn’t work so well.
>>
>> Documents can get into the doc_by_last_modified table multiple times if
>> there is concurrent access, or if there is a consistency issue.
>>
>> Any thoughts out there on how to efficiently provide recently-modified
>> access to a table? This problem exists for many types of data structures,
>> not just recently-modified. Any ordered data structure that can be
>> dynamically reordered suffers from the same problems. As I’ve been doing
>> schema design, this pattern keeps recurring. A nice way to address this
>> problem has lots of applications.
>>
>> Thanks in advance for your thoughts
>>
>> Robert
>>
>>
>
>
