cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Robert Wille <rwi...@fold3.com>
Subject Re: Schema questions for data structures with recently-modified access patterns
Date Tue, 21 Jul 2015 16:37:07 GMT
The time series doesn’t provide the access pattern I’m looking for. No way to query recently-modified
documents.

On Jul 21, 2015, at 9:13 AM, Carlos Alonso <info@mrcalonso.com<mailto:info@mrcalonso.com>>
wrote:

Hi Robert,

What about modelling it as a time serie?

CREATE TABLE document (
  docId UUID,
  doc TEXT,
  last_modified TIMESTAMP
  PRIMARY KEY(docId, last_modified)
) WITH CLUSTERING ORDER BY (last_modified DESC);

This way, you the lastest modification will always be the first record in the row, therefore
accessing it should be as easy as:

SELECT * FROM document WHERE docId == <the docId> LIMIT 1;

And, if you experience diskspace issues due to very long rows, then you can always expire
old ones using TTL or on a batch job. Tombstones will never be a problem in this case as,
due to the specified clustering order, the latest modification will always be first record
in the row.

Hope it helps.

Carlos Alonso | Software Engineer | @calonso<https://twitter.com/calonso>

On 21 July 2015 at 05:59, Robert Wille <rwille@fold3.com<mailto:rwille@fold3.com>>
wrote:
Data structures that have a recently-modified access pattern seem to be a poor fit for Cassandra.
I’m wondering if any of you smart guys can provide suggestions.

For the sake of discussion, lets assume I have the following tables:

CREATE TABLE document (
        docId UUID,
        doc TEXT,
        last_modified TIMEUUID,
        PRIMARY KEY ((docid))
)

CREATE TABLE doc_by_last_modified (
        date TEXT,
        last_modified TIMEUUID,
        docId UUID,
        PRIMARY KEY ((date), last_modified)
)

When I update a document, I retrieve its last_modified time, delete the current record from
doc_by_last_modified, and add a new one. Unfortunately, if you’d like each document to appear
at most once in the doc_by_last_modified table, then this doesn’t work so well.

Documents can get into the doc_by_last_modified table multiple times if there is concurrent
access, or if there is a consistency issue.

Any thoughts out there on how to efficiently provide recently-modified access to a table?
This problem exists for many types of data structures, not just recently-modified. Any ordered
data structure that can be dynamically reordered suffers from the same problems. As I’ve
been doing schema design, this pattern keeps recurring. A nice way to address this problem
has lots of applications.

Thanks in advance for your thoughts

Robert




Mime
View raw message