cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Robert Wille <rwi...@fold3.com>
Subject Re: Schema questions for data structures with recently-modified access patterns
Date Tue, 21 Jul 2015 16:46:54 GMT
If last_modified is a clustering column, it needs a partitioning column, which is what date
is for (although I should have named it day, and I also forgot to add the order by desc clause).
This is essentially what I came up with. Still not liking how easy it is to get duplicates.

On Jul 21, 2015, at 9:31 AM, Jack Krupansky <jack.krupansky@gmail.com<mailto:jack.krupansky@gmail.com>>
wrote:

Keep the original document base table, but then the query table should have the PK as last_modified,
docId, with last_modified descending, so that a query can get the n most recently modified
documents.

Yes, you still need to manually delete the old entry for the document in the query table if
duplicates are a problem for you.

Yeah, a TTL would be good if you don't care about documents modified a month or a week ago.

-- Jack Krupansky

On Tue, Jul 21, 2015 at 11:13 AM, Carlos Alonso <info@mrcalonso.com<mailto:info@mrcalonso.com>>
wrote:
Hi Robert,

What about modelling it as a time serie?

CREATE TABLE document (
  docId UUID,
  doc TEXT,
  last_modified TIMESTAMP
  PRIMARY KEY(docId, last_modified)
) WITH CLUSTERING ORDER BY (last_modified DESC);

This way, you the lastest modification will always be the first record in the row, therefore
accessing it should be as easy as:

SELECT * FROM document WHERE docId == <the docId> LIMIT 1;

And, if you experience diskspace issues due to very long rows, then you can always expire
old ones using TTL or on a batch job. Tombstones will never be a problem in this case as,
due to the specified clustering order, the latest modification will always be first record
in the row.

Hope it helps.

Carlos Alonso | Software Engineer | @calonso<https://twitter.com/calonso>

On 21 July 2015 at 05:59, Robert Wille <rwille@fold3.com<mailto:rwille@fold3.com>>
wrote:
Data structures that have a recently-modified access pattern seem to be a poor fit for Cassandra.
I’m wondering if any of you smart guys can provide suggestions.

For the sake of discussion, lets assume I have the following tables:

CREATE TABLE document (
        docId UUID,
        doc TEXT,
        last_modified TIMEUUID,
        PRIMARY KEY ((docid))
)

CREATE TABLE doc_by_last_modified (
        date TEXT,
        last_modified TIMEUUID,
        docId UUID,
        PRIMARY KEY ((date), last_modified)
)

When I update a document, I retrieve its last_modified time, delete the current record from
doc_by_last_modified, and add a new one. Unfortunately, if you’d like each document to appear
at most once in the doc_by_last_modified table, then this doesn’t work so well.

Documents can get into the doc_by_last_modified table multiple times if there is concurrent
access, or if there is a consistency issue.

Any thoughts out there on how to efficiently provide recently-modified access to a table?
This problem exists for many types of data structures, not just recently-modified. Any ordered
data structure that can be dynamically reordered suffers from the same problems. As I’ve
been doing schema design, this pattern keeps recurring. A nice way to address this problem
has lots of applications.

Thanks in advance for your thoughts

Robert





Mime
View raw message