Return-Path: X-Original-To: apmail-cassandra-user-archive@www.apache.org Delivered-To: apmail-cassandra-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id E3D77184A9 for ; Wed, 22 Jul 2015 09:05:19 +0000 (UTC) Received: (qmail 48399 invoked by uid 500); 22 Jul 2015 09:05:15 -0000 Delivered-To: apmail-cassandra-user-archive@cassandra.apache.org Received: (qmail 48365 invoked by uid 500); 22 Jul 2015 09:05:15 -0000 Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@cassandra.apache.org Delivered-To: mailing list user@cassandra.apache.org Received: (qmail 48355 invoked by uid 99); 22 Jul 2015 09:05:15 -0000 Received: from Unknown (HELO spamd2-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 22 Jul 2015 09:05:15 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd2-us-west.apache.org (ASF Mail Server at spamd2-us-west.apache.org) with ESMTP id 3F1361A75E2 for ; Wed, 22 Jul 2015 09:05:15 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd2-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 3 X-Spam-Level: *** X-Spam-Status: No, score=3 tagged_above=-999 required=6.31 tests=[HTML_MESSAGE=3, SPF_PASS=-0.001, URIBL_BLOCKED=0.001] autolearn=disabled Received: from mx1-eu-west.apache.org ([10.40.0.8]) by localhost (spamd2-us-west.apache.org [10.40.0.9]) (amavisd-new, port 10024) with ESMTP id RE6VzObixd48 for ; Wed, 22 Jul 2015 09:05:05 +0000 (UTC) Received: from setentaysiete103.serverov.com (setentaysiete103.serverov.com [188.93.77.103]) by mx1-eu-west.apache.org (ASF Mail Server at mx1-eu-west.apache.org) with ESMTPS id 8002E20EFC for ; Wed, 22 Jul 2015 09:05:04 +0000 (UTC) X-No-Relay: not in my network Received: from mail-la0-f51.google.com (mail-la0-f51.google.com [209.85.215.51]) by setentaysiete103.serverov.com (Postfix) with ESMTPSA id C3BFA1966 for ; Wed, 22 Jul 2015 11:04:42 +0200 (CEST) Received: by lagw2 with SMTP id w2so133886206lag.3 for ; Wed, 22 Jul 2015 02:04:42 -0700 (PDT) MIME-Version: 1.0 X-Received: by 10.112.56.139 with SMTP id a11mr1347870lbq.90.1437555882791; Wed, 22 Jul 2015 02:04:42 -0700 (PDT) Received: by 10.25.42.134 with HTTP; Wed, 22 Jul 2015 02:04:42 -0700 (PDT) In-Reply-To: <3EAD9346-7699-49AE-A7D0-7E6FA706C39E@fold3.com> References: <366A8B84-FB78-4DF4-B95B-B808219C0E3E@fold3.com> <3EAD9346-7699-49AE-A7D0-7E6FA706C39E@fold3.com> Date: Wed, 22 Jul 2015 11:04:42 +0200 Message-ID: Subject: Re: Schema questions for data structures with recently-modified access patterns From: Carlos Alonso To: user@cassandra.apache.org Content-Type: multipart/alternative; boundary=001a1133a352bd1b30051b730f4a --001a1133a352bd1b30051b730f4a Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Ah, so you your access pattern is to get all documents modified in a particular date, right? Then I think your approach is good, and to avoid duplication, why don't add the docId as the first clustering column and remove the last_modified field from it? That way, your primary key would be PRIMARY KEY(date, docId), making all docs modified in same day be together in the same partition, and on the other hand, two updates on the same date won't generate a two rows as the primary key would be exactly the same. Does it make sense? Carlos Alonso | Software Engineer | @calonso On 21 July 2015 at 18:37, Robert Wille wrote: > The time series doesn=E2=80=99t provide the access pattern I=E2=80=99m l= ooking for. No > way to query recently-modified documents. > > On Jul 21, 2015, at 9:13 AM, Carlos Alonso wrote: > > Hi Robert, > > What about modelling it as a time serie? > > CREATE TABLE document ( > docId UUID, > doc TEXT, > last_modified TIMESTAMP > PRIMARY KEY(docId, last_modified) > ) WITH CLUSTERING ORDER BY (last_modified DESC); > > This way, you the lastest modification will always be the first record > in the row, therefore accessing it should be as easy as: > > SELECT * FROM document WHERE docId =3D=3D LIMIT 1; > > And, if you experience diskspace issues due to very long rows, then you > can always expire old ones using TTL or on a batch job. Tombstones will > never be a problem in this case as, due to the specified clustering order= , > the latest modification will always be first record in the row. > > Hope it helps. > > Carlos Alonso | Software Engineer | @calonso > > > On 21 July 2015 at 05:59, Robert Wille wrote: > >> Data structures that have a recently-modified access pattern seem to be = a >> poor fit for Cassandra. I=E2=80=99m wondering if any of you smart guys c= an provide >> suggestions. >> >> For the sake of discussion, lets assume I have the following tables: >> >> CREATE TABLE document ( >> docId UUID, >> doc TEXT, >> last_modified TIMEUUID, >> PRIMARY KEY ((docid)) >> ) >> >> CREATE TABLE doc_by_last_modified ( >> date TEXT, >> last_modified TIMEUUID, >> docId UUID, >> PRIMARY KEY ((date), last_modified) >> ) >> >> When I update a document, I retrieve its last_modified time, delete the >> current record from doc_by_last_modified, and add a new one. Unfortunate= ly, >> if you=E2=80=99d like each document to appear at most once in the >> doc_by_last_modified table, then this doesn=E2=80=99t work so well. >> >> Documents can get into the doc_by_last_modified table multiple times if >> there is concurrent access, or if there is a consistency issue. >> >> Any thoughts out there on how to efficiently provide recently-modified >> access to a table? This problem exists for many types of data structures= , >> not just recently-modified. Any ordered data structure that can be >> dynamically reordered suffers from the same problems. As I=E2=80=99ve be= en doing >> schema design, this pattern keeps recurring. A nice way to address this >> problem has lots of applications. >> >> Thanks in advance for your thoughts >> >> Robert >> >> > > --001a1133a352bd1b30051b730f4a Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable
Ah, so you your access pattern is to get all documents mod= ified in a particular date, right?

Then I think your app= roach is good, and to avoid duplication, why don't add the docId as the= first clustering column and remove the last_modified field from it?
<= div>That way, your primary key would be PRIMARY KEY(date, docId), making al= l docs modified in same day be together in the same partition, and on the o= ther hand, two updates on the same date won't generate a two rows as th= e primary key would be exactly the same.

Does it m= ake sense?

Carlos Alonso | Software Engineer |=C2=A0@calonso

On 21 July 2015 at 18:37, Robert Wille <rwil= le@fold3.com> wrote:
The time series doesn=E2=80=99t provide the access pattern I=E2=80=99m look= ing for. No way to query recently-modified documents.

On Jul 21, 2015, at 9:13 AM, Carlos Alonso <info@mrcalonso.com> wrote:

Hi Robert,

What about modelling it as a time serie?

CREATE TABLE document (
=C2=A0 docId UUID,
=C2=A0 doc TEXT,
=C2=A0 last_modified TIMESTAMP
=C2=A0 PRIMARY KEY(docId, last_modified)
) WITH CLUSTERING ORDER BY (last_modified DESC);

This way, you the lastest modification will always be the first record= in the row, therefore accessing it should be as easy as:

SELECT * FROM document WHERE docId =3D=3D <the docId> LIMIT 1;

And, if you experience diskspace issues due to very long rows, then yo= u can always expire old ones using TTL or on a batch job. Tombstones will n= ever be a problem in this case as, due to the specified clustering order, t= he latest modification will always be first record in the row.

Hope it helps.

Carlos Alonso= | Software Engineer |=C2=A0@calonso

On 21 July 2015 at 05:59, Robert Wille <rwille@f= old3.com> wrote:
Data structures that have a recently-modified access pattern seem to be a p= oor fit for Cassandra. I=E2=80=99m wondering if any of you smart guys can p= rovide suggestions.

For the sake of discussion, lets assume I have the following tables:

CREATE TABLE document (
=C2=A0 =C2=A0 =C2=A0 =C2=A0 docId UUID,
=C2=A0 =C2=A0 =C2=A0 =C2=A0 doc TEXT,
=C2=A0 =C2=A0 =C2=A0 =C2=A0 last_modified TIMEUUID,
=C2=A0 =C2=A0 =C2=A0 =C2=A0 PRIMARY KEY ((docid))
)

CREATE TABLE doc_by_last_modified (
=C2=A0 =C2=A0 =C2=A0 =C2=A0 date TEXT,
=C2=A0 =C2=A0 =C2=A0 =C2=A0 last_modified TIMEUUID,
=C2=A0 =C2=A0 =C2=A0 =C2=A0 docId UUID,
=C2=A0 =C2=A0 =C2=A0 =C2=A0 PRIMARY KEY ((date), last_modified)
)

When I update a document, I retrieve its last_modified time, delete the cur= rent record from doc_by_last_modified, and add a new one. Unfortunately, if= you=E2=80=99d like each document to appear at most once in the doc_by_last= _modified table, then this doesn=E2=80=99t work so well.

Documents can get into the doc_by_last_modified table multiple times if the= re is concurrent access, or if there is a consistency issue.

Any thoughts out there on how to efficiently provide recently-modified acce= ss to a table? This problem exists for many types of data structures, not j= ust recently-modified. Any ordered data structure that can be dynamically r= eordered suffers from the same problems. As I=E2=80=99ve been doing schema design, this pattern keeps recurring. A = nice way to address this problem has lots of applications.

Thanks in advance for your thoughts

Robert




--001a1133a352bd1b30051b730f4a--