Return-Path: X-Original-To: apmail-cassandra-user-archive@www.apache.org Delivered-To: apmail-cassandra-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id AB28F18564 for ; Fri, 24 Jul 2015 03:49:16 +0000 (UTC) Received: (qmail 70864 invoked by uid 500); 24 Jul 2015 03:49:13 -0000 Delivered-To: apmail-cassandra-user-archive@cassandra.apache.org Received: (qmail 70824 invoked by uid 500); 24 Jul 2015 03:49:13 -0000 Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@cassandra.apache.org Delivered-To: mailing list user@cassandra.apache.org Received: (qmail 70811 invoked by uid 99); 24 Jul 2015 03:49:13 -0000 Received: from Unknown (HELO spamd3-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 24 Jul 2015 03:49:13 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd3-us-west.apache.org (ASF Mail Server at spamd3-us-west.apache.org) with ESMTP id AD0B818DF50 for ; Fri, 24 Jul 2015 03:49:12 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd3-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 2.9 X-Spam-Level: ** X-Spam-Status: No, score=2.9 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, HTML_MESSAGE=3, SPF_PASS=-0.001, URIBL_BLOCKED=0.001] autolearn=disabled Authentication-Results: spamd3-us-west.apache.org (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com Received: from mx1-us-west.apache.org ([10.40.0.8]) by localhost (spamd3-us-west.apache.org [10.40.0.10]) (amavisd-new, port 10024) with ESMTP id hGtt_1if-lXT for ; Fri, 24 Jul 2015 03:49:03 +0000 (UTC) Received: from mail-wi0-f169.google.com (mail-wi0-f169.google.com [209.85.212.169]) by mx1-us-west.apache.org (ASF Mail Server at mx1-us-west.apache.org) with ESMTPS id E010321048 for ; Fri, 24 Jul 2015 03:49:02 +0000 (UTC) Received: by wibxm9 with SMTP id xm9so10566308wib.0 for ; Thu, 23 Jul 2015 20:49:01 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type; bh=0nJwjO9/dtpUvTCtqmBEi1n9b9DJruxwNmZntOA1WiA=; b=Rnc6HU1Te67p8Yze+r0Te6aTaHelxab6B4ExngcbzVI9FgXf7J3TOIMBiUHKThCTr5 nARCmKeNH5eGxJy70ObbwJjJqN5YMj9D9pfrYI00Fz0SFdWD0/2x/SKS+cQh2kykyYhx d791rGfQi+ST+CFx8EftAHwBHCiWJpoNeFK6Q3kujwTLkReybQd/HF3iLQ2ELE26/bTL /t88upciRCeiUsV5FuyJBuDUy10Eicc7YBMMUeSqCUL34hq0554rnRi49WQU4M85Dl20 zNWieUyrAXEwuM2EY5SFLbIm/pCkHVIvYxsvlWiy+LggQs4d9Z4EnPUtv4dWu1UQnTEN 6iDw== MIME-Version: 1.0 X-Received: by 10.194.6.37 with SMTP id x5mr24637717wjx.73.1437709741595; Thu, 23 Jul 2015 20:49:01 -0700 (PDT) Received: by 10.27.46.2 with HTTP; Thu, 23 Jul 2015 20:49:01 -0700 (PDT) In-Reply-To: <10FDC9A8-06E8-48D0-86C1-45D7A73E5E26@fold3.com> References: <366A8B84-FB78-4DF4-B95B-B808219C0E3E@fold3.com> <3EAD9346-7699-49AE-A7D0-7E6FA706C39E@fold3.com> <10FDC9A8-06E8-48D0-86C1-45D7A73E5E26@fold3.com> Date: Thu, 23 Jul 2015 23:49:01 -0400 Message-ID: Subject: Re: Schema questions for data structures with recently-modified access patterns From: Jack Krupansky To: user@cassandra.apache.org Content-Type: multipart/alternative; boundary=047d7b5d9acd7020dd051b96e27f --047d7b5d9acd7020dd051b96e27f Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Concurrent update should not be problematic. Duplicate entries should not be created. If it appears to be, explain your apparent issue so we can see whether it is a real issue. But at least from all of the details you have disclosed so far, there does not appear to be any indication that this type of time series would be anything other than a good fit for Cassandra. Besides, the new materialized view feature of Cassandra 3.0 would make it an even easier fit. -- Jack Krupansky On Thu, Jul 23, 2015 at 6:30 PM, Robert Wille wrote: > I obviously worded my original email poorly. I guess that=E2=80=99s what= happens > when you post at the end of the day just before quitting. > > I want to get a list of documents, ordered from most-recently modified > to least-recently modified, with each document appearing exactly once. > > Jack, your schema does exactly that, and is essentially the same as mine > (with exception of my missing the DESC clause, and I have a partitioning > column and you only have clustering columns). > > The problem I have with my schema (or Jack=E2=80=99s) is that it is very= easy > for a document to get in the list multiple times. Concurrent updates to t= he > document, for example. Also, a consistency issue could cause the document > to appear in the list more than once. > > I think that Alec Collier=E2=80=99s comment is probably accurate, that t= his kind > of a pattern just isn=E2=80=99t a good fit for Cassandra. > > On Jul 23, 2015, at 1:54 PM, Jack Krupansky > wrote: > > Maybe you could explain in more detail what you mean by recently > modified documents, since that is precisely what I thought I suggested wi= th > descending ordering. > > -- Jack Krupansky > > On Thu, Jul 23, 2015 at 3:40 PM, Robert Wille wrote: > >> Carlos=E2=80=99 suggestion (nor yours) didn=E2=80=99t didn=E2=80=99t pro= vide a way to query >> recently-modified documents. >> >> His updated suggestion provides a way to get recently-modified >> documents, but not ordered. >> >> On Jul 22, 2015, at 4:19 PM, Jack Krupansky >> wrote: >> >> "No way to query recently-modified documents." >> >> I don't follow why you say that. I mean, that was the point of the data >> model suggestion I proposed. Maybe you could clarify. >> >> I also wanted to mention that the new materialized view feature of >> Cassandra 3.0 might handle this use case, including taking care of the >> delete, automatically. >> >> >> -- Jack Krupansky >> >> On Tue, Jul 21, 2015 at 12:37 PM, Robert Wille wrote: >> >>> The time series doesn=E2=80=99t provide the access pattern I=E2=80=99m = looking for. No >>> way to query recently-modified documents. >>> >>> On Jul 21, 2015, at 9:13 AM, Carlos Alonso wrote: >>> >>> Hi Robert, >>> >>> What about modelling it as a time serie? >>> >>> CREATE TABLE document ( >>> docId UUID, >>> doc TEXT, >>> last_modified TIMESTAMP >>> PRIMARY KEY(docId, last_modified) >>> ) WITH CLUSTERING ORDER BY (last_modified DESC); >>> >>> This way, you the lastest modification will always be the first record >>> in the row, therefore accessing it should be as easy as: >>> >>> SELECT * FROM document WHERE docId =3D=3D LIMIT 1; >>> >>> And, if you experience diskspace issues due to very long rows, then >>> you can always expire old ones using TTL or on a batch job. Tombstones = will >>> never be a problem in this case as, due to the specified clustering ord= er, >>> the latest modification will always be first record in the row. >>> >>> Hope it helps. >>> >>> Carlos Alonso | Software Engineer | @calonso >>> >>> >>> On 21 July 2015 at 05:59, Robert Wille wrote: >>> >>>> Data structures that have a recently-modified access pattern seem to b= e >>>> a poor fit for Cassandra. I=E2=80=99m wondering if any of you smart gu= ys can >>>> provide suggestions. >>>> >>>> For the sake of discussion, lets assume I have the following tables: >>>> >>>> CREATE TABLE document ( >>>> docId UUID, >>>> doc TEXT, >>>> last_modified TIMEUUID, >>>> PRIMARY KEY ((docid)) >>>> ) >>>> >>>> CREATE TABLE doc_by_last_modified ( >>>> date TEXT, >>>> last_modified TIMEUUID, >>>> docId UUID, >>>> PRIMARY KEY ((date), last_modified) >>>> ) >>>> >>>> When I update a document, I retrieve its last_modified time, delete th= e >>>> current record from doc_by_last_modified, and add a new one. Unfortuna= tely, >>>> if you=E2=80=99d like each document to appear at most once in the >>>> doc_by_last_modified table, then this doesn=E2=80=99t work so well. >>>> >>>> Documents can get into the doc_by_last_modified table multiple times i= f >>>> there is concurrent access, or if there is a consistency issue. >>>> >>>> Any thoughts out there on how to efficiently provide recently-modified >>>> access to a table? This problem exists for many types of data structur= es, >>>> not just recently-modified. Any ordered data structure that can be >>>> dynamically reordered suffers from the same problems. As I=E2=80=99ve = been doing >>>> schema design, this pattern keeps recurring. A nice way to address thi= s >>>> problem has lots of applications. >>>> >>>> Thanks in advance for your thoughts >>>> >>>> Robert >>>> >>>> >>> >>> >> >> > > --047d7b5d9acd7020dd051b96e27f Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable
Concurrent update should not be problematic. Duplicate ent= ries should not be created. If it appears to be, explain your apparent issu= e so we can see whether it is a real issue.

But at least= from all of the details you have disclosed so far, there does not appear t= o be any indication that this type of time series would be anything other t= han a good fit for Cassandra.

Besides, the new mat= erialized view feature of Cassandra 3.0 would make it an even easier fit.

-- Jack Krupansky

On Thu, Jul 23, 2015 at 6:30 PM, Robert Will= e <rwille@fold3.com> wrote:
I obviously worded my original email poorly. I guess that=E2=80=99s what ha= ppens when you post at the end of the day just before quitting.

I want to get a list of documents, ordered from most-recently modified= to least-recently modified, with each document appearing exactly once.

Jack, your schema does exactly that, and is essentially the same as mi= ne (with exception of my missing the DESC clause, and I have a partitioning= column and you only have clustering columns).

The problem I have with my schema (or Jack=E2=80=99s) is that it is ve= ry easy for a document to get in the list multiple times. Concurrent update= s to the document, for example. Also, a consistency issue could cause the d= ocument to appear in the list more than once.

I think that Alec Collier=E2=80=99s comment is probably accurate, that= this kind of a pattern just isn=E2=80=99t a good fit for Cassandra.
<= div>

On Jul 23, 2015, at 1:54 PM, Jack Krupansky <jack.krupansky@gmail.com> wr= ote:

Maybe you could explain in more detail what you mean by re= cently modified documents, since that is precisely what I thought I suggest= ed with descending ordering.

-- Jack Krupansky

On Thu, Jul 23, 2015 at 3:40 PM, Robert Wille <rwille@fold3.com<= /a>> wrote:
Carlos=E2=80=99 suggestion (nor yours) = didn=E2=80=99t didn=E2=80=99t provide a way to query recently-modified docu= ments.

His updated suggestion provides a way to get recently-modified documen= ts, but not ordered.


"No way = to query recently-modified documents."

I don't follow why you say that. I mean, that was the point of the= data model suggestion I proposed. Maybe you could clarify.

I also wanted to mention that the new materialized view feature of Cas= sandra 3.0 might handle this use case, including taking care of the delete,= automatically.


-- Jack Krupansky

On Tue, Jul 21, 2015 at 12:37 PM, Robert Wille <= span dir=3D"ltr"> <rwille@fold3.com<= /a>> wrote:
The time series doesn=E2=80=99t provide= the access pattern I=E2=80=99m looking for. No way to query recently-modif= ied documents.


Hi Robert,

What about modelling it as a time serie?

CREATE TABLE document (
=C2=A0 docId UUID,
=C2=A0 doc TEXT,
=C2=A0 last_modified TIMESTAMP
=C2=A0 PRIMARY KEY(docId, last_modified)
) WITH CLUSTERING ORDER BY (last_modified DESC);

This way, you the lastest modification will always be the first record= in the row, therefore accessing it should be as easy as:

SELECT * FROM document WHERE docId =3D=3D <the docId> LIMIT 1;

And, if you experience diskspace issues due to very long rows, then yo= u can always expire old ones using TTL or on a batch job. Tombstones will n= ever be a problem in this case as, due to the specified clustering order, t= he latest modification will always be first record in the row.

Hope it helps.

Carlos Alonso= | Software Engineer |=C2=A0@calonso

On 21 July 2015 at 05:59, Robert Wille <rwille@f= old3.com> wrote:
Data structures that have a recently-modified access pattern seem to be a p= oor fit for Cassandra. I=E2=80=99m wondering if any of you smart guys can p= rovide suggestions.

For the sake of discussion, lets assume I have the following tables:

CREATE TABLE document (
=C2=A0 =C2=A0 =C2=A0 =C2=A0 docId UUID,
=C2=A0 =C2=A0 =C2=A0 =C2=A0 doc TEXT,
=C2=A0 =C2=A0 =C2=A0 =C2=A0 last_modified TIMEUUID,
=C2=A0 =C2=A0 =C2=A0 =C2=A0 PRIMARY KEY ((docid))
)

CREATE TABLE doc_by_last_modified (
=C2=A0 =C2=A0 =C2=A0 =C2=A0 date TEXT,
=C2=A0 =C2=A0 =C2=A0 =C2=A0 last_modified TIMEUUID,
=C2=A0 =C2=A0 =C2=A0 =C2=A0 docId UUID,
=C2=A0 =C2=A0 =C2=A0 =C2=A0 PRIMARY KEY ((date), last_modified)
)

When I update a document, I retrieve its last_modified time, delete the cur= rent record from doc_by_last_modified, and add a new one. Unfortunately, if= you=E2=80=99d like each document to appear at most once in the doc_by_last= _modified table, then this doesn=E2=80=99t work so well.

Documents can get into the doc_by_last_modified table multiple times if the= re is concurrent access, or if there is a consistency issue.

Any thoughts out there on how to efficiently provide recently-modified acce= ss to a table? This problem exists for many types of data structures, not j= ust recently-modified. Any ordered data structure that can be dynamically r= eordered suffers from the same problems. As I=E2=80=99ve been doing schema design, this pattern keeps recurring. A = nice way to address this problem has lots of applications.

Thanks in advance for your thoughts

Robert








--047d7b5d9acd7020dd051b96e27f--