Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id B5F3E200D1A for ; Mon, 9 Oct 2017 20:56:42 +0200 (CEST) Received: by cust-asf.ponee.io (Postfix) id B44161609CE; Mon, 9 Oct 2017 18:56:42 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id 81B2F1609B8 for ; Mon, 9 Oct 2017 20:56:41 +0200 (CEST) Received: (qmail 89663 invoked by uid 500); 9 Oct 2017 18:56:39 -0000 Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@cassandra.apache.org Delivered-To: mailing list user@cassandra.apache.org Received: (qmail 89653 invoked by uid 99); 9 Oct 2017 18:56:39 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd2-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 09 Oct 2017 18:56:39 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd2-us-west.apache.org (ASF Mail Server at spamd2-us-west.apache.org) with ESMTP id E4B451A19DE for ; Mon, 9 Oct 2017 18:56:38 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd2-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: -0.191 X-Spam-Level: X-Spam-Status: No, score=-0.191 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, HTML_MESSAGE=2, RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_MSPIKE_H2=-2.8, RCVD_IN_SORBS_SPAM=0.5, SPF_PASS=-0.001, T_DKIM_INVALID=0.01] autolearn=disabled Authentication-Results: spamd2-us-west.apache.org (amavisd-new); dkim=neutral reason="invalid (public key: not available)" header.d=indeni.com Received: from mx1-lw-us.apache.org ([10.40.0.8]) by localhost (spamd2-us-west.apache.org [10.40.0.9]) (amavisd-new, port 10024) with ESMTP id WSrVGhV5WHlA for ; Mon, 9 Oct 2017 18:56:36 +0000 (UTC) Received: from mail-oi0-f44.google.com (mail-oi0-f44.google.com [209.85.218.44]) by mx1-lw-us.apache.org (ASF Mail Server at mx1-lw-us.apache.org) with ESMTPS id 9A8EC5FD42 for ; Mon, 9 Oct 2017 18:56:35 +0000 (UTC) Received: by mail-oi0-f44.google.com with SMTP id j126so17802381oib.8 for ; Mon, 09 Oct 2017 11:56:35 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=indeni.com; s=google; h=mime-version:in-reply-to:references:from:date:message-id:subject:to; bh=xeb9OebJNgBPNSHgOVV7vF5O1lQrU35FJT4rWq62Ivs=; b=gd0Oka72cSzthzPbLokdhPPUMaXOREy66laaun0928oXmgSjA92XqEtZJNhlVSAyiA 85F3oWTvN2v/tTIp3aBNuSohecpfH8jnvpVWr/WT4hkD3gpRfJbCQqOd7OYBCqOvSfeI fVtDJmUORJNdX5kNPa9ovK1x3X8xmJXNZTurM= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to; bh=xeb9OebJNgBPNSHgOVV7vF5O1lQrU35FJT4rWq62Ivs=; b=RtFRdDHClMil0Dt/1GMiFQ3PzHlX1Bv9Ows1b09DaAkXvu66v222PJGb3lEyrEepqq eiZ3Bdm2vt/3FK70POaBoykJ8vQzyS+o+2BC+FqdGb3fhVBBidzCCyj7M+kVddh+WLB7 ZpW6gWAiNHe+9Tg6uJ+Y9AyR2iaJUc1gXAX0MyEGAg3KDnLEEg6nIzlTYUiB6qoGeeGc UYPyEpXO4O6SvLHmP5G0+Ej5zLreSJxjXP6KNi26eyDDVmpoHDeWCl9fJiXKTW/WZM2C UYSLbe5r9nZ72sZL7r8rhdzJxtNRjYYK9/L4YUNgyuMWThJbnWdW34MiPlUlCx75DHEu vtYQ== X-Gm-Message-State: AMCzsaVEpgJHo78ezdnjfd+/4pzjAujdkSRGdrev2XyW55gWeiwfNHpO EfdDQpuWlgpxIMz13GHrbnuEAWFzvfRJ8IMqqxsLKPJ2 X-Google-Smtp-Source: AOwi7QB9y4oAXfrBvQh4hL6/9yj9fVhE5oWtQeXfsV2t9aToSOgHrx/u2E4X3g0UUoTPG9E3lgnYpDfTvn1PthsnWzU= X-Received: by 10.202.170.86 with SMTP id t83mr4545506oie.135.1507575394601; Mon, 09 Oct 2017 11:56:34 -0700 (PDT) MIME-Version: 1.0 Received: by 10.157.51.166 with HTTP; Mon, 9 Oct 2017 11:56:34 -0700 (PDT) Received: by 10.157.51.166 with HTTP; Mon, 9 Oct 2017 11:56:34 -0700 (PDT) In-Reply-To: References: From: Avi Levi Date: Mon, 9 Oct 2017 21:56:34 +0300 Message-ID: Subject: Re: Using materialized view or AllowFiltering which one is better ? To: user@cassandra.apache.org Content-Type: multipart/alternative; boundary="001a113cc55cde2fa5055b21bf8d" archived-at: Mon, 09 Oct 2017 18:56:42 -0000 --001a113cc55cde2fa5055b21bf8d Content-Type: text/plain; charset="UTF-8" Thanks Crisan . I understand what you're saying. But according to your suggestion I will have a record for every entry while I am interested only on the last entry . So the proposed solution is actually keeping much more data then needed . On Oct 9, 2017 8:40 PM, "Valentina Crisan" wrote: Allow filtering is almost never the answer, especially when you want to do a full table scan ( there might be some cases where the query is limited to a partition and allow filtering could be used). And you would like to run this query every minute - thus extremely good performance is required. Allow filtering basically brings locally in your coordinator the whole table content and performs local filtering of the data before answering your query. Performance wise is not recommended to use such an implementation. For a query running every minute you need to address it in one partition read (according to Cassandra data modeling rules) and that can be done with denormalization ( manually or materialized views). As far as I know and also from the discussions in this list MV should be used still with caution in production environments. Thus, the best option in my opinion is manual denormalization of data, building a table with partition key last_seen and clustering key username and adding/updating data accordingly. Furthermore last_seen I understand it's a value of any time/hour of day - you could consider building partitions per day: partition key = (last_seen, day), primary key = ((last_seen,day),username)). Valentina On Mon, Oct 9, 2017 at 1:13 PM, Avi Levi wrote: > Hi > > I have the following table: > > CREATE TABLE users ( > username text, > last_seen bigint, > PRIMARY KEY (username) > ); > > where* last_seen* is basically the writetime . Number of records in the > table is aprox 10 million. Insert is pretty much straightforward insert > into users (username, last_seen) VALUES ([username], now) > > I want to make some processing on users that were not seen for the past > XXX (where xxx can be hours/days ... ) by query the last_seen column > (this query runs every minute) e.g : > > select username from users where last_seen < (now - 1 day). > > I have two options as I see it: > > 1. use materialized view : > > CREATE MATERIALIZED VIEW users_last_seen AS > SELECT last_seen, username > FROM users > WHERE last_seen IS NOT NULL > PRIMARY KEY (last_seen, username); > > > and simply query: > > select username from users_last_seen where last_seen < (now - 1 day) > > 1. > > query the users table > > select username from users where last_seen < (now - 1 day) ALLOW > FILTERING > > which one is more efficient? any other options ? > > Any help will be greatly appreciated > > Best > > Avi > --001a113cc55cde2fa5055b21bf8d Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
Thanks Crisan .
I understand = what you're saying. But according to your suggestion I will have a reco= rd for every entry while I am interested only on the last entry . So the pr= oposed solution is actually keeping much more data then needed .

On Oct 9, 2= 017 8:40 PM, "Valentina Crisan" <valentina.crisan@gmail.com> wrote:
Allow filtering is almost= never the answer, especially when you want to do a full table scan ( there= might be some cases where the query is limited to a partition and allow fi= ltering could be used). And you would like to run this query every minute -= thus extremely good performance is required. Allow filtering basically bri= ngs locally in your coordinator the whole table content and performs local = filtering of the data before answering your query. Performance wise is not = recommended to use such an implementation.=C2=A0

For a query r= unning every minute you need to address it in one partition read (according= to Cassandra data modeling rules) and that can be done with denormalizatio= n ( manually or materialized views). As far as I know and also from the dis= cussions in this list MV should be used still with caution in production en= vironments. Thus, the best option in my opinion is manual denormalization o= f data, building a table with partition key last_seen and clustering key us= ername and adding/updating data accordingly. Furthermore last_seen I unders= tand it's a value of any time/hour of day - you could consider building= partitions per day: partition key=C2=A0 =3D (last_seen, day),=C2=A0 primar= y key =3D ((last_seen,day),username)).=C2=A0 =C2=A0 =C2=A0=C2=A0

Valentina=C2=A0 =C2=A0=C2=A0

=
On Mon, Oct 9, 2017 at 1:13 PM, Avi Levi <avi@inde= ni.com> wrote:

Hi=C2= =A0

I have the following table:

C=
REATE TABLE users (
    username text,
    last_seen bigint,
    PRIMARY KEY (username)
);

= where last_seen is basically the writetime . Number of records in = the table is aprox 10 million. Insert is pretty much straightforward=C2=A0<= /span>insert int= o users (username, last_seen) VALUES ([username], now)

I wa= nt to make some processing on users that were not seen for the past XXX (wh= ere xxx can be hours/days ... ) by query the=C2=A0last_seen=C2=A0column (this query = runs every minute) e.g :

select username from users where last_seen < (now - 1 da= y).

I have two options as I see it:

  1. use materialized view :
CREATE MATERIALIZED VIEW users_last_seen AS
SELECT last_seen, username
FROM users
WHERE last_seen IS NOT NULL
PRIMARY KEY (last_seen, username);

and simply query:

select username= from users_last_seen where last_seen < (now - 1 day)

  1. query the users table

    select username from users = where last_seen < (now - 1 day) ALLOW FILTERING

= which one is more efficient? any other options ?

Any help will be gr= eatly appreciated

Best

Avi=C2=A0



--001a113cc55cde2fa5055b21bf8d--