From: Alain RODRIGUEZ
Date: Fri, 4 Nov 2011 18:07:19 +0100
Subject: Modeling big data to allow filtering with a lot of distinct combinations of dimesions, in real time and with no latency
To: user@cassandra.apache.org

Hi all,

I started this thread in the phpCassa Google group, but I think it belongs here.

Here is my first post:

"I was wondering about a specific point of Cassandra modeling.

If I need to know the number of connections to my website for each browser, every hour, I can do:

Row key: $browser, column key: date('YmdH', $timestamp), value: counter.

I can increment this counter on every visit, and this should work. The point is that I want to be able to render results using many statistics as filters.

I mean, I will have information such as browser, browser version, screen resolution, OS, OS version, location... And I want to allow users to get data (number of views) and filter it as much as they want.

For example, if I want to know how many people visited my website with Safari, on Windows, and from New York, every hour, I can store:

Row key: $browser:$os:$localization, column key: date('YmdH', $timestamp), value: counter.

This can't be the best solution because, according to combinatorics, I will have to store n! counters to be able to store data for all filters. If I have 10 filters, I will increment 3,628,800 counters.

That's not a good solution, for sure. How am I supposed to model this to be able to read data with any filter I want?

Thanks,

Alain"
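
To make the scheme in that post concrete, here is a minimal sketch in Python rather than PHP. It is only an illustration: a plain in-memory dict stands in for a Cassandra counter column family, the names `counters`, `hour_bucket` and `record_visit` are made up, and the hour bucket is computed in UTC (PHP's date() would use the server's timezone).

```python
from collections import defaultdict
from datetime import datetime, timezone

# In-memory stand-in for a counter column family:
# counters[row_key][column_key] -> count
counters = defaultdict(lambda: defaultdict(int))


def hour_bucket(ts):
    """Column key with hour granularity, similar to date('YmdH', $timestamp)."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y%m%d%H")


def record_visit(browser, os_name, localization, ts):
    """Increment the per-browser row and the combined row for one visit."""
    column = hour_bucket(ts)
    counters[browser][column] += 1
    counters[f"{browser}:{os_name}:{localization}"][column] += 1


# Two visits that fall in the same hour (Unix timestamps, made-up data).
record_visit("safari", "windows", "new_york", 1320426439)
record_visit("safari", "windows", "new_york", 1320426500)
```

Reading "Safari, Windows, New York, per hour" is then a single row lookup, which is the appeal of the approach; the cost is one extra increment per row key written.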



And here is the first answer I got (thanks to Tyler Hobbs):

"Technically, the number of potential different counters would be the cardinality of each field multiplied together. (Since one of the fields holds a time, this number would continue to grow.) However, in practice you'll have far fewer than this number of counters, because not every possible combination of these will happen.

> That's not a good solution, for sure. How am I supposed to model
> this to be able to read data with any filter I want?

It's a reasonable solution if you want to be able to drill down and filter by any attribute. If you want to be able to filter based on all of these attributes, you have to store that information about every request in one way or another."
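
A small Python sketch of the point made in that answer, as I understand it: the product of the field cardinalities is only an upper bound, and rows exist only for combinations that actually occur. The cardinalities and the sample visits below are invented for illustration.

```python
from math import prod

# Hypothetical per-field cardinalities (made-up numbers).
cardinalities = {"browser": 5, "os": 4, "localization": 200}

# Upper bound on distinct row keys: cardinalities multiplied together.
upper_bound = prod(cardinalities.values())

# Made-up sample of visits: (browser, os, localization).
visits = [
    ("safari", "windows", "new_york"),
    ("safari", "windows", "new_york"),
    ("firefox", "linux", "paris"),
    ("chrome", "windows", "new_york"),
]

# Only combinations that were actually seen get a counter row.
observed_rows = {":".join(visit) for visit in visits}

print(upper_bound, len(observed_rows))
```

The time dimension is left out here; since it lives in the column key, each observed row simply gains one column per hour in which it is hit.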



I know it's a non-trivial problem, but I'm sure other people have faced it before me.

I'll let users filter however they want, choosing dimensions with checkboxes. They will be able to combine dimensions and ask for any combination.

So, with this solution, I will have to store every event n times, with n = the number of possible combinations.
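
A rough Python sketch of that write fan-out, under the assumption that filters are unordered checkbox selections. In that case one visit with n dimensions maps to one row key per non-empty subset of its dimensions, which is 2^n - 1 keys (7 for 3 dimensions, 1,023 for 10). The `row_keys` helper and the `name=value` key layout are hypothetical, not something phpCassa provides.

```python
from itertools import combinations


def row_keys(event):
    """Return one row key per non-empty subset of the event's dimensions.

    Dimension names are sorted so that a given subset always produces
    the same key, whatever order the checkboxes were ticked in.
    """
    dims = sorted(event)
    keys = []
    for size in range(1, len(dims) + 1):
        for subset in combinations(dims, size):
            keys.append("|".join(f"{d}={event[d]}" for d in subset))
    return keys


event = {"browser": "safari", "os": "windows", "localization": "new_york"}
keys = row_keys(event)
print(len(keys))  # one counter increment per key, for each visit
```

Each key would then get its hourly counter column incremented, so the number of writes per visit doubles with every dimension added.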

I saw this yesterday: http://t.co/EXL6yAO8 (thanks to Dave Gardner). This company seems to do something equivalent to the idea described in my first post.

Does anyone have experience to share with this kind of problem?

Thank you,

Alain
