incubator-cassandra-user mailing list archives

From Alain RODRIGUEZ <arodr...@gmail.com>
Subject Modeling big data to allow filtering with a lot of distinct combinations of dimensions, in real time and with no latency
Date Fri, 04 Nov 2011 17:07:19 GMT
Hi all,

I started this thread in the phpCassa Google group, but I think it belongs
here.

Here is my first post:

"I was wondering about a specific point of Cassandra data modeling.

If I need to know the number of connections to my website from each browser,
every hour, I can do:

Row key: $browser, column key: date('YmdH', $timestamp), value: counter.

I can increment this counter on every visit, and this should work. The point
is that I want to report these statistics with many different attributes
used as filters.

I mean, I will have information such as browser, browser version, screen
resolution, OS, OS version, localization... And I want to allow users to
get data (number of views), filtering it as much as they want.

For example, if I want to know how many people visited my website with
Safari, on Windows, and from New York, every hour, I can store:

Row key : $browser:$os:$localization, column key : date('YmdH',
$timestamp), value : counter.

This can't be the best solution because, according to combinatorics, I
would have to store n! counters to be able to read data with every
combination of filters. With 10 filters I would increment 3,628,800 counters.

That's surely not a good solution. How am I supposed to model this to
be able to read data with any filter I want?

Thanks,

Alain"
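
The layout described in the post above can be sketched as follows. This is only an illustration, not phpCassa or Cassandra code: a nested in-memory dictionary stands in for a counter column family, and the names `counters`, `hour_bucket` and `record_visit` are hypothetical. The hour bucket is assumed to be computed in UTC, whereas PHP's date() uses the server's configured time zone.

```python
from collections import defaultdict
from datetime import datetime, timezone

# In-memory stand-in for a counter column family:
# row key -> column key -> count. (Assumption: not a real Cassandra client.)
counters = defaultdict(lambda: defaultdict(int))


def hour_bucket(timestamp):
    """Rough equivalent of PHP date('YmdH', $timestamp), computed in UTC."""
    moment = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return moment.strftime("%Y%m%d%H")


def record_visit(browser, os_name, localization, timestamp):
    """Increment the hourly counter for one browser:os:localization row."""
    row_key = f"{browser}:{os_name}:{localization}"
    counters[row_key][hour_bucket(timestamp)] += 1


# Two visits within the same hour land in the same column.
record_visit("safari", "windows", "new-york", 1320426439)
record_visit("safari", "windows", "new-york", 1320426500)
print(dict(counters["safari:windows:new-york"]))
```

Reading one hour of data for a given combination is then a single row lookup, which is why the layout is attractive for real-time filtering.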



And here is the first answer I received (thanks to Tyler Hobbs):

"Technically, the number of potential different counters would be the
cardinality of each field multiplied together.  (Since one of the fields
holds a time, this number would continue to grow.) However, in practice
you'll have far fewer than this number of counters, because not every
possible combination of these will happen.

> That's not the good solution, for sure. How am I supposed to model
> this to be able to read data with any filter I want ?

It's a reasonable solution if you want to be able to drill down and filter
by any attribute.  If you want to be able to filter based on all of these
attributes, you have to store that information about every request in one
way or another."
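
The point made in this answer can be illustrated with a small calculation. The cardinalities and the sample visits below are made-up numbers, chosen only to show the gap between the theoretical upper bound and the combinations that actually occur.

```python
from math import prod

# Hypothetical cardinality of each dimension (illustrative numbers only).
cardinalities = {"browser": 5, "os": 4, "localization": 100}

# Upper bound on distinct row keys: the cardinalities multiplied together.
upper_bound = prod(cardinalities.values())

# Hypothetical traffic: only combinations that actually occur get a counter.
visits = [
    ("safari", "windows", "new-york"),
    ("safari", "windows", "new-york"),
    ("firefox", "linux", "paris"),
    ("chrome", "windows", "new-york"),
]
observed = len(set(visits))

print(upper_bound, observed)
```

Here the upper bound is 2000 row keys, while the sample traffic only creates 3 of them.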



I know it's a non-trivial problem, but I'm sure that some people have
already faced it before me.

I'll allow users to filter however they want, choosing dimensions with
checkboxes. They will be able to combine dimensions and ask for any
combination.

So, with this solution, I will have to store every event n times, with n =
number of possible combinations.
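
A minimal sketch of what "store every event n times" could look like, assuming the dimensions always appear in the same fixed order inside a key (so that browser:os and os:browser are one counter, not two). Under that assumption each event touches one counter per non-empty subset of dimensions, which is 2^n - 1 keys for n dimensions (1023 for 10 dimensions) rather than n!. The `name=value` key format and the function name `subset_keys` are hypothetical choices made for this example.

```python
from itertools import combinations


def subset_keys(event, dimensions):
    """Yield one row key per non-empty subset of dimensions, in fixed order."""
    for size in range(1, len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            # name=value pairs keep keys unambiguous when dimensions are skipped.
            yield ":".join(f"{dim}={event[dim]}" for dim in dims)


dimensions = ["browser", "os", "localization"]
event = {"browser": "safari", "os": "windows", "localization": "new-york"}

keys = list(subset_keys(event, dimensions))
for key in keys:
    print(key)

# 3 dimensions give 2**3 - 1 = 7 keys to increment per event.
print(len(keys))
```

Each of these keys would then get its hourly counter incremented, so the write cost per event grows exponentially with the number of dimensions even though every read stays a single lookup.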

I saw this yesterday: http://t.co/EXL6yAO8 (thanks to Dave Gardner). This
company seems to do something equivalent to the idea described in my first
post.

Does anyone have experience to share with this kind of problem?

Thank you,

Alain
