cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Zachary Marcantel (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-4511) Secondary index support for CQL3 collections
Date Tue, 19 Nov 2013 00:05:23 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-4511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13825918#comment-13825918
] 

Zachary Marcantel edited comment on CASSANDRA-4511 at 11/19/13 12:04 AM:
-------------------------------------------------------------------------

I can think of (and need this for) a few use cases. They do revolve around more of a filtering
aspect, however.

It can be said that sometimes models/data/rows belong to a "theoretical list" that contain
infinite possibilities. 

Naturally, you do not want to store all possibilities, but may want to filter on those which
are true while ignoring all other possiblities.

Examples of these lists could be:
- Things of interest to a user
  -- LinkedIn calls these skills, Facebook has 'liked pages', etc
- Movies Watched
  -- Netflix surely doesn't (want to) have a USER x MOVIES sized table
  -- Nor do they want (user x movies) number of columnfamilies
- Places Visited
  -- set yes/no for EVERY location on Earth?

Benchmarks may prove me wrong, but theoretically the performance hit would be minimal if the
data is truly partitioned well, collections are kept small, and secondary indexing used only
as a filter and not data storage.

Dynamic columns may make some of my examples easier, but bring their own headaches (post-filling
dynamically created columns, massively wide tables, largely unused data == disk bloat).

I'll give a couple examples and use [~jbellis] syntax, as well as a potential map-based indexing.

- Users contained in group(s):
-- Note: this could be done with columns, but if we assume groups can contain infinitely many
possibilities (like Facebook groups), this becomes an issue)
- {code:sql}
SELECT * FROM main.users WHERE 'players' IN groups AND 'admins' NOT IN groups;
{code}

- Filter on toggle-based UI elements within user profiles:
{code:sql}
SELECT * FROM main.users WHERE notify['email'] = true;
{code}


Currently, one would have to detail the entirety of the list that has been seen in one of
three ways:

{code:sql}
CREATE TABLE main.interests (
    interest_name TEXT PRIMARY KEY,
    users LIST <TEXT>
);
{code}

OR

{code:sql}
CREATE TABLE main.users (
    id UUID PRIMARY KEY,
    ... other user fields ...
    interests LIST <TEXT>
);
{code}

OR

{code:SQL}
CREATE TABLE main.users (
    id UUID PRIMARY KEY,
    a BOOLEAN,
    b BOOLEAN,
    ... Iterate through possibilities ...
    z BOOLEAN
);
{code}

The first two would require post-result processing (map-reduce or similar) to find just the
users containing a certain key/value. The last example would require much wasted disk space
and post-filling of dynamically created columns. 

Rather, with indexing:
{code:sql}
CREATE TABLE main.users (
    id UUID PRIMARY KEY,
    name TEXT,
    age INT,
    interests LIST <TEXT>
);
{code}

where 'interests' is a relatively small (~10-25 elements) list that can be filtered by:

{code:sql}
SELECT * FROM main.users WHERE 'baseball' IN interests AND 'soccer' IN interests;
{code}


was (Author: zmarcantel):
I can think of (and need this for) a few use cases. They do revolve around more of a filtering
aspect, however.

It can be said that sometimes models/data/rows belong to a "theoretical list" that contain
infinite possibilities. 

Naturally, you do not want to store all possibilities, but may want to filter on those which
are true while ignoring all other possiblities.

Examples of these lists could be:
- Things of interest to a user
  -- LinkedIn calls these skills, Facebook has 'liked pages', etc
- Movies Watched
  -- Netflix surely doesn't (want to) have a USER x MOVIES sized table
  -- Nor do they want (user x movies) number of columnfamilies
- Places Visited
  -- set yes/no for EVERY location on Earth?

Benchmarks may prove me wrong, but theoretically the performance hit would be minimal if the
data is truly partitioned well, collections are kept small, and secondary indexing used only
as a filter and not data storage.

Dynamic columns may make some of my examples easier, but bring their own headaches (post-filling
dynamically created columns, massively wide tables, largely unused data == disk bloat).

I'll give a couple examples and use [~jbellis] syntax, as well as a potential map-based indexing.

- Users contained in group(s):
-- Note: this could be done with columns, but if we assume groups can contain infinitely many
possibilities (like Facebook groups)
- {code:sql}
SELECT * FROM main.users WHERE 'players' IN groups AND 'admins' NOT IN groups;
{code}

- Filter on toggle-based UI elements within user profiles:
{code:sql}
SELECT * FROM main.users WHERE notify['email'] = true;
{code}


Given a possibly endless list, map rows (or data pieces) onto items within that list:
For instance, user profiles often have 'interests' that could contain one/multiple of thousands
if not millions of possibilities.
Currently, one would have to detail the entirety of the list that has been seen in one of
two ways:
{code:sql}
CREATE TABLE main.interests (
    interest_name TEXT PRIMARY KEY,
    users LIST <TEXT>
);
{code}

OR

{code:sql}
CREATE TABLE main.users (
    id UUID PRIMARY KEY,
    ... other user fields ...
    interests LIST <TEXT>
);
{code}

Both of which would require post-result processing (map-reduce or similar) to find just the
users containing a certain key/value.

Rather, with indexing:
{code:sql}
CREATE TABLE main.users (
    id UUID PRIMARY KEY,
    name TEXT,
    age INT,
    interests LIST <TEXT>
);
{code}

where 'interests' is a relatively small (~10-25 elements) list that can be filtered by:

{code:sql}
SELECT * FROM main.users WHERE 'baseball' IN interests AND 'soccer' IN interests;
{code}

> Secondary index support for CQL3 collections 
> ---------------------------------------------
>
>                 Key: CASSANDRA-4511
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4511
>             Project: Cassandra
>          Issue Type: Improvement
>    Affects Versions: 1.2.0 beta 1
>            Reporter: Sylvain Lebresne
>            Assignee: Sylvain Lebresne
>             Fix For: 2.1
>
>         Attachments: 4511.txt
>
>
> We should allow to 2ndary index on collections. A typical use case would be to add a
'tag set<String>' to say a user profile and to query users based on what tag they have.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

Mime
View raw message