incubator-cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From David Savage <davemssav...@gmail.com>
Subject Re: CQL Select Map using an IN relationship
Date Thu, 13 Mar 2014 20:01:51 GMT
Thanks for the explanations.

To confirm I understand, Michaels explanation seems to say that that :

* the partition key supports =/IN but not >,>=,<,<=
* the range key (or clustering column) supports =,>,>=,<,<= but not IN. Is
that correct?

Jacks explanation seems to say that by grouping the two columns in the
primary key ((key1, key2)) this will prevent data from being partitioned
across nodes in the cluster, is that correct?

Also in another response thread Sylvian seemed to hint that it's historical
that IN is not supported on the range key / clustering column [1]. If I've
understood that correctly I'm happy to raise a jira ticket to track this so
it can be fixed.

Thanks for your help.

Kind regards,

Dave

[1] Please let me know if I should pick one of these terms for clarity...


On Thu, Mar 13, 2014 at 6:08 PM, Jack Krupansky <jack@basetechnology.com>wrote:

>   "range key" is formally known as "clustering column". One or more
> clustering columns can be specified to identify individual rows in a
> partition. Without clustering columns, one partition is one row. So, it's a
> matter of whether you want your rows to be in the same partition or
> distributed.
>
> -- Jack Krupansky
>
>  *From:* Laing, Michael <michael.laing@nytimes.com>
> *Sent:* Thursday, March 13, 2014 1:39 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: CQL Select Map using an IN relationship
>
>  Think of them as:
>
>
> PRIMARY KEY (partition_key[, range_key])
>
>
> where the partition_key can be compounded as:
>
>
> (partition_key0 [, partition_key1, ...])
>
>
> and the optional range_key can be compounded as:
>
>
> range_key0 [, range_key1 ...]
>
>
> If you do this: PRIMARY KEY (key1, key2) - then key1 is the partition_key
> and key2 is the range_key and queries will work that hash to key1 (the
> partition) using = or IN and specify a range on key2.
>
> But if you do this: PRIMARY key ((key1, key2)) then (key1, key2) is the
> compound partition key - there is no range key - and you can specify = on
> key1 and = or IN on key2 (but not a range).
>
> Anyway that's what I remember! Hope it helps.
>
> ml
>
>
> On Thu, Mar 13, 2014 at 11:27 AM, David Savage <davemssavage@gmail.com>wrote:
>
>> Great that works, thx! I probably would have never found that...
>>
>> It now makes me wonder in general when to use PRIMARY KEY (key1, key2) or
>> PRIMARY KEY ((key1, key2)), any examples would be welcome if you have the
>> time.
>>
>> Kind regards,
>>
>> Dave
>>
>>
>> On Thu, Mar 13, 2014 at 2:56 PM, Laing, Michael <
>> michael.laing@nytimes.com> wrote:
>>
>>> Create your table like this and it will work:
>>>
>>> CREATE TABLE test.documents (group text,id bigint,data
>>> map<text,text>,PRIMARY KEY ((group, id)));
>>>
>>> The extra parens catenate 'group' and 'id' into the partition key - IN
>>> will work on the last component of a partition key.
>>>
>>> ml
>>>
>>>
>>> On Thu, Mar 13, 2014 at 10:40 AM, David Savage <davemssavage@gmail.com>wrote:
>>>
>>>> Nope, upgraded to 2.0.5 and still get the same problem, I actually
>>>> simplified the problem a little in my first post, there's a composite
>>>> primary key involved as I need to partition ids into groups
>>>>
>>>> So the full CQL statements are:
>>>>
>>>>
>>>> CREATE KEYSPACE test WITH replication = {'class':'SimpleStrategy',
>>>> 'replication_factor':3};
>>>>
>>>>
>>>>
>>>> CREATE TABLE test.documents (group text,id bigint,data
>>>> map<text,text>,PRIMARY KEY (group, id));
>>>>
>>>>
>>>>
>>>> INSERT INTO test.documents(id,group,data) VALUES
>>>> (0,'test',{'count':'0'});
>>>>
>>>> INSERT INTO test.documents(id,group,data) VALUES
>>>> (1,'test',{'count':'1'});
>>>>
>>>> INSERT INTO test.documents(id,group,data) VALUES
>>>> (2,'test',{'count':'2'});
>>>>
>>>>
>>>>
>>>> SELECT id,data FROM test.documents WHERE group='test' AND id IN (0,1,2);
>>>>
>>>>
>>>>
>>>> Thanks for your help.
>>>>
>>>>
>>>>
>>>> Kind regards,
>>>>
>>>>
>>>>
>>>> /Dave
>>>>
>>>>
>>>> On Thu, Mar 13, 2014 at 2:00 PM, David Savage <davemssavage@gmail.com>wrote:
>>>>
>>>>>  Hmmm that maybe the problem, I'm currently testing with 2.0.2 which
>>>>> got dragged in by the cassandra unit library I'm using for testing [1]
I
>>>>> will try to fix my build dependencies and retry, thx.
>>>>>
>>>>> /Dave
>>>>>
>>>>> [1] https://github.com/jsevellec/cassandra-unit
>>>>>
>>>>>
>>>>> On Thu, Mar 13, 2014 at 1:56 PM, Laing, Michael <
>>>>> michael.laing@nytimes.com> wrote:
>>>>>
>>>>>> I have no problem doing this w 2.0.5 - what version of C* are you
>>>>>> using? Or maybe I don't understand your data model... attach 'creates'
if
>>>>>> you don't mind.
>>>>>>
>>>>>> ml
>>>>>>
>>>>>>
>>>>>> On Thu, Mar 13, 2014 at 9:24 AM, David Savage <davemssavage@gmail.com
>>>>>> > wrote:
>>>>>>
>>>>>>> Hi Peter,
>>>>>>>
>>>>>>> Thanks for the help, unfortunately I'm not sure that's the problem,
>>>>>>> the id is the primary key on the documents table and the timestamp
>>>>>>> is the primary key on the eventlog table
>>>>>>>
>>>>>>>
>>>>>>> Kind regards,
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Dave
>>>>>>>
>>>>>>> On Thursday, 13 March 2014, Peter Lin <woolfel@gmail.com>
wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> it's not clear to me if your "id" column is the KEY or just
a
>>>>>>>> regular column with secondary index.
>>>>>>>>
>>>>>>>> queries that have IN on non primary key columns isn't supported
>>>>>>>> yet. not sure if that answers your question.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Mar 13, 2014 at 7:12 AM, David Savage <
>>>>>>>> davemssavage@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi there,
>>>>>>>>>
>>>>>>>>> I'm experimenting using cassandra and have run across
an error
>>>>>>>>> message which I need a little more information on.
>>>>>>>>>
>>>>>>>>> The use case I'm experimenting with is a series of document
>>>>>>>>> updates (documents being an arbitrary map of key value
pairs), I would like
>>>>>>>>> to find the latest document updates after a specified
time period. I don't
>>>>>>>>> want to store many copies of the documents (one per update)
as the updates
>>>>>>>>> are often only to single keys in the map so that would
involve a lot of
>>>>>>>>> duplicated data.
>>>>>>>>>
>>>>>>>>> The solution I've found that seems to fit best in terms
of
>>>>>>>>> performance is to have two tables.
>>>>>>>>>
>>>>>>>>> One that has an event log of timeuuid -> docid and
a second that
>>>>>>>>> stores the documents themselves stored by docid ->
map<string, string>. I
>>>>>>>>> then run two queries, one to select ids that have changed
after a certain
>>>>>>>>> time:
>>>>>>>>>
>>>>>>>>> SELECT id FROM eventlog WHERE timestamp>=minTimeuuid($minimumTime)
>>>>>>>>>
>>>>>>>>> and then a second to select the actual documents themselves
>>>>>>>>>
>>>>>>>>> SELECT id, data FROM documents WHERE id IN (0, 1, 2,
3, 4, 5, 6,
>>>>>>>>> 7...)
>>>>>>>>>
>>>>>>>>> However this then explodes on query with the error message:
>>>>>>>>>
>>>>>>>>> "Cannot restrict PRIMARY KEY part id by IN relation as
a
>>>>>>>>> collection is selected by the query"
>>>>>>>>>
>>>>>>>>> Detective work lead me to these lines in
>>>>>>>>> org.apache.cassandra.cql3.statementsSelectStatement:
>>>>>>>>>
>>>>>>>>>                      // We only support IN for the last
name and
>>>>>>>>> for compact storage so far
>>>>>>>>>                     // TODO: #3885 allows us to extend
to non
>>>>>>>>> compact as well, but that remains to be done
>>>>>>>>>                     if (i != stmt.columnRestrictions.length
- 1)
>>>>>>>>>                         throw new
>>>>>>>>> InvalidRequestException(String.format("PRIMARY KEY part
%s cannot be
>>>>>>>>> restricted by IN relation", cname));
>>>>>>>>>                     else if (stmt.selectACollection())
>>>>>>>>>                         throw new
>>>>>>>>> InvalidRequestException(String.format("Cannot restrict
PRIMARY KEY part %s
>>>>>>>>> by IN relation as a collection is selected by the query",
cname));
>>>>>>>>>
>>>>>>>>> It seems like #3885 will allow support for the first
IF block
>>>>>>>>> above, but I don't think it will allow the second, am
I correct?
>>>>>>>>>
>>>>>>>>> Any pointers on how I can work around this would be greatly
>>>>>>>>> appreciated.
>>>>>>>>>
>>>>>>>>> Kind regards,
>>>>>>>>>
>>>>>>>>> Dave
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>

Mime
View raw message