incubator-cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Laing, Michael" <michael.la...@nytimes.com>
Subject Re: Versioning in cassandra
Date Tue, 03 Sep 2013 21:17:13 GMT
I use the technique described in my previous message to handle millions of
messages and their versions.

Actually, I use timeuuid's instead of timestamps, as they have more
'uniqueness'. Also I index my maps by a timeuuid that is the complement
(based on a future date) of a current timeuuid. Since maps are kept sorted
by key, this means I can just pop off the first one to get the most recent.

The downside of this approach is that you get more stuff returned to you
from Cassandra than you need. To mitigate that I queue a job to examine and
correct the situation if, upon doing a read, the number of versions for a
particular key is higher than some threshold, e.g. 50. There are many ways
to approach this problem.

Our actual implementation proceeds to another level, as we also have
replicas of versions. This happens because we process important
transactions in parallel and can expect up to 9 replicas of each version.
We journal them all and use them for reporting latencies in our processing
pipelines as well as for replay when we need to recover application state.

Regards,

Michael


On Tue, Sep 3, 2013 at 3:15 PM, Laing, Michael <michael.laing@nytimes.com>wrote:

> try the following. -ml
>
> -- put this in <file> and run using 'cqlsh -f <file>
>
> DROP KEYSPACE latest;
>
> CREATE KEYSPACE latest WITH replication = {
>     'class': 'SimpleStrategy',
>     'replication_factor' : 1
> };
>
> USE latest;
>
> CREATE TABLE file (
>     parentid text, -- row_key, same for each version
>     id text, -- column_key, same for each version
>     contenttype map<timestamp, text>, -- differs by version, version is
> the key to the map
>     PRIMARY KEY (parentid, id)
> );
>
> update file set contenttype = contenttype + {'2011-03-04':'pdf1'} where
> parentid = 'd1' and id = 'f1';
> update file set contenttype = contenttype + {'2011-03-05':'pdf2'} where
> parentid = 'd1' and id = 'f1';
> update file set contenttype = contenttype + {'2011-03-04':'pdf3'} where
> parentid = 'd1' and id = 'f2';
> update file set contenttype = contenttype + {'2011-03-05':'pdf4'} where
> parentid = 'd1' and id = 'f2';
>
> select * from file where parentid = 'd1';
>
> -- returns:
>
> -- parentid | id | contenttype
>
> ------------+----+--------------------------------------------------------------------------
> --       d1 | f1 | {'2011-03-04 00:00:00-0500': 'pdf1', '2011-03-05
> 00:00:00-0500': 'pdf2'}
> --       d1 | f2 | {'2011-03-04 00:00:00-0500': 'pdf3', '2011-03-05
> 00:00:00-0500': 'pdf4'}
>
> -- use an app to pop off the latest version from the map
>
> -- map other varying fields using the same technique as used for
> contenttype
>
>
>
> On Tue, Sep 3, 2013 at 2:31 PM, Vivek Mishra <mishra.vivs@gmail.com>wrote:
>
>> create table file(id text , parentid text,contenttype text,version
>> timestamp, descr text, name text, PRIMARY KEY(id,version) ) WITH CLUSTERING
>> ORDER BY (version DESC);
>>
>> insert into file (id, parentid, version, contenttype, descr, name) values
>> ('f2', 'd1', '2011-03-06', 'pdf', 'f2 file', 'file1');
>> insert into file (id, parentid, version, contenttype, descr, name) values
>> ('f2', 'd1', '2011-03-05', 'pdf', 'f2 file', 'file1');
>> insert into file (id, parentid, version, contenttype, descr, name) values
>> ('f1', 'd1', '2011-03-05', 'pdf', 'f1 file', 'file1');
>> insert into file (id, parentid, version, contenttype, descr, name) values
>> ('f1', 'd1', '2011-03-04', 'pdf', 'f1 file', 'file1');
>> create index on file(parentid);
>>
>>
>> select * from file where id='f1' and parentid='d1' limit 1;
>>
>> select * from file where parentid='d1' limit 1;
>>
>>
>> Will it work for you?
>>
>> -Vivek
>>
>>
>>
>>
>> On Tue, Sep 3, 2013 at 11:29 PM, Vivek Mishra <mishra.vivs@gmail.com>wrote:
>>
>>> My bad. I did miss out to read "latest version" part.
>>>
>>> -Vivek
>>>
>>>
>>> On Tue, Sep 3, 2013 at 11:20 PM, dawood abdullah <
>>> muhammed.dawood@gmail.com> wrote:
>>>
>>>> I have tried with both the options creating secondary index and also
>>>> tried adding parentid to primary key, but I am getting all the files with
>>>> parentid 'yyy', what I want is the latest version of file with the
>>>> combination of parentid, fileid. Say below are the records inserted in the
>>>> file table:
>>>>
>>>> insert into file (id, parentid, version, contenttype, description,
>>>> name) values ('f1', 'd1', '2011-03-04', 'pdf', 'f1 file', 'file1');
>>>> insert into file (id, parentid, version, contenttype, description,
>>>> name) values ('f1', 'd1', '2011-03-05', 'pdf', 'f1 file', 'file1');
>>>> insert into file (id, parentid, version, contenttype, description,
>>>> name) values ('f2', 'd1', '2011-03-05', 'pdf', 'f1 file', 'file1');
>>>> insert into file (id, parentid, version, contenttype, description,
>>>> name) values ('f2', 'd1', '2011-03-06', 'pdf', 'f1 file', 'file1');
>>>>
>>>> I want to write a query which returns me second and last record and not
>>>> the first and third record, because for the first and third record there
>>>> exists a latest version, for the combination of id and parentid.
>>>>
>>>> I am confused If at all this is achievable, please suggest.
>>>>
>>>> Dawood
>>>>
>>>>
>>>>
>>>> On Tue, Sep 3, 2013 at 10:58 PM, Vivek Mishra <mishra.vivs@gmail.com>wrote:
>>>>
>>>>> create secondary index over parentid.
>>>>> OR
>>>>> make it part of clustering key
>>>>>
>>>>> -Vivek
>>>>>
>>>>>
>>>>> On Tue, Sep 3, 2013 at 10:42 PM, dawood abdullah <
>>>>> muhammed.dawood@gmail.com> wrote:
>>>>>
>>>>>> Jan,
>>>>>>
>>>>>> The solution you gave works spot on, but there is one more
>>>>>> requirement I forgot to mention. Following is my table structure
>>>>>>
>>>>>> CREATE TABLE file (
>>>>>>   id text,
>>>>>>   contenttype text,
>>>>>>   createdby text,
>>>>>>   createdtime timestamp,
>>>>>>   description text,
>>>>>>   name text,
>>>>>>   parentid text,
>>>>>>   version timestamp,
>>>>>>   PRIMARY KEY (id, version)
>>>>>>
>>>>>> ) WITH CLUSTERING ORDER BY (version DESC);
>>>>>>
>>>>>>
>>>>>> The query (select * from file where id = 'xxx' limit 1;) provided
>>>>>> solves the problem of finding the latest version file. But I have
one more
>>>>>> requirement of finding all the latest version files having parentid
say
>>>>>> 'yyy'.
>>>>>>
>>>>>> Please suggest how can this query be achieved.
>>>>>>
>>>>>> Dawood
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Sep 3, 2013 at 12:43 AM, dawood abdullah <
>>>>>> muhammed.dawood@gmail.com> wrote:
>>>>>>
>>>>>>> In my case version can be timestamp as well. What do you suggest
>>>>>>> version number to be, do you see any problems if I keep version
as counter
>>>>>>> / timestamp ?
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Sep 3, 2013 at 12:22 AM, Jan Algermissen <
>>>>>>> jan.algermissen@nordsc.com> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> On 02.09.2013, at 20:44, dawood abdullah <muhammed.dawood@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> > Requirement is like I have a column family say File
>>>>>>>> >
>>>>>>>> > create table file(id text primary key, fname text, version
int,
>>>>>>>> mimetype text, content text);
>>>>>>>> >
>>>>>>>> > Say, I have few records inserted, when I modify an existing
>>>>>>>> record (content is modified) a new version needs to be created.
As I need
>>>>>>>> to have provision to revert to back any old version whenever
required.
>>>>>>>> >
>>>>>>>>
>>>>>>>> So, can version be a timestamp? Or does it need to be an
integer?
>>>>>>>>
>>>>>>>> In the former case, make use of C*'s ordering like so:
>>>>>>>>
>>>>>>>> CREATE TABLE file (
>>>>>>>>    file_id text,
>>>>>>>>    version timestamp,
>>>>>>>>    fname text,
>>>>>>>>    ....
>>>>>>>>    PRIMARY KEY (file_id,version)
>>>>>>>> ) WITH CLUSTERING ORDER BY (version DESC);
>>>>>>>>
>>>>>>>> Get the latest file version with
>>>>>>>>
>>>>>>>> select * from file where file_id = 'xxx' limit 1;
>>>>>>>>
>>>>>>>> If it has to be an integer, use counter columns.
>>>>>>>>
>>>>>>>> Jan
>>>>>>>>
>>>>>>>>
>>>>>>>> > Regards,
>>>>>>>> > Dawood
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > On Mon, Sep 2, 2013 at 10:47 PM, Jan Algermissen <
>>>>>>>> jan.algermissen@nordsc.com> wrote:
>>>>>>>> > Hi Dawood,
>>>>>>>> >
>>>>>>>> > On 02.09.2013, at 16:36, dawood abdullah <
>>>>>>>> muhammed.dawood@gmail.com> wrote:
>>>>>>>> >
>>>>>>>> > > Hi
>>>>>>>> > > I have a requirement of versioning to be done in
Cassandra.
>>>>>>>> > >
>>>>>>>> > > Following is my column family definition
>>>>>>>> > >
>>>>>>>> > > create table file_details(id text primary key,
fname text,
>>>>>>>> version int, mimetype text);
>>>>>>>> > >
>>>>>>>> > > I have a secondary index created on fname column.
>>>>>>>> > >
>>>>>>>> > > Whenever I do an insert for the same 'fname', the
version
>>>>>>>> should be incremented. And when I retrieve a row with fname
it should
>>>>>>>> return me the latest version row.
>>>>>>>> > >
>>>>>>>> > > Is there a better way to do in Cassandra? Please
suggest what
>>>>>>>> approach needs to be taken.
>>>>>>>> >
>>>>>>>> > Can you explain more about your use case?
>>>>>>>> >
>>>>>>>> > If the version need not be a small number, but could
be a
>>>>>>>> timestamp, you could make use of C*'s ordering feature ,
have the database
>>>>>>>> set the new version as a timestamp and retrieve the latest
one with a
>>>>>>>> simple LIMIT 1 query. (I'll explain more when this is an
option for you).
>>>>>>>> >
>>>>>>>> > Jan
>>>>>>>> >
>>>>>>>> > P.S. Me being a REST/HTTP head, an alarm rings when
I see
>>>>>>>> 'version' next to 'mimetype' :-) What exactly are you versioning
here?
>>>>>>>> Maybe we can even change the situation from a functional
POV?
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > >
>>>>>>>> > > Regards,
>>>>>>>> > >
>>>>>>>> > > Dawood
>>>>>>>> > >
>>>>>>>> > >
>>>>>>>> > >
>>>>>>>> > >
>>>>>>>> >
>>>>>>>> >
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Mime
View raw message