cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Laing, Michael" <michael.la...@nytimes.com>
Subject Re: Help with select IN query in cassandra
Date Mon, 01 Sep 2014 15:34:15 GMT
Did the OP propose that?


On Mon, Sep 1, 2014 at 10:53 AM, Jack Krupansky <jack@basetechnology.com>
wrote:

>   One comment on deletions – aren’t deletions kind of an anti-pattern for
> modern data processing, such as sensor data, time series data, and social
> media? I mean, isn’t it usually better to return a full history of the
> data, with some aging scheme, and manage the tracking of which values are
> “current” (or “recent”)? Shouldn’t we be looking for and promoting “write
> once” approaches as a much stronger preference/pattern? Or maybe I should
> say “write once and bulk delete on aging” rather than the exercise in
> futility of doing a massive number of deletes and updates in place?
>
> -- Jack Krupansky
>
>  *From:* Laing, Michael <michael.laing@nytimes.com>
> *Sent:* Monday, September 1, 2014 9:33 AM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Help with select IN query in cassandra
>
>  This should work for your query requirements - 2 tables w same info
> because disk is cheap and writes are fast so optimize for reads:
>
>  CREATE TABLE sensor_asset (
>   asset_id text,
>   event_time timestamp,
>   tuuid timeuuid,
>   sensor_reading map<text, text>,
>   sensor_serial_number text,
>   sensor_type int,
>   PRIMARY KEY ((asset_id), event_time)
> );
>
> CREATE TABLE sensor_tuuid (
>   asset_id text,
>   event_time timestamp,
>   tuuid timeuuid,
>   sensor_reading map<text, text>,
>   sensor_serial_number text,
>   sensor_type int,
>   PRIMARY KEY (tuuid)
> );
>
> 1. Give me all sensor data for an asset:
>
> select * from sensor_asset where asset_id = <asset>;
>
> 2. Give me sensor data that matches a set of timeuuids:
>
> select * from sensor_tuuid where tuuid in (<tuuid1>, <tuuid2>, ...);
>
> 3. Give me all sensor data for an asset collected after | before | between
> event_time(s):
>
> select * from sensor_asset where asset_id = <asset> and event_time > <ts1>;
>  select * from sensor_asset where asset_id = <asset> and event_time <
> <ts1>;
>  select * from sensor_asset where asset_id = <asset> and event_time <
> <ts1> and event_time > <ts2>;
>
> ***
>
> Many people (not me) handle sensor data, so there may be better overall
> approaches considering volumes, deletion, compaction etc.
>
> But the above is simple and should make your current approach workable as
> you iterate toward a complete solution.
>
> Cheers,
> ml
>
>
>
> On Sun, Aug 31, 2014 at 11:08 PM, Subodh Nijsure <subodh.nijsure@gmail.com
> > wrote:
>
>> Thanks for your help Michael.
>>
>> If specifying asset_id would help I can construct queries that can
>> include asset_id
>>
>> So I have been "playing" around with PRIMARY KEY definition and
>> following table definition
>>
>> CREATE TABLE sensor_info_table (
>>   asset_id text,
>>   event_time timestamp,
>>   "timestamp" timeuuid,
>>   sensor_reading map<text, text>,
>>   sensor_serial_number text,
>>   sensor_type int,
>>   PRIMARY KEY ((asset_id, "timestamp"), event_time)
>> );
>>
>> It does what I want to do, and I removed the index for timestamp item
>> since now it is part of primary key and thus my query like this works.
>>
>> SELECT * from sigsense.sensor_info_table where  asset_id='3' AND
>> timestamp IN (
>> 17830bb0-316a-11e4-800f-b888e30f5d17,16ddbdfe-316a-11e4-9f50-b888e30f5d17
>> );
>>
>> But now this doesn't work it give
>>
>> SELECT * from sensor_info_table where  asset_id='3' ;
>>
>> Bad Request: Partition key part timestamp must be restricted since
>> preceding part is
>>
>> I am keeping index on event_time as I sometime need to query something
>> "give me all data since time x" i.e. something like this works.
>>
>> SELECT * from sensor_info_table where  event_time > '2014-08-31
>> 16:54:02-0700' ALLOW FILTERING;
>>
>> However if I do this things then this don't work:
>>
>> SELECT * from sensor_info_table where  asset_id='3' AND event_time >
>> '2014-08-31 16:54:02-0700';
>>
>> Bad Request: Partition key part timestamp must be restricted since
>> preceding part is
>>
>> Also  I am not conformable with fact that I need to specify ALLOW
>> FILTERING.
>>
>> I guess cassandra schema design task asks designer to write down
>> queries before designing schema.
>>
>> For the above table definition I want to do following queries:
>>
>> - Give me all sensor data for given asset.
>> - Give me sensor data that matches given set of timeuuids
>> - Give me all sendor data for a given asset, that were collected after
>> | before | between  certain event_time.
>>
>> Given these query criteria how should  I construct my schema? One
>> thought has occurred to me is make three tables with each item
>> asset_id , event_time, timeuuid as primary keys and depending on type
>> of query choose the table to do query upon. That seems like a waste of
>> resources (disk, cpu ), also increasing insert times(!) but thats the
>> way things need to happen in cassandra world its okay. ( I am
>> two-three weeks into learning about cassandra).
>>
>> -Subodh
>>
>> On Sun, Aug 31, 2014 at 6:44 PM, Laing, Michael
>>  <michael.laing@nytimes.com> wrote:
>> > Oh it must be late - I missed the fact that you didn't want to specify
>> > asset_id. The above queries will still work but you have to use 'allow
>> > filtering' - generally not a good idea. I'll look again in the morning.
>> >
>> >
>> > On Sun, Aug 31, 2014 at 9:41 PM, Laing, Michael <
>> michael.laing@nytimes.com>
>> > wrote:
>> >>
>> >> Hmm. Because the clustering key is (event_time, "timestamp"),
>> event_time
>> >> must be specified as well - hopefully that info is available to the ux.
>> >>
>> >> Unfortunately you will then hit another problem with your query: you
>> are
>> >> selecting a collection field... this will not work with IN on
>> "timestamp".
>> >>
>> >> So you could select all the "timestamp"s for an asset_id/event_time:
>> >>>
>> >>> SELECT * from sensor_info_table where asset_id = 'a' and event_time
=
>> >>> 1231234;
>> >>
>> >>
>> >> Or you could apply a range of "timestamp"s:
>> >>>
>> >>> SELECT * from sensor_info_table where asset_id = 'a' and event_time
=
>> >>> 1231234 and "timestamp" > 1d934af3-3178-11e4-ba8d-406c8f1838fa and
>> >>> "timestamp" < 20b82021-3178-11e4-abc2-406c8f1838fa;
>> >>
>> >>
>> >> BTW the secondary indices are not a good idea: high cardinality and of
>> no
>> >> use in this query that I can see.
>> >>
>> >> ml
>> >>
>> >>
>> >> On Sun, Aug 31, 2014 at 8:40 PM, Subodh Nijsure <
>> subodh.nijsure@gmail.com>
>> >> wrote:
>> >>>
>> >>> Not really event time stamp is created by the sensor when it reads
>> data
>> >>> and  timestamp is something server creates when inserting data into
>> >>> cassandra db.  At later point in time my django ux allows users to
>> browse
>> >>> this data and reference interesting data points via the timestamp
>> field. The
>> >>> timestamp field is my bridge between Sal and nosql world.
>> >>>
>> >>> Subodh
>> >>>
>> >>> On Aug 31, 2014 5:33 PM, "Laing, Michael" <michael.laing@nytimes.com>
>> >>> wrote:
>> >>>>
>> >>>> Are event_time and timestamp essentially representing the same
>> datetime?
>> >>>>
>> >>>> On Sunday, August 31, 2014, Subodh Nijsure <subodh.nijsure@gmail.com
>> >
>> >>>> wrote:
>> >>>>>
>> >>>>> I have following database schema
>> >>>>>
>> >>>>> CREATE TABLE sensor_info_table (
>> >>>>>   asset_id text,
>> >>>>>   event_time timestamp,
>> >>>>>   "timestamp" timeuuid,
>> >>>>>   sensor_reading map<text, text>,
>> >>>>>   sensor_serial_number text,
>> >>>>>   sensor_type int,
>> >>>>>   PRIMARY KEY ((asset_id), event_time, "timestamp")
>> >>>>> );
>> >>>>>
>> >>>>> CREATE INDEX event_time_index ON sensor_info_table (event_time);
>> >>>>>
>> >>>>> CREATE INDEX timestamp_index ON sensor_info_table ("timestamp");
>> >>>>>
>> >>>>> Now I am able to insert the data into this table, however I
am
>> unable
>> >>>>> to do following query where I want to select items with specific
>> >>>>> timeuuid values.
>> >>>>>
>> >>>>> It gives me following error.
>> >>>>>
>> >>>>> SELECT * from mydb.sensor_info_table where timestamp IN (
>> >>>>> bfdfa614-3166-11e4-a61d-b888e30f5d17 ,
>> >>>>> bf4521ac-3166-11e4-87a3-b888e30f5d17) ;
>> >>>>>
>> >>>>> Bad Request: PRIMARY KEY column "timestamp" cannot be restricted
>> >>>>> (preceding column "event_time" is either not restricted or by
a
>> non-EQ
>> >>>>> relation)
>> >>>>>
>> >>>>> What do I have to do to make this work?
>> >>>>>
>> >>>>> For what its worth I am using django for my front end development
>> and
>> >>>>> I am using "timestamp timeuuid" field as unique indentifier
to
>> >>>>> reference specific sensor reading from django framework -- since
>> >>>>> cassandra doesn't have way to generate unique id upon insert
(like
>> >>>>> old-style rdms's auto-fields).
>> >>>>>
>> >>>>>
>> >>>>> Below is software version info.
>> >>>>>
>> >>>>> show VERSION ; [cqlsh 4.1.1 | Cassandra 2.0.9 | CQL spec 3.1.1
|
>> >>>>> Thrift protocol 19.39.0]
>> >>>>>
>> >>>>> I really don't understand what the error message preceeding
column
>> >>>>> "event_time" is either not restricted or by no-EQ relation?
>> >>>>>
>> >>>>> -Subodh Nijsure
>> >>
>> >>
>> >
>>
>
>

Mime
View raw message