On Mon 04 Feb 2013 04:42:12 AM CST, Paul van Hoven wrote:
> Thanks for the answer. Can anybody else answer my other two questions,
> because my problem is not solved yet?
>
> 2013/2/3 Edward Capriolo <edlinuxguru@gmail.com>:
>> This was the issue that prompted the "WITH FILTERING ALLOWED":
>>
>> https://issues.apache.org/jira/browse/CASSANDRA-4915
>>
>> Cassandra's storage system can only optimize certain queries.
>>
>> On Sun, Feb 3, 2013 at 2:07 PM, Paul van Hoven
>> <paul.van.hoven@googlemail.com> wrote:
>>> I'm not sure if I understood your answer.
>>>
>>>> When you have GB or TB of data any query that adds "WITH FILTERING"
>>>> will not work at scale.
>>> 1. You mean any query that requires "with filtering" is slow?
>>>
>>>> Secondary indexes need at least one equality. If you want to do this
>>>> at scale you might need a different design.
>>> 2. And what design would be recommendable then?
>>>
>>> 3. How should the query look like such that it would scale?
>>>
>>>
>>>
>>> 2013/2/3 Edward Capriolo <edlinuxguru@gmail.com>:
>>>> Secondary indexes need at least one equality. If you want to do this
>>>> at scale you might need a different design.
>>>>
>>>> Using WITH FILTERING and LIMIT 10 is simply grabbing the first few
>>>> random rows that match your criteria.
>>>>
>>>> When you have GB or TB of data any query that adds "WITH FILTERING"
>>>> will not work at scale.
>>>>
>>>> This is why it was added to the language CQL lets you do some queries
>>>> that "seem fast" when your developing with 10 rows, without this
>>>> clause you would not know if a query is fast because it hits a
>>>> cassandra index, or it is just fast because the results were found in
>>>> the first 10 rows.
>>>>
>>>> Edward
>>>>
>>>> On Sun, Feb 3, 2013 at 10:56 AM, Paul van Hoven
>>>> <paul.van.hoven@googlemail.com> wrote:
>>>>> Okay, here is the schema (actually it is in german, but I translated
>>>>> the column names such that it is easier to read for an international
>>>>> audience):
>>>>>
>>>>> cqlsh:demodb> describe table offerten_log_archiv;
>>>>>
>>>>> CREATE TABLE offerten_log_archiv (
>>>>> offerte_id int PRIMARY KEY,
>>>>> aktionen int,
>>>>> angezeigt bigint,
>>>>> datum timestamp,
>>>>> gutschrift bigint,
>>>>> kampagne_id int,
>>>>> klicks int,
>>>>> klicks_ungueltig int,
>>>>> kosten bigint,
>>>>> statistik_id bigint,
>>>>> stunden int,
>>>>> werbeflaeche_id int,
>>>>> werbemittel_id int
>>>>> ) WITH
>>>>> bloom_filter_fp_chance=0.010000 AND
>>>>> caching='KEYS_ONLY' AND
>>>>> comment='' AND
>>>>> dclocal_read_repair_chance=0.000000 AND
>>>>> gc_grace_seconds=864000 AND
>>>>> read_repair_chance=0.100000 AND
>>>>> replicate_on_write='true' AND
>>>>> compaction={'class': 'SizeTieredCompactionStrategy'};
>>>>>
>>>>> CREATE INDEX datum_key ON offerten_log_archiv (datum);
>>>>>
>>>>> CREATE INDEX stunden_key ON offerten_log_archiv (stunden);
>>>>>
>>>>> cqlsh:demodb>
>>>>>
>>>>> This is the query I'm trying to perform:
>>>>> cqlsh:demodb> select * from ola where date > '2013-01-01' and hour
= 0
>>>>> limit 10 allow filtering;
>>>>> Request did not complete within rpc_timeout.
>>>>>
>>>>> ola = offerten_log_archiv (table name)
>>>>> hour = stunde (column name)
>>>>> date = datum (column name)
>>>>>
>>>>> I hope this information makes my problem more clear.
>>>>>
>>>>>
>>>>>
>>>>> 2013/2/3 Edward Capriolo <edlinuxguru@gmail.com>:
>>>>>> Without seeing your schema it is hard to say, but in some cases "ALLOW
>>>>>> FILTERING" might be considered "EXPECT THIS COULD BE SLOW". It could
>>>>>> mean the query is not hitting and index and is going to page through
>>>>>> large amounts of data.
>>>>>>
>>>>>> On Sun, Feb 3, 2013 at 9:42 AM, Paul van Hoven
>>>>>> <paul.van.hoven@googlemail.com> wrote:
>>>>>>> After figuring out how to use the ">" operator on an secondary
index I
>>>>>>> noticed that in a column family of about 5.5 million datasets
I get a
>>>>>>> rpc_timeout when trying to read data from this table. In the
concrete
>>>>>>> situation I want to request data younger than January 1 2013.
The
>>>>>>> number of rows that should be affected are about 1 million. When
doing
>>>>>>> the request I get a timeout error:
>>>>>>>
>>>>>>> cqlsh:demodb> select * from ola where date > '2013-01-01'
and hour = 0
>>>>>>> limit 10 allow filtering;
>>>>>>> Request did not complete within rpc_timeout.
>>>>>>>
>>>>>>> Actually I find this very confusing since I would except an
>>>>>>> exceptional performance gain in comparison to a similar sql query.
>>>>>>> Therefore, I think the query I'm performing is not appropriate
for
>>>>>>> cassandra, although I would do a query like that in this manner
on a
>>>>>>> sql database. So my question now is: How should I perfrom this
query
>>>>>>> on cassandra?
according to your query, maybe this will be better
CREATE TABLE offerten_log_archiv (
offerte_id int,
aktionen int,
angezeigt bigint,
datum timestamp,
gutschrift bigint,
kampagne_id int,
klicks int,
klicks_ungueltig int,
kosten bigint,
statistik_id bigint,
stunden int,
werbeflaeche_id int,
werbemittel_id int
PRIMARY KEY(offerte_id, datum)
) WITH
...
|