cassandra-commits mailing list archives

From "Brent Haines (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-10084) Very slow performance streaming a large query from a single CF
Date Mon, 17 Aug 2015 15:25:47 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-10084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699680#comment-14699680 ]

Brent Haines commented on CASSANDRA-10084:
------------------------------------------

FWIW, changing to LCS (LeveledCompactionStrategy) helped quite a lot. Before that, the query
stream would halt periodically, sometimes long enough for the query to time out. Now things
are quite a bit more reliable, but still slow.

I don't know enough about how collections are stored to understand why this degradation
is so severe. In case it's interesting to you, I'll add that the performance hit happens for
all queries, whether or not the data column is selected (selecting every column except that
one doesn't change anything), and queries for older records that predate the addition of
the column are also slow.

I wouldn't want to derail your current efforts on 3.0. It would be enough to just confirm
that the map column is a likely candidate for a 10x slowdown for streaming queries. The map
column holds data we filter on post-query. We could format a JSON string for a text column and
drop the map column. Given the volume of data that we process, this would be a semi-permanent
hack though; I doubt we would ever convert it back to a collection given the cost of migration.
Still, it isn't a horrible approach.
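
The JSON-in-a-text-column workaround described above could look roughly like the sketch
below. It assumes a hypothetical text column, called data_json here, in place of the map
column; the helper names encode_data, decode_data, and filter_rows are illustrative and
are not part of the schema or of any driver API. The sketch covers only serialization and
post-query filtering and makes no Cassandra calls.

```python
import json

def encode_data(data):
    """Serialize a small str->str map to a compact JSON string for a text column."""
    # sort_keys gives a stable encoding, so identical maps produce identical text
    return json.dumps(data, sort_keys=True, separators=(",", ":"))

def decode_data(text):
    """Parse the JSON text back into a dict; treat a missing value as an empty map."""
    return json.loads(text) if text else {}

def filter_rows(rows, key, expected):
    """Post-query filter: yield rows whose decoded data has data[key] == expected."""
    for row in rows:
        if decode_data(row.get("data_json")).get(key) == expected:
            yield row

# Hypothetical rows as they might come back from the query (plain dicts for illustration)
rows = [
    {"value": "a", "data_json": encode_data({"source": "web", "kind": "click"})},
    {"value": "b", "data_json": encode_data({"source": "ios"})},
    {"value": "c", "data_json": None},  # record that predates the column
]
matches = list(filter_rows(rows, "source", "web"))
```

Rows written before the column existed decode to an empty map, so they simply fail the
filter instead of raising an error.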

I can add the profile later today too, in case I am running into something else. 


> Very slow performance streaming a large query from a single CF
> --------------------------------------------------------------
>
>                 Key: CASSANDRA-10084
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10084
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: Cassandra 2.1.8
> 12GB EC2 instance
> 12 node cluster
> 32 concurrent reads
> 32 concurrent writes
> 6GB heap space
>            Reporter: Brent Haines
>         Attachments: cassandra.yaml
>
>
> We have a relatively simple column family that we use to track event data from different
> providers. We have been utilizing it for some time. Here is what it looks like:
> {code}
> CREATE TABLE data.stories_by_text (
>     ref_id timeuuid,
>     second_type text,
>     second_value text,
>     object_type text,
>     field_name text,
>     value text,
>     story_id timeuuid,
>     data map<text, text>,
>     PRIMARY KEY ((ref_id, second_type, second_value, object_type, field_name), value, story_id)
> ) WITH CLUSTERING ORDER BY (value ASC, story_id ASC)
>     AND bloom_filter_fp_chance = 0.01
>     AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
>     AND comment = 'Searchable fields and actions in a story are indexed by ref id which corresponds to a brand, app, app instance, or user.'
>     AND compaction = {'min_threshold': '4', 'cold_reads_to_omit': '0.0', 'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32'}
>     AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
>     AND dclocal_read_repair_chance = 0.1
>     AND default_time_to_live = 0
>     AND gc_grace_seconds = 864000
>     AND max_index_interval = 2048
>     AND memtable_flush_period_in_ms = 0
>     AND min_index_interval = 128
>     AND read_repair_chance = 0.0
>     AND speculative_retry = '99.0PERCENTILE';
> {code}
> On a daily basis, we run a query that pulls the complete data for a given index. It
> looks like this:
> {code}
> select * from stories_by_text where ref_id = f0124740-2f5a-11e5-a113-03cdf3f3c6dc and second_type = 'Day' and second_value = '20150812' and object_type = 'booshaka:user' and field_name = 'hashedEmail';
> {code}
> In the past, we have been able to pull millions of records out of the CF in a few seconds.
> We recently added the data column so that we could filter on event data and provide more detailed
> analysis of activity for our reports. The data map, declared as 'data map<text, text>',
> is very small: only 2 or 3 name/value pairs.
> Since we added this column, our streaming query performance has gone straight to
> hell. I just ran the above query and it took 46 minutes to read 86K rows before it timed
> out.
> I am uncertain what other data you need to see in order to diagnose this. We are using
> STCS and are considering a change to Leveled Compaction. The table is repaired nightly, and
> the updates, which arrive at a very fast clip, only affect the partition key for today, while
> the queries are for previous days only.
> To my knowledge, these queries no longer ever finish. They time out, even though I put
> a 60-second timeout on the read for the cluster. I can watch the stream pause for 30 to 50 seconds
> many times during a run.
> Again, this only started happening when we added the data column.
> Please let me know what else you need for this. It is having a very big impact on our
> system.
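
For scale, the timing quoted in the issue description (86K rows read in 46 minutes) can
be turned into a rough throughput figure. This is only back-of-the-envelope arithmetic on
the reported numbers; the 1,000,000-row projection is an illustrative assumption, not a
measurement from the report.

```python
rows_read = 86_000        # rows read before the timeout (from the report)
elapsed_s = 46 * 60       # 46 minutes expressed in seconds

# Observed average throughput over the whole run, pauses included
rows_per_s = rows_read / elapsed_s

# Assumption for illustration only: how long one million rows would take
# if this average rate held for the entire read
projected_rows = 1_000_000
projected_hours = projected_rows / rows_per_s / 3600

print(f"observed rate: {rows_per_s:.1f} rows/s")
print(f"projected time for {projected_rows:,} rows: {projected_hours:.1f} hours")
```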



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
