cassandra-user mailing list archives

From Yuri Shkuro <...@uber.com>
Subject Re: Hotspots on Time Series based Model
Date Tue, 17 Nov 2015 14:17:10 GMT
You can also subdivide the hourly partition further by adding an artificial
"bucket" field to the partition key, populated with a random number, say
between 0 and 9. When you query, you fan out ten queries, one per bucket,
and merge the results manually. This way you pay a cost at read time, but
you spread the writes across more nodes, avoiding a single hotspot for each
hour.
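
A rough sketch of what that looks like with the DataStax Python driver
(the contact point, keyspace, and bucketed table name are made up here,
and I'm assuming 10 buckets):

    import random
    from cassandra.cluster import Cluster

    NUM_BUCKETS = 10  # assumed bucket count; tune to your write rate

    cluster = Cluster(['127.0.0.1'])    # assumed contact point
    session = cluster.connect('my_ks')  # hypothetical keyspace

    # Hypothetical variant of EVENT_LOG_BY_DATE with an artificial BUCKET
    # column added to the partition key, spreading each hour's writes over
    # NUM_BUCKETS partitions.
    session.execute("""
        CREATE TABLE IF NOT EXISTS event_log_by_date_bucketed (
            year INT, month INT, day INT, hour INT, bucket INT,
            log_ts TIMESTAMP, log_id VARINT,
            PRIMARY KEY ((year, month, day, hour, bucket), log_ts)
        ) WITH CLUSTERING ORDER BY (log_ts DESC)""")

    insert = session.prepare("""
        INSERT INTO event_log_by_date_bucketed
            (year, month, day, hour, bucket, log_ts, log_id)
        VALUES (?, ?, ?, ?, ?, ?, ?)""")

    def write_row(year, month, day, hour, log_ts, log_id):
        # A random bucket spreads consecutive writes across partitions.
        bucket = random.randrange(NUM_BUCKETS)
        session.execute(insert, (year, month, day, hour, bucket,
                                 log_ts, log_id))

    select = session.prepare("""
        SELECT log_ts, log_id FROM event_log_by_date_bucketed
        WHERE year = ? AND month = ? AND day = ? AND hour = ? AND bucket = ?""")

    def read_hour(year, month, day, hour):
        # Fan out one async query per bucket, then merge client-side.
        futures = [session.execute_async(select, (year, month, day, hour, b))
                   for b in range(NUM_BUCKETS)]
        rows = [row for f in futures for row in f.result()]
        rows.sort(key=lambda r: r.log_ts, reverse=True)  # restore DESC order
        return rows

The trade-off is NUM_BUCKETS times more queries per read, so pick a bucket
count just large enough that a single partition's write rate stays
comfortable.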
On Nov 17, 2015 8:22 AM, "Jack Krupansky" <jack.krupansky@gmail.com> wrote:

> I'd be more comfortable keeping partition size below 10MB, but the more
> critical factor is the write rate. In a technical sense a single node (and
> its replicas) and a single partition will be a hotspot, since all writes
> for an extended period of time (one hour) will go to that single node and
> partition, but the real issue is whether the rate of writes to that single
> node/partition is excessive. In your specific case, 416K rows written per
> hour works out to only ~115 rows written per second, which should be a
> fairly light load.
>
> Your query spans two hours, which means two partitions that won't
> necessarily be on the same node. Cassandra may or may not optimize that
> case. If it works for you, great; if it doesn't, you can split it into
> two queries submitted in parallel, each scanning rows in only a single
> partition.
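
A minimal sketch of that split with the DataStax Python driver (the
contact point and keyspace are made up; the table is the one from the
original post):

    from datetime import datetime
    from cassandra.cluster import Cluster

    cluster = Cluster(['127.0.0.1'])    # assumed contact point
    session = cluster.connect('my_ks')  # hypothetical keyspace

    # One single-partition query per hour instead of HOUR IN (10, 11).
    query = session.prepare("""
        SELECT log_ts, log_id FROM event_log_by_date
        WHERE year = ? AND month = ? AND day = ? AND hour = ?
          AND log_ts >= ? AND log_ts <= ?""")

    start = datetime(2015, 11, 15, 10, 0, 0)
    end = datetime(2015, 11, 15, 11, 0, 0)
    futures = [session.execute_async(query, (2015, 11, 15, hour, start, end))
               for hour in (10, 11)]

    # Each future scans exactly one partition; re-sort after merging if a
    # single global order is needed.
    rows = [row for f in futures for row in f.result()]
    rows.sort(key=lambda r: r.log_ts, reverse=True)
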
>
> -- Jack Krupansky
>
> On Tue, Nov 17, 2015 at 5:29 AM, Chandra Sekar KR <
> chandrasekarkr@hotmail.com> wrote:
>
>> Hi,
>>
>>
>> I have a time-series based table with the below structure and partition
>> size/volumetrics. The purpose of this table is to enable range based scans
>> on log_ts and filter the log_id, so it can be further used in the main
>> table (EVENT_LOG) for checking the actual data. The EVENT_LOG_BY_DATE acts
>> as a lookup (index) to the main table.
>>
>>
>> CREATE TABLE EVENT_LOG_BY_DATE (
>>   YEAR INT,
>>   MONTH INT,
>>   DAY INT,
>>   HOUR INT,
>>   LOG_TS TIMESTAMP,
>>   LOG_ID VARINT,
>>   PRIMARY KEY ((YEAR, MONTH, DAY, HOUR), LOG_TS))
>> WITH CLUSTERING ORDER BY (LOG_TS DESC);
>>
>>
>> SELECT LOG_TS, LOG_ID FROM EVENT_LOG_BY_DATE
>>   WHERE YEAR = 2015 AND
>>   MONTH = 11 AND
>>   DAY = 15 AND
>>   HOUR IN (10, 11) AND
>>   LOG_TS >= '2015-11-15 10:00:00+0000' AND
>>   LOG_TS <= '2015-11-15 11:00:00+0000';
>>
>>
>> The average daily volume of records for this table is ~10 million, and
>> the average row size is ~40B. The partition size for an hour comes close
>> to 13MB, with each partition spanning 416K rows. Will partitioning on
>> PRIMARY KEY ((YEAR, MONTH, DAY, HOUR)) cause any hotspot issues on a
>> node, given the hourly data size is ~13MB?
>>
>>
>> Is there any alternate way to model the above time-series table that
>> still enables range scans?
>>
>>
>> Regards, Chandra KR
>>
>
>
