From: cem <cayiroglu@gmail.com>
To: user@cassandra.apache.org
Date: Tue, 28 May 2013 20:46:18 +0200
Subject: data clean up problem

Hi Experts,

We have a general problem with cleaning up data from the disk. I need to free the disk space after the retention period, and the customer wants to dimension the disk space based on that.

After running multiple performance tests with a TTL of 1 day, we saw that compaction couldn't keep up with the request rate. Disks were getting full after 3 days, and there were also a lot of sstables older than 1 day.

Things that we tried:

- Change the compaction strategy to leveled. (helped a bit, but not much)

- Use a big sstable size (10 GB) with leveled compaction to get more aggressive compaction. (helped a bit, but not much)

- Upgrade Cassandra from 1.0 to 1.2 to use TTL histograms. (didn't help at all, since its key-overlap estimation algorithm generates a 100% match, although we don't have…)

Our column family structure is like this:

Event_data_cf: (we store event data; event_id is randomly generated, and each event has attributes like location=london)

row          data
event id     data blob

timeseries_cf: (key is the attribute that we want to index, e.g. location=london; we didn't use secondary indexes because the indexes are dynamic)

row          data
index key    time series of event ids (event1_id, event2_id, …)

timeseries_inv_cf: (this is used for removing an event by its row key)

row          data
event id     set of index keys

Candidate Solution: Implementing time range partitions.

Each partition will have its own column family set and will be managed by the client.

Suppose that you want a 7-day retention period. Then you can configure the partition size as 1 day and have 7 active partitions at any time, and drop the inactive partitions (older than 7 days). Dropping immediately removes the data from the disk (with the proper cassandra.yaml configuration).
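As a rough sketch of the client-side bookkeeping this implies (hypothetical names; assuming day-granularity partitions identified by date):

```python
from datetime import datetime, timedelta

RETENTION_DAYS = 7  # retention period from the example above

def partition_suffix(ts: datetime) -> str:
    """Day-granularity partition id, e.g. '20130528' for May 28, 2013."""
    return ts.strftime("%Y%m%d")

def active_partitions(now: datetime) -> list[str]:
    """The partitions still inside the retention window, newest first."""
    return [partition_suffix(now - timedelta(days=d)) for d in range(RETENTION_DAYS)]

def is_expired(partition: str, now: datetime) -> bool:
    """A partition outside the window can be dropped, which removes its
    data from disk immediately (given the right cassandra.yaml settings)."""
    return partition not in active_partitions(now)
```

Dropping then amounts to walking the known partitions and dropping every expired one.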

Storing an event:

Find the current partition p1

store the event data to Event_data_cf_p1

store the indexes to timeseries_cf_p1

store the inverted indexes to timeseries_inv_cf_p1
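Those three writes could be sketched as below; the `write(cf, key, value)` helper is purely hypothetical (a stand-in for whatever client call is actually used), as is the event shape:

```python
import uuid

def write(cf: str, key, value) -> None:
    """Hypothetical stand-in for a real Cassandra client write."""
    print(f"{cf}[{key!r}] <- {value!r}")

def store_event(partition: str, blob: bytes, attributes: list[str]) -> str:
    """Route one event and its indexes to the per-partition column families."""
    event_id = str(uuid.uuid4())  # event ids are randomly generated
    write(f"Event_data_cf_{partition}", event_id, blob)
    for index_key in attributes:  # e.g. "location=london"
        write(f"timeseries_cf_{partition}", index_key, event_id)
    # inverted index: event id -> its index keys, used later for deletion
    write(f"timeseries_inv_cf_{partition}", event_id, set(attributes))
    return event_id
```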


A time range query with an index:

Find all the partitions belonging to that time range

Read starting from the first partition until you reach the limit

...
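A sketch of that read path, with a hypothetical `read_index(partition, index_key)` helper standing in for the actual per-partition slice query:

```python
from datetime import datetime, timedelta

def partitions_in_range(start: datetime, end: datetime) -> list[str]:
    """All day partitions touched by [start, end], oldest first."""
    days = (end.date() - start.date()).days
    return [(start + timedelta(days=d)).strftime("%Y%m%d") for d in range(days + 1)]

def range_query(index_key: str, start: datetime, end: datetime,
                limit: int, read_index) -> list[str]:
    """Collect event ids partition by partition until the limit is reached."""
    results: list[str] = []
    for partition in partitions_in_range(start, end):
        results.extend(read_index(partition, index_key))
        if len(results) >= limit:
            break
    return results[:limit]
```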

Could you please provide your comments and concerns?

Is there any other option that we can try?

What do you think about the candidate solution?

Does anyone have the same issue? How would you solve it in another way?


Thanks in advance!

Cem
