From: "Hiller, Dean"
To: "user@cassandra.apache.org"
Date: Tue, 28 May 2013 13:26:37 -0600
Subject: Re: data clean up problem

"Don't do any delete" != "need to free the disk space after retention period", which you have in both your emails. My understanding is that TTL is an expiry and, just like tombstones, expired data will only really be deleted upon a compaction (i.e., you do have deletes via TTL from the sound of it). If you have a TTL of 1 day, the data does not go away immediately; it stays until the next compaction is run, and then the compaction may not run on all rows??? I am not quite sure there, except that the data stays in the sstable until it is compacted into a new sstable, and it is thrown away then as long as the TTL has passed.

Dean

From: cem
Reply-To: "user@cassandra.apache.org"
Date: Tuesday, May 28, 2013 1:17 PM
To: "user@cassandra.apache.org"
Subject: Re: data clean up problem

Thanks for the answer, but it is already set to 0 since I don't do any deletes.

Cem

On Tue, May 28, 2013 at 9:03 PM, Edward Capriolo wrote:

You need to change the gc_grace time of the column family. It defaults to 10 days. By default the tombstones will not go away for 10 days.

On Tue, May 28, 2013 at 2:46 PM, cem wrote:

Hi Experts,

We have a general problem with cleaning up data from the disk. I need to free the disk space after the retention period, and the customer wants to dimension the disk space based on that.

After running multiple performance tests with a TTL of 1 day, we saw that the compaction couldn't keep up with the request rate. Disks were getting full after 3 days. There were also a lot of sstables older than 1 day after 3 days.

Things that we tried:

- Change the compaction strategy to leveled. (helped a bit but not much)
- Use a big sstable size (10G) with leveled compaction to have more aggressive compaction. (helped a bit but not much)
- Upgrade Cassandra from 1.0 to 1.2 to use TTL histograms (didn't help at all since it has a key overlapping estimation algorithm that generates a 100% match. Although we don't have...)

Our column family structure is like this:

Event_data_cf: (we store event data. Event_id is randomly generated and each event has attributes like location=london)

row         data
event id    data blob

timeseries_cf: (key is the attribute that we want to index. It can be location=london. We didn't use secondary indexes because the indexes are dynamic.)

row         data
index key   time series of event id (event1_id, event2_id, ...)

timeseries_inv_cf: (this is used for removing an event by event row key.)

row         data
event id    set of index keys

Candidate Solution: Implementing time range partitions.

Each partition will have its own column family set and will be managed by the client.

Suppose that you want to have a 7 day retention period. Then you can configure the partition size as 1 day and have 7 active partitions at any time. Then you can drop inactive partitions (older than 7 days). Dropping will immediately remove the data from the disk. (With proper Cassandra.yaml configuration)

Storing an event:

Find the current partition p1
store event data to Event_data_cf_p1
store indexes to timeseries_cf_p1
store inverted indexes to timeseries_inv_cf_p1

A time range query with an index:

Find all partitions that belong to that time range
Read starting from the first partition until you reach the limit
.....

Could you please provide your comments and concerns?

Is there any other option that we can try?

What do you think about the candidate solution?

Does anyone have the same issue? How would you solve it in another way?

Thanks in advance!
Cem
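
The partition bookkeeping in the candidate solution above could be sketched roughly as follows. This is only an illustration in plain Python, not something from the thread: the `_p<N>` suffix derived from days since the epoch, the helper names, and the constants are all assumptions, and the actual writes, reads, and column family drops are left to whatever Cassandra client is in use.

```python
from datetime import datetime, timedelta, timezone

# Assumed values taken from the example in the mail: 1-day partitions, 7-day retention.
PARTITION_SIZE = timedelta(days=1)
RETENTION_PARTITIONS = 7
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
BASE_CFS = ("Event_data_cf", "timeseries_cf", "timeseries_inv_cf")


def partition_id(ts):
    """Return the integer id of the partition containing timestamp ts."""
    return (ts - EPOCH) // PARTITION_SIZE


def cf_names(pid):
    """Hypothetical naming scheme: one column family set per partition."""
    return [f"{base}_p{pid}" for base in BASE_CFS]


def partitions_for_range(start, end):
    """Partition ids overlapping [start, end], oldest first."""
    return list(range(partition_id(start), partition_id(end) + 1))


def expired_partitions(known_pids, now):
    """Partition ids that have left the retention window and could be dropped."""
    oldest_active = partition_id(now) - RETENTION_PARTITIONS + 1
    return sorted(p for p in known_pids if p < oldest_active)
```

With this layout, storing an event means writing to the three names returned by `cf_names(partition_id(now))`, a time range query walks `partitions_for_range(start, end)` in order until the limit is reached, and a periodic job drops the column families of every id returned by `expired_partitions`.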