From: Robert Wille <rwille@fold3.com>
To: user@cassandra.apache.org
Date: Tue, 2 Jul 2013 08:57:51 -0600
Subject: Re: very inefficient operation with tombstones

I've seen the same thing.

From: Sylvain Lebresne <sylvain@datastax.com>
Reply-To: user@cassandra.apache.org
Date: Tue, 2 Jul 2013 08:32:06 +0200
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: very inefficient operation with tombstones

This is https://issues.apache.org/jira/browse/CASSANDRA-5677.

--
Sylvain

On Tue, Jul 2, 2013 at 6:04 AM, Mohica Jasha <mohica.jasha@gmail.com> wrote:
> Querying a table with 5000 tombstones takes 3 minutes to complete,
> but querying the same table with the same data pattern and 10,000 live
> entries takes a fraction of a second.
>
> Details:
> 1. Created the following table:
> CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'};
> use test;
> CREATE TABLE job_index (stage text, "timestamp" text, PRIMARY KEY (stage, "timestamp"));
>
> 2. Inserted 5000 entries into the table:
> INSERT INTO job_index (stage, timestamp) VALUES ('a', '00000001');
> INSERT INTO job_index (stage, timestamp) VALUES ('a', '00000002');
> ...
> INSERT INTO job_index (stage, timestamp) VALUES ('a', '00004999');
> INSERT INTO job_index (stage, timestamp) VALUES ('a', '00005000');
>
> 3. Flushed the table:
> nodetool flush test job_index
>
> 4. Deleted the 5000 entries:
> DELETE FROM job_index WHERE stage = 'a' AND timestamp = '00000001';
> DELETE FROM job_index WHERE stage = 'a' AND timestamp = '00000002';
> ...
> DELETE FROM job_index WHERE stage = 'a' AND timestamp = '00004999';
> DELETE FROM job_index WHERE stage = 'a' AND timestamp = '00005000';
>
> 5. Flushed the table:
> nodetool flush test job_index
>
> 6. Querying the table takes 3 minutes to complete:
> cqlsh:test> SELECT * FROM job_index LIMIT 20000;
> tracing:
> http://pastebin.com/jH2rZN2X
>
> While the query was executing I saw a lot of GC entries in Cassandra's log:
> DEBUG [ScheduledTasks:1] 2013-07-01 23:47:59,221 GCInspector.java (line 121) GC for ParNew: 30 ms for 6 collections, 263993608 used; max is 2093809664
> DEBUG [ScheduledTasks:1] 2013-07-01 23:48:00,222 GCInspector.java (line 121) GC for ParNew: 29 ms for 6 collections, 186209616 used; max is 2093809664
> DEBUG [ScheduledTasks:1] 2013-07-01 23:48:01,223 GCInspector.java (line 121) GC for ParNew: 29 ms for 6 collections, 108731464 used; max is 2093809664
>
> It seems that something very inefficient is happening in managing tombstones.
>
> If I start with a clean table and do the following:
> 1. insert 5000 entries
> 2. flush to disk
> 3. insert 5000 new entries
> 4. flush to disk
> then querying job_index for all 10,000 entries takes a fraction of a second:
> tracing:
> http://pastebin.com/scUN9JrP
>
> The fact that iterating over 5000 tombstones takes 3 minutes while iterating over 10,000 live cells takes a fraction of a second suggests that something very inefficient is happening in managing tombstones.
>
> I'd appreciate it if a developer could look into this.
>
> -M
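
The original report doesn't say how the 5000 statements in steps 2 and 4 were produced. For anyone who wants to reproduce this without typing them by hand, here is a minimal sketch of one way to generate them (the script and the .cql file names are my own, not part of the report):

# generate_job_index_repro.py -- hypothetical helper: writes the INSERT and
# DELETE statements from steps 2 and 4 into files that cqlsh can execute.

INSERTS = "inserts.cql"
DELETES = "deletes.cql"

with open(INSERTS, "w") as ins, open(DELETES, "w") as dels:
    for i in range(1, 5001):
        ts = "%08d" % i  # zero-padded to match '00000001' .. '00005000'
        ins.write("INSERT INTO job_index (stage, timestamp) VALUES ('a', '%s');\n" % ts)
        dels.write("DELETE FROM job_index WHERE stage = 'a' AND timestamp = '%s';\n" % ts)

The generated files can be run with cqlsh -k test -f inserts.cql, followed by nodetool flush test job_index, then cqlsh -k test -f deletes.cql and a second flush, matching steps 2 through 5 above.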
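
The pastebin links above are cqlsh request traces. Assuming Cassandra 1.2 or later (where cqlsh supports request tracing), the same kind of trace can be captured directly in a cqlsh session:

cqlsh:test> TRACING ON;
cqlsh:test> SELECT * FROM job_index LIMIT 20000;
cqlsh:test> TRACING OFF;

With tracing on, cqlsh prints the trace events for each query, which is where the per-step timings in the pastebins come from.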