From: Romain HARDOUIN
To: user@cassandra.apache.org
Subject: Why so many SSTables?
Date: Tue, 10 Apr 2012 16:24:01 +0200

Hi,

We are surprised by the number of files generated by Cassandra.
Our cluster consists of 9 nodes and each node handles about 35 GB.
We're using Cassandra 1.0.6 with LeveledCompactionStrategy.
We have 30 column families (CFs).

We've got roughly 45,000 files under the keyspace directory on each node:
ls -l /var/lib/cassandra/data/OurKeyspace/ | wc -l
44372

The biggest CF is spread over nearly 38,000 files:
ls -l Documents* | wc -l
37870

ls -l Documents*-Data.db | wc -l
7586
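
For reference, the same counts can be broken down per file type in one pass. This is only a minimal sketch: it assumes SSTable file names end in "-<Component>" (for example "-Data.db" or "-Index.db"), and the helper names component_of and count_components are made up for illustration. Like "ls -l Documents*", the prefix match also picks up any file whose name merely starts with "Documents". Note also that "ls -l" on a directory prints a leading "total" line, which "wc -l" counts too.

```python
from collections import Counter


def component_of(filename: str) -> str:
    """Return the trailing component of an SSTable file name, e.g. 'Data.db'.

    Assumption: names look like <cf>-<version>-<generation>-<Component>,
    so the component is whatever follows the last dash.
    """
    return filename.rsplit("-", 1)[-1]


def count_components(filenames, prefix="Documents"):
    """Count files per component for names starting with `prefix`.

    This mirrors `ls Documents* | wc -l`, but grouped by component.
    """
    return Counter(
        component_of(name) for name in filenames if name.startswith(prefix)
    )
```

With the keyspace directory listed via os.listdir("/var/lib/cassandra/data/OurKeyspace"), count_components(...)["Data.db"] should correspond to the 7586 figure above, and the sum of all counts to the 37870 figure.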

Many SSTables are about 4 MB:

19 MB -> 1 SSTable
12 MB -> 2 SSTables
11 MB -> 2 SSTables
9.2 MB -> 1 SSTable
7.0 MB to 7.9 MB -> 6 SSTables
6.0 MB to 6.4 MB -> 6 SSTables
5.0 MB to 5.4 MB -> 4 SSTables
4.0 MB to 4.7 MB -> 7139 SSTables
3.0 MB to 3.9 MB -> 258 SSTables
2.0 MB to 2.9 MB -> 35 SSTables
1.0 MB to 1.9 MB -> 13 SSTables
87 KB to 994 KB -> 87 SSTables
0 KB -> 32 SSTables
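
As a quick sanity check on the breakdown above, the bucket counts can be summed and compared with the number of -Data.db files. The sketch below does that with the counts copied from the list, and also shows one way such a breakdown could be built from raw file sizes. The function name size_histogram and the whole-megabyte bucketing are assumptions for illustration, not the exact method used to produce the list.

```python
from collections import Counter

MB = 1024 * 1024


def size_histogram(sizes_in_bytes):
    """Count files per whole-megabyte bucket (0 means under 1 MB, 4 means 4.x MB)."""
    return Counter(size // MB for size in sizes_in_bytes)


# Bucket counts copied from the list above, largest sizes first.
reported_counts = [1, 2, 2, 1, 6, 6, 4, 7139, 258, 35, 13, 87, 32]

total_sstables = sum(reported_counts)
share_4mb = 7139 / total_sstables  # fraction of SSTables in the 4.0 to 4.7 MB range

print(total_sstables)             # 7586, matching the -Data.db count
print(round(share_4mb * 100, 1))  # 94.1 (percent)
```

So about 94% of the SSTables of this CF fall in the 4.0 MB to 4.7 MB range.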

FYI, here is the CF information:

ColumnFamily: Documents
  Key Validation Class: org.apache.cassandra.db.marshal.BytesType
  Default column value validator: org.apache.cassandra.db.marshal.BytesType
  Columns sorted by: org.apache.cassandra.db.marshal.BytesType
  Row cache size / save period in seconds / keys to save : 0.0/0/all
  Row Cache Provider: org.apache.cassandra.cache.SerializingCacheProvider
  Key cache size / save period in seconds: 200000.0/14400
  GC grace seconds: 1728000
  Compaction min/max thresholds: 4/32
  Read repair chance: 1.0
  Replicate on write: true
  Column Metadata:
    Column Name: refUUID (72656655554944)
      Validation Class: org.apache.cassandra.db.marshal.BytesType
      Index Name: refUUID_idx
      Index Type: KEYS
  Compaction Strategy: org.apache.cassandra.db.compaction.LeveledCompactionStrategy
  Compression Options:
    sstable_compression: org.apache.cassandra.io.compress.SnappyCompressor

Is it a bug? If not, how can we tune Cassandra to avoid this?

Regards,

Romain