Subject: Re: 8 million Cassandra data files on disk
From: Jeremiah Jordan
To: user@cassandra.apache.org
Date: Tue, 02 Aug 2011 16:19:27 -0500
Message-ID: <1312319967.4058.6.camel@us-wash-ch2ljq1.morningstar.com>

Connect with jconsole and trigger a garbage collection. Every file set
that has a -Compacted file with the same name will be deleted the next
time a full garbage collection runs, or when the node is restarted.
Those files have already been combined into new files; the old ones
just haven't been deleted yet.

On Tue, 2011-08-02 at 16:09 -0400, Yiming Sun wrote:
> Hi,
>
> I am new to Cassandra, and am hoping someone could help me understand
> the (large amount of small) data files on disk that Cassandra
> generates.
>
> The reason we are using Cassandra is because we are dealing with
> thousands to millions of small text files on disk, so we are
> experimenting with Cassandra hoping that by dropping the files
> contents into Cassandra, it will achieve more efficient disk usage
> because Cassandra is going to aggregate them into bigger files (one
> file per column family, according to the wiki).
>
> But after we pushed a subset of the files into a single node Cassandra
> v0.7.0 instance, we noted that in the Cassandra data directory for the
> keyspace, there are 8.5 million very small files, most are named
>
> -e-.Filter.db
> -e-.Compacted.db
> -e-.Index.db
> -e-.Statistics.db
>
> and among these files, the Compacted.db are always empty, Filter and
> Index are under 100 bytes, and Statistics are around 4k.
>
> What are these files? Why are there so many of them? We originally
> hope that Cassandra was going to solve our issue with the small files
> we have, but now it doesn't seem to help -- we still end up with tons
> of small files. Is there any way to reduce/combine these small
> files?
>
> Thanks.
>
> -- Y.
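
To see how many of those files are only waiting for cleanup, something like the following rough sketch could help. It groups file names by their shared base name and reports only the groups that also contain a Compacted marker, which per the reply above are the ones that go away on the next full garbage collection or restart. The names `find_obsolete` and `scan` are made up for this illustration, and the naming pattern is an assumption: a base name, then `-` or `.`, then a component name (Filter, Index, Statistics, Compacted as in the listing, plus Data, which is assumed here), with an optional `.db` suffix. Adjust the pattern to whatever is actually in your data directory.

```python
import os
import re
from collections import defaultdict

# Assumed naming pattern: <base><- or .><component>[.db]
# "Data" is an assumption; the other components come from the listing above.
_COMPONENT = re.compile(
    r"^(?P<base>.+?)[-.]"
    r"(?P<component>Data|Index|Filter|Statistics|Compacted)(?:\.db)?$"
)


def find_obsolete(filenames):
    """Return {base: [files]} for every base name that has a Compacted marker."""
    groups = defaultdict(list)
    marked = set()
    for name in filenames:
        match = _COMPONENT.match(name)
        if match is None:
            continue  # not a file name this sketch recognises
        base = match.group("base")
        groups[base].append(name)
        if match.group("component") == "Compacted":
            marked.add(base)
    # Only groups with a marker are waiting to be deleted.
    return {base: sorted(groups[base]) for base in sorted(marked)}


def scan(directory):
    """Apply find_obsolete to the file names found in one directory."""
    return find_obsolete(os.listdir(directory))
```

Calling `scan()` on the keyspace data directory and summing the lengths of the returned lists gives a count of files that should disappear once the garbage collection has run. The sketch only reads file names; it does not delete anything.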