From: Jonathan Ellis
Date: Tue, 2 Aug 2011 16:36:40 -0500
Subject: Re: 8 million Cassandra data files on disk
To: user@cassandra.apache.org
References: <1312319967.4058.6.camel@us-wash-ch2ljq1.morningstar.com>

I don't remember a removing-compacted-files bug in 0.7.0, but you should
absolutely upgrade to 0.7.8 for several dozen other fixes, including
some severe ones -- see NEWS.txt.

On Tue, Aug 2, 2011 at 4:29 PM, Yiming Sun wrote:
> Hi Jeremiah,
>
> Thank you for the information - it certainly is a relief.  Two questions
> though:
>
> 1. I came across an old thread which seemed to be saying 0.7.0 cassandra has
> a bug and doesn't remove these compacted files properly.  Should we upgrade to
> a newer version that has this bug fixed?
>
> 2. Must we do the garbage collection via Jconsole manually?  Is there
> any way I can force the GC in our code? (we are using Hector as our java
> client).
>
> Thanks!
>
> On Tue, Aug 2, 2011 at 5:19 PM, Jeremiah Jordan wrote:
>>
>> Connect with jconsole and run garbage collection.
>> All of the files that have a -Compacted with the same name will get
>> deleted the next time a full garbage collection runs, or when the node
>> is restarted.  They have already been combined into new files, the old
>> ones just haven't been deleted yet.
>>
>> On Tue, 2011-08-02 at 16:09 -0400, Yiming Sun wrote:
>> > Hi,
>> >
>> > I am new to Cassandra, and am hoping someone could help me understand
>> > the (large amount of small) data files on disk that Cassandra
>> > generates.
>> >
>> > The reason we are using Cassandra is because we are dealing with
>> > thousands to millions of small text files on disk, so we are
>> > experimenting with Cassandra hoping that by dropping the files'
>> > contents into Cassandra, it will achieve more efficient disk usage
>> > because Cassandra is going to aggregate them into bigger files (one
>> > file per column family, according to the wiki).
>> >
>> > But after we pushed a subset of the files into a single node Cassandra
>> > v0.7.0 instance, we noted that in the Cassandra data directory for the
>> > keyspace, there are 8.5 million very small files, most are named
>> >
>> >     -e-.Filter.db
>> >     -e-.Compacted.db
>> >     -e-.Index.db
>> >     -e-.Statistics.db
>> >
>> > and among these files, the Compacted.db are always empty, Filter and
>> > Index are under 100 bytes, and Statistics are around 4k.
>> >
>> > What are these files? Why are there so many of them?  We originally
>> > hoped that Cassandra was going to solve our issue with the small files
>> > we have, but now it doesn't seem to help -- we still end up with tons
>> > of small files.  Is there any way to reduce/combine these small
>> > files?
>> >
>> > Thanks.
>> >
>> > -- Y.
>> >
>

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
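
On question 2 in the quoted thread (forcing the GC from code instead of clicking through jconsole): the following is only a rough sketch, not something taken from the thread or from Cassandra itself. It assumes the node exposes the standard JVM platform MBeans over JMX/RMI without authentication, and it uses only the stock java.lang.management and javax.management APIs, not Hector. The class name ForceFullGc, the helper names, and the default host and port (localhost, 8080) are made up for illustration; substitute whatever JMX port your node is actually configured with.

```java
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Hypothetical helper: asks a JVM reachable over JMX to run a garbage collection.
class ForceFullGc {

    // Builds the usual JMX-over-RMI service URL for the given host and port.
    static String jmxUrl(String host, int port) {
        return "service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi";
    }

    // Invokes gc() on the target JVM's Memory MXBean. This is the same
    // operation as System.gc() inside that JVM, so it is only a request.
    static void requestGc(MBeanServerConnection connection) throws IOException {
        MemoryMXBean memory = ManagementFactory.newPlatformMXBeanProxy(
                connection, ManagementFactory.MEMORY_MXBEAN_NAME, MemoryMXBean.class);
        memory.gc();
    }

    public static void main(String[] args) throws IOException {
        // Assumed defaults; pass the real host and JMX port as arguments.
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 8080;

        JMXConnector connector =
                JMXConnectorFactory.connect(new JMXServiceURL(jmxUrl(host, port)));
        try {
            requestGc(connector.getMBeanServerConnection());
        } finally {
            connector.close();
        }
    }
}
```

Run it as `java ForceFullGc <host> <jmx-port>` from cron or a maintenance script. Note that the JVM may ignore the request entirely if it was started with -XX:+DisableExplicitGC, in which case restarting the node remains the other way mentioned above to get the compacted files removed.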