From: Sheng Chen
To: user@cassandra.apache.org
Date: Mon, 4 Apr 2011 01:46:24 +0800
Subject: Re: Endless minor compactions after heavy inserts

I think if I can keep each sstable file at a proper size, the hot
data/index files may be able to fit into memory, at least on some
occasions.

In my use case, I want to use cassandra to store a large amount of log
data. There will be multiple nodes, and each node has 10*2TB disks to hold
as much data as possible, ideally 20TB (about 100 billion rows) in one
node. Reads will be much less frequent than writes. A read latency within
1 second is acceptable.

Is it possible? Do you have any advice on this design?
Thank you.

Sheng

2011/4/3 aaron morton <aaron@thelastpickle.com>

> With only one data file your reads would use the least amount of IO to
> find the data.
>
> Most people have multiple nodes and probably fewer disks, so each node
> may have a TB or two of data. How much capacity do your 10 disks give?
> Will you be running multiple nodes in production?
>
> Aaron
>
> On 2 Apr 2011, at 12:45, Sheng Chen wrote:
>
> Thank you very much.
>
> The major compaction will merge everything into one big file, which
> would be very large.
> Is there any way to control the number or size of files created by major
> compaction?
> Or, is there a recommended number or size of files for cassandra to
> handle?
>
> Thanks. I see the trigger of my minor compaction is
> OperationsInMillions. It is a total number of operations, which I had
> thought was a per-second rate.
>
> Cheers,
> Sheng
>
> 2011/4/1 aaron morton <aaron@thelastpickle.com>
>
>> If you are doing some sort of bulk load you can disable minor
>> compactions by setting the min_compaction_threshold and
>> max_compaction_threshold to 0. Then once your insert is complete, run
>> a major compaction via nodetool before turning minor compaction back
>> on.
>>
>> You can also reduce the compaction thread's priority, see
>> compaction_thread_priority in the yaml file.
>>
>> The memtable will be flushed when either the MB or ops throughput is
>> triggered. If you are seeing a lot of memtables smaller than the MB
>> threshold then the ops threshold has probably been triggered. Look for
>> a log message at INFO level starting with "Enqueuing flush of Memtable"
>> that will tell you how many bytes and ops the memtable had when it was
>> flushed. Try increasing the ops threshold and see what happens.
>>
>> Your change in the compaction threshold may not have had an effect
>> because the compaction process was already running.
>>
>> AFAIK the best way to get the best out of your 10 disks will be to use
>> a dedicated mirror for the commit log and a stripe set for the data.
>>
>> Hope that helps.
>> Aaron
>>
>> On 1 Apr 2011, at 14:52, Sheng Chen wrote:
>>
>> > I've got a single node of cassandra 0.7.4, and I used the java stress
>> > tool to insert about 100 million records.
>> > The inserts took about 6 hours (45k inserts/sec) but the following
>> > minor compactions have lasted for 2 days and the pending compaction
>> > jobs are still increasing.
>> >
>> > From jconsole I can read the MemtableThroughputInMB=1499,
>> > MemtableOperationsInMillions=7.0
>> > But in my data directory, I got hundreds of 438MB data files, which
>> > should be the cause of the minor compactions.
>> >
>> > I tried to set the compaction threshold by nodetool, but it didn't
>> > seem to take effect (no change in pending compaction tasks).
>> > After restarting the node, my setting is lost.
>> >
>> > I want to distribute the read load across my disks (10 disks in xfs,
>> > LVM), so I don't want to do a major compaction.
>> > So, what can I do to keep the sstable files at a reasonable size, or
>> > to make the minor compactions faster?
>> >
>> > Thank you in advance.
>> > Sheng

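
The flush rule described in the quoted thread (a memtable is flushed when either the MB threshold or the operations threshold is reached) can be illustrated with a small Python sketch. This is a simplified model, not Cassandra's actual code: the function names are made up for illustration, every operation is assumed to add the same number of bytes, and the 438MB on-disk file size is assumed to approximate the memtable size at flush time. The inputs are the values reported from jconsole in the thread (MemtableThroughputInMB=1499, MemtableOperationsInMillions=7.0).

```python
MB = 1024 * 1024


def first_trigger(bytes_per_op, throughput_mb, ops_millions):
    """Return (trigger, size_mb): which threshold is reached first and the
    memtable size in MB at that moment, in this simplified model."""
    ops_limit = ops_millions * 1_000_000
    # Number of operations needed to fill the memtable up to the MB threshold.
    ops_to_fill = throughput_mb * MB / bytes_per_op
    if ops_limit < ops_to_fill:
        return "ops", ops_limit * bytes_per_op / MB
    return "throughput", float(throughput_mb)


def ops_millions_to_fill(bytes_per_op, throughput_mb):
    """Operations threshold (in millions) at which the MB threshold would be
    reached at the same time, for a fixed number of bytes per operation."""
    return throughput_mb * MB / bytes_per_op / 1_000_000


# Assumption: 438MB files flushed after 7.0 million operations imply this
# average size per operation.
bytes_per_op = 438 * MB / 7_000_000

trigger, size_mb = first_trigger(bytes_per_op, 1499, 7.0)
needed = ops_millions_to_fill(bytes_per_op, 1499)

print(f"first threshold reached: {trigger}, memtable size {size_mb:.0f} MB")
print(f"ops threshold needed to reach 1499 MB: about {needed:.1f} million")
```

Under these assumptions the operations threshold is reached long before the MB threshold, which fits the suggestion in the thread to try raising the operations threshold.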