From: aaron morton <aaron@thelastpickle.com>
To: user@cassandra.apache.org
Subject: Re: Freeing up disk space on Cassandra 1.1.5 with Size-Tiered compaction.
Date: Fri, 7 Dec 2012 16:43:04 +1300

> Meaning terabyte size databases.

Lots of people have TB sized systems. Just add more nodes.

300 to 400 GB is just a rough guideline. The bigger picture is considering how routine and non-routine maintenance tasks are going to be carried out.

Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 7/12/2012, at 4:38 AM, Edward Capriolo wrote:

> http://wiki.apache.org/cassandra/LargeDataSetConsiderations
>
> On Thu, Dec 6, 2012 at 9:53 AM, Poziombka, Wade L wrote:
>
> "Having so much data on each node is a potential bad day."
>
> Is this discussed somewhere in the Cassandra documentation (limits, practices etc)? We are also trying to load up quite a lot of data and have hit memory issues (bloom filter etc.) in 1.0.10. I would like to read up on big data usage of Cassandra. Meaning terabyte size databases.
>
> I do get your point about the amount of time required to recover a downed node. But this 300-400GB business is interesting to me.
>
> Thanks in advance.
>
> Wade
>
> From: aaron morton [mailto:aaron@thelastpickle.com]
> Sent: Wednesday, December 05, 2012 9:23 PM
> To: user@cassandra.apache.org
> Subject: Re: Freeing up disk space on Cassandra 1.1.5 with Size-Tiered compaction.
>
> Basically we were successful on two of the nodes. They both took ~2 days and 11 hours to complete and at the end we saw one very large file ~900GB and the rest much smaller (the overall size decreased). This is what we expected!
>
> I would recommend having up to 300GB to 400GB per node on a regular HDD with 1Gb networking.
>
> But on the 3rd node, we suspect major compaction didn't actually finish its job…
>
> The file list looks odd. Check the timestamps on the files. You should not have files older than when compaction started.
>
> 8GB heap
>
> The default is 4GB max nowadays.
>
> 1) Do you expect problems with the 3rd node during 2 more weeks of operations, in the conditions seen below?
>
> I cannot answer that.
>
> 2) Should we restart with leveled compaction next year?
>
> I would run some tests to see how it works for your workload.
>
> 4) Should we consider increasing the cluster capacity?
>
> IMHO yes.
>
> You may also want to do some experiments with turning compression on if it is not already enabled.
>
> Having so much data on each node is a potential bad day. If instead you had to move or repair one of those nodes, how long would it take for Cassandra to stream all the data over? (Or to rsync the data over.) How long does it take to run nodetool repair on the node?
>
> With RF 3, if you lose a node you have lost your redundancy. It's important to have a plan for how to get it back and how long it may take.
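A back-of-envelope calculation for the streaming question above, assuming the 1 Gb networking mentioned in the thread and a sustained rate of ~80 MB/s (the sustained rate is an assumption, not a measured figure):

```shell
# Rough best-case time to restream one ~1.1 TB node over 1 Gb/s networking.
# 1 Gb/s is ~125 MB/s theoretical; assume ~80 MB/s sustained in practice.
BYTES_MB=$((1100 * 1024))   # ~1.1 TB expressed in MB
RATE=80                     # assumed sustained MB/s
SECS=$((BYTES_MB / RATE))
printf 'best case ~%d hours to restream one node\n' $((SECS / 3600))
```

Real repairs and bootstraps also pay compaction and validation costs on top of raw transfer, so the actual number is higher; the point is that even the floor is measured in hours.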
>
> Hope that helps.
>
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> New Zealand
>
> @aaronmorton
> http://www.thelastpickle.com
>
> On 6/12/2012, at 3:40 AM, Alexandru Sicoe wrote:
>
> Hi guys,
> Sorry for the late follow-up but I waited to run major compactions on all 3 nodes at a time before replying with my findings.
>
> Basically we were successful on two of the nodes. They both took ~2 days and 11 hours to complete and at the end we saw one very large file ~900GB and the rest much smaller (the overall size decreased). This is what we expected!
>
> But on the 3rd node, we suspect major compaction didn't actually finish its job. First of all, nodetool compact returned much earlier than the rest - after one day and 15 hrs. Secondly, from the 1.4TB initially on the node only about 36GB were freed up (almost the same size as before). Saw nothing in the server log (debug not enabled). Below I pasted some more details about file sizes before and after compaction on this third node, and disk occupancy.
>
> The situation is maybe not so dramatic for us because in less than 2 weeks we will have a downtime till after the new year. During this we can completely delete all the data in the cluster and start fresh with TTLs for 1 month (as suggested by Aaron, and an 8GB heap as suggested by Alain - thanks).
>
> Questions:
>
> 1) Do you expect problems with the 3rd node during 2 more weeks of operations, in the conditions seen below?
> [Note: we expect the minor compactions to continue building up files but never really getting to compacting the large file, and thus not needing much temporary extra disk space.]
>
> 2) Should we restart with leveled compaction next year?
> [Note: Aaron was right, we have 1-week rows which get deleted after 1 month, which means older rows end up in big files => to free up space with SizeTiered we will have no choice but to run major compactions, which we don't know will work given that we get ~1TB / node / month. You can see we are at the limit!]
>
> 3) In case we keep SizeTiered:
>
>     - How can we improve the performance of our major compactions? (We left all config parameters at their defaults.) Would increasing compaction throughput interfere with writes and reads? What about multi-threaded compactions?
>
>     - Do we still need to run regular repair operations as well? Do these also do a major compaction, or are they completely separate operations?
>
> [Note: we have 3 nodes with RF=2, inserting at consistency level ONE and reading at consistency level ALL. We read primarily for exporting reasons - we export 1 week worth of data at a time.]
>
> 4) Should we consider increasing the cluster capacity?
> [We generate ~5 million new rows every week, which shouldn't come close to the hundreds of millions of rows per node mentioned by Aaron as the volumes that would create problems with bloom filters and indexes.]
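On the compaction-throughput question, the cap can be changed at runtime with nodetool; a sketch of the knobs involved (the 32 MB/s value is only an illustration, not a recommendation, and this needs a live node to run against):

```shell
# Raise the compaction throughput cap on a live node. The Cassandra 1.1
# default is 16 MB/s; 0 removes the throttle entirely. A higher cap finishes
# compactions sooner but competes with reads and writes for disk I/O.
nodetool -h $HOSTNAME setcompactionthroughput 32

# Check which compactions are running and how far along they are:
nodetool -h $HOSTNAME compactionstats
```

The setting is not persisted across restarts; to make it permanent, set compaction_throughput_mb_per_sec in cassandra.yaml as well.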
>
> Cheers,
> Alex
> ------------------
>
> The situation in the data folder
>
> before calling nodetool compact:
>
> du -csh /data_bst/cassandra/data/ATLAS/Data/*-Data.db
> 444G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-24370-Data.db
> 376G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-46431-Data.db
> 305G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-68959-Data.db
> 39G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-7352-Data.db
> 78G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-74076-Data.db
> 81G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-79663-Data.db
> 205M    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-80370-Data.db
> 20G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-80968-Data.db
> 20G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-82330-Data.db
> 20G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-83710-Data.db
> 4.9G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84015-Data.db
> 4.9G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84356-Data.db
> 4.9G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84696-Data.db
> 333M    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84707-Data.db
> 92M     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84712-Data.db
> 92M     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84717-Data.db
> 99M     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84722-Data.db
> 2.5G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-tmp-he-84723-Data.db
> 1.4T    total
>
> after nodetool compact returned:
>
> du -csh /data_bst/cassandra/data/ATLAS/Data/*-Data.db
> 444G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-24370-Data.db
> 910G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84723-Data.db
> 19G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-86229-Data.db
> 19G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-87639-Data.db
> 5.0G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-87923-Data.db
> 4.8G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-88261-Data.db
> 338M    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-88271-Data.db
> 339M    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-88292-Data.db
> 339M    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-88312-Data.db
> 98M
>
> Looking at the disk occupancy for the logical partition the data folder is in:
>
> df /data_bst
> Filesystem           1K-blocks       Used  Available Use% Mounted on
> /dev/sdb1            2927242720 1482502260 1444740460  51% /data_bst
>
> and the situation in the cluster
>
> nodetool -h $HOSTNAME ring (before major compaction)
> Address         DC           Rack   Status  State   Load     Effective-Ownership  Token
>                                                                                   113427455640312821154458202477256070484
> 10.146.44.17    datacenter1  rack1  Up      Normal  1.37 TB  66.67%               0
> 10.146.44.18    datacenter1  rack1  Up      Normal  1.04 TB  66.67%               56713727820156410577229101238628035242
> 10.146.44.32    datacenter1  rack1  Up      Normal  1.14 TB  66.67%               113427455640312821154458202477256070484
>
> nodetool -h $HOSTNAME ring (after major compaction) (Note: we were inserting data in the meantime)
> Address         DC           Rack   Status  State   Load     Effective-Ownership  Token
>                                                                                   113427455640312821154458202477256070484
> 10.146.44.17    datacenter1  rack1  Up      Normal  1.38 TB  66.67%               0
> 10.146.44.18    datacenter1  rack1  Up      Normal  1.08 TB  66.67%               56713727820156410577229101238628035242
> 10.146.44.32    datacenter1  rack1  Up      Normal  1.19 TB  66.67%               113427455640312821154458202477256070484
>
> On Fri, Nov 23, 2012 at 2:16 AM, aaron morton wrote:
>
> > From what I know having too much data on one node is bad, not really sure why, but I think that performance will go down due to the size of indexes and bloom filters (I may be wrong on the reasons but I'm quite sure you can't store too much data per node).
>
> If you have many hundreds of millions of rows on a node, the memory needed for bloom filters and index sampling can be significant.
These can both be tuned.
>
> If you have 1.1T per node, the time to do a compaction, repair or upgrade may be very significant. Also, the time taken to copy this data should you need to remove or replace a node may be prohibitive.
>
> > 2. Switch to Leveled compaction strategy.
>
> I would avoid making a change like that on an unstable / at-risk system.
>
> > - Our usage pattern is write once, read once (export) and delete once!
>
> The column TTL may be of use to you; it removes the need to do a delete.
>
> > - We were thinking of relying on the automatic minor compactions to free up space for us but as..
>
> There are some usage patterns which make life harder for STS. For example, if you have very long lived rows that are written to and deleted a lot, row fragments that have been around for a while will end up in bigger files, and these files get compacted less often.
>
> In this situation, if you are running low on disk space and you think there is a lot of deleted data in there, I would run a major compaction. A word of warning though: if you do this you will need to continue to do it regularly. Major compaction creates a single big file that will not get compacted often. There are ways to resolve this, and moving to LDB may help in the future.
>
> If you are stuck and worried about disk space it's what I would do. Once you are stable again then look at LDB: http://www.datastax.com/dev/blog/when-to-use-leveled-compaction
>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> New Zealand
>
> @aaronmorton
> http://www.thelastpickle.com
>
> On 23/11/2012, at 9:18 AM, Alain RODRIGUEZ wrote:
>
> > Hi Alexandru,
> >
> > "We are running a 3 node Cassandra 1.1.5 cluster with a 3TB Raid 0 disk per node for the data dir and separate disk for the commitlog, 12 cores, 24 GB RAM"
> >
> > I think you should tune your architecture in a very different way.
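Aaron's TTL suggestion can be sketched in CQL. The column family and column names below are hypothetical, and the TTL value simply encodes the 1-month retention discussed in the thread:

```shell
# Express 1 month in seconds for USING TTL: 30 days * 24 h * 3600 s.
TTL=$((30 * 24 * 3600))   # 2592000

# Emit the statement; USING TTL makes the columns expire automatically,
# so the write-once/read-once/delete-once cycle loses its explicit delete step.
echo "INSERT INTO events (key, payload) VALUES ('row-2012-48', 'blob') USING TTL ${TTL};"

# Against a live node this would be piped into cqlsh, e.g.:
#   echo "INSERT ... USING TTL ${TTL};" | cqlsh $HOSTNAME
```

Expired columns still occupy disk until compaction purges them past gc_grace, so TTL removes the delete workload but not the compaction dependency.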
> > From what I know having too much data on one node is bad; not really sure why, but I think that performance will go down due to the size of indexes and bloom filters (I may be wrong on the reasons but I'm quite sure you can't store too much data per node).
> >
> > Anyway, I think 6 nodes with half of these resources (6 cores / 12GB) would be better if you have the choice.
> >
> > "(12GB to Cassandra heap)."
> >
> > The max heap recommended is 8GB, because if you use more than these 8GB the GC jobs will start decreasing your performance.
> >
> > "We now have 1.1 TB worth of data per node (RF = 2)."
> >
> > You should use RF=3 unless consistency or SPOF doesn't matter to you.
> >
> > With RF=2 you are obliged to write at CL.ONE to remove the single point of failure.
> >
> > "1. Start issuing regular major compactions (nodetool compact).
> >      - This is not recommended:
> >             - Stops minor compactions.
> >             - Major performance hit on node (very bad for us because need to be taking data all the time)."
> >
> > Actually, major compaction *does not* stop minor compactions. What happens is that due to the size of the sstable that remains after your major compaction, it will never be compacted with the upcoming new sstables, and because of that, your read performance will go down until you run another major compaction.
> >
> > "2. Switch to Leveled compaction strategy.
> >       - It is mentioned to help with deletes and disk space usage. Can someone confirm?"
> >
> > From what I know, Leveled compaction will not free disk space, but it will allow you to use a greater percentage of your total disk space (50% max for size-tiered compaction vs about 80% for leveled compaction).
> >
> > "Our usage pattern is write once, read once (export) and delete once!"
> >
> > In this case, I think that leveled compaction fits your needs.
> >
> > "Can anyone suggest which (if any) is better? Are there better solutions?"
> >
> > Are your sstables compressed? You have 2 types of built-in compression, and you may use them depending on the model of each of your CFs.
> >
> > see: http://www.datastax.com/docs/1.1/operations/tuning#configure-compression
> >
> > Alain
> >
> > 2012/11/22 Alexandru Sicoe <adsicoe@gmail.com>
> > We are running a 3 node Cassandra 1.1.5 cluster with a 3TB Raid 0 disk per node for the data dir and separate disk for the commitlog, 12 cores, 24 GB RAM (12GB to Cassandra heap).
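For reference, switching a column family to leveled compaction and enabling compression in 1.1 can both be done from cassandra-cli. The ATLAS keyspace and Data CF names are taken from the du listing earlier in the thread; the option values are illustrative assumptions, not recommendations, and this needs a live node:

```shell
# Run against a live node; the new settings apply to sstables written
# after the change (existing sstables are rewritten as they are compacted).
cassandra-cli -h $HOSTNAME <<'EOF'
use ATLAS;
update column family Data with
  compaction_strategy = 'LeveledCompactionStrategy' and
  compaction_strategy_options = {sstable_size_in_mb: 160} and
  compression_options = {sstable_compression: SnappyCompressor, chunk_length_kb: 64};
EOF
```

As Aaron notes above, this kind of change is better made on a stable cluster; switching strategies triggers a substantial recompaction of existing data.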