cassandra-user mailing list archives

From Mark Reddy <mark.re...@boxever.com>
Subject Re: Cassandra disk usage
Date Sun, 13 Apr 2014 18:17:09 GMT
>
> I will change the data I am storing to decrease the usage; for the value I
> will find some small value to store. Previously I used the same value, since
> this table is an index used only for searches and does not really have a value.


If you don't need a value, you don't have to store anything: you can store
the column name and leave the value empty. This is common practice.
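As a rough illustration of the saving, here is a plain-Python model of a row as a name-to-value mapping, using the approximate 15-byte per-column overhead mentioned later in this thread. The names and byte counts are illustrative, not actual Cassandra internals:

```python
OVERHEAD = 15  # approximate per-column overhead, in bytes

def row_size(row):
    """Rough on-disk size of a row: name + value + overhead per column."""
    return sum(len(name) + len(value) + OVERHEAD for name, value in row.items())

with_value = {"123456789012345": "123456789012345"}  # name duplicated in value
name_only = {"123456789012345": ""}                  # empty value

print(row_size(with_value))  # 45 bytes per column
print(row_size(name_only))   # 30 bytes per column
```

Dropping the duplicated value takes each column from 45 to 30 bytes once the overhead is counted.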

> 1) What are the recommended read and write consistency levels and replication
> factor for 3 nodes, with the option of increasing the number of servers in future?


Both consistency level and replication factor are tunable depending on
your application constraints. I'd say a CL of QUORUM and an RF of 3 is the
general practice.
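For reference, quorum is computed from the replication factor (RF), not from the total number of nodes in the cluster. A quick sketch of the arithmetic:

```python
def quorum(rf):
    """Replicas required for a QUORUM read or write: floor(RF/2) + 1."""
    return rf // 2 + 1

# With RF=3, QUORUM reads and writes touch 2 replicas, so the cluster
# tolerates the loss of any single replica while staying consistent.
print(quorum(3))  # 2
print(quorum(5))  # 3
```

This is why RF=3 with QUORUM is a common starting point: it survives one node being down, and adding more nodes later does not change the quorum size as long as RF stays at 3.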

> Still it has 1.5X of overall data; how can this be resolved and what is
> the reason for that?


As Michał pointed out, there is a 15-byte column overhead to consider here,
where:

total_column_size = column_name_size + column_value_size + 15


This link might shed some light on this:
http://www.datastax.com/documentation/cassandra/1.2/cassandra/architecture/architecturePlanningUserData_t.html
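A back-of-envelope check using the numbers from this thread (~100 million columns, 15-byte names, 15-byte values, ~15 bytes of per-column overhead; per-row overhead ignored since there are only 2 rows):

```python
COLUMN_OVERHEAD = 15  # bytes, approximate for this Cassandra version

def column_size(name_bytes, value_bytes):
    """total_column_size = column_name_size + column_value_size + overhead"""
    return name_bytes + value_bytes + COLUMN_OVERHEAD

total_bytes = 100_000_000 * column_size(15, 15)
print(total_bytes / 1e9)  # 4.5 (GB), matching the observed ~4.5GB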

> Also I see that the data is a different size on each node; does that mean
> that the servers are out of sync?


How much is it out by? Data size may differ due to deletes, and you
mentioned that you do deletes. What is the output of 'nodetool ring'?


On Sun, Apr 13, 2014 at 6:42 PM, Michal Michalski <
michal.michalski@boxever.com> wrote:

> > Each column has a name of 15 chars ( digits ) and the same 15 chars in
> > the value ( also digits ).
> > Each column should take 30 bytes.
>
> Remember the standard Cassandra column overhead, which is, as far
> as I remember, 15 bytes, so it's 45 bytes in total - 50% more than you
> estimated, which roughly matches your 3GB vs 4.5GB case.
>
> There's also a per-row overhead, but I'm not sure of its size in
> current C* versions - I remember it was about 25 bytes or so some time ago,
> but it's not significant in your case.
>
> Kind regards,
> Michał Michalski,
> michal.michalski@boxever.com
>
>
> On 13 April 2014 17:48, Yulian Oifa <oifa.yulian@gmail.com> wrote:
>
>> Hello Mark, and thanks for your reply.
>> 1) I store it as a UTF8 string. All digits are from 0x30 to 0x39 and should
>> take 1 byte per digit. Since all characters are digits, it should be 15
>> bytes.
>> 2) I will change the data I am storing to decrease the usage; for the value
>> I will find some small value to store. Previously I used the same value,
>> since this table is an index used only for searches and does not really
>> have a value.
>> 3) You are right, I read and write at quorum, and it was my mistake ( I
>> thought that if I write at quorum then data will be written to 2 nodes only ).
>> If I check the keyspace:
>> create keyspace USER_DATA
>>   with placement_strategy = 'NetworkTopologyStrategy'
>>   and strategy_options = [{19 : 3}]
>>   and durable_writes = true;
>>
>> it has a replication factor of 3.
>> Therefore I have several questions:
>> 1) What are the recommended read and write consistency levels and
>> replication factor for 3 nodes, with the option of increasing the number of
>> servers in future?
>> 2) It still has 1.5X the overall data; how can this be resolved and what is
>> the reason for that?
>> 3) Also I see that the data is a different size on each node; does that
>> mean that the servers are out of sync???
>>
>> Thanks and best regards
>> Yulian Oifa
>>
>>
>> On Sun, Apr 13, 2014 at 7:03 PM, Mark Reddy <mark.reddy@boxever.com>wrote:
>>
>>> What are you storing these 15 chars as: string, int, double, etc.? 15
>>> chars does not necessarily translate to 15 bytes.
>>>
>>> You may be mixing up replication factor and quorum when you say "Cassandra
>>> cluster has 3 servers, and data is stored in quorum ( 2 servers )." You
>>> read and write at quorum, (RF/2)+1 where RF=replication_factor, and your
>>> data is replicated to the number of nodes you specify in your replication
>>> factor. Could you clarify?
>>>
>>> Also, if you are concerned about disk usage, why are you storing the same
>>> 15-char value in both the column name and the value? You could store it as
>>> the name only and halve your data usage :)
>>>
>>>
>>>
>>>
>>> On Sun, Apr 13, 2014 at 4:26 PM, Yulian Oifa <oifa.yulian@gmail.com>wrote:
>>>
>>>> I have a column family with 2 rows.
>>>> The 2 rows have 100 million columns in total.
>>>> Each column has a name of 15 chars ( digits ) and the same 15 chars in
>>>> the value ( also digits ).
>>>> Each column should take 30 bytes.
>>>> Therefore all data should be approximately 3GB.
>>>> The Cassandra cluster has 3 servers, and data is stored in quorum ( 2
>>>> servers ).
>>>> Therefore each server should have 3GB*2/3=2GB of data for this column
>>>> family.
>>>> The table is almost never changed; data is only removed from it,
>>>> which possibly creates tombstones, but that should not increase the usage.
>>>> However, when I check the data I see that each server has more than 4GB
>>>> of data ( more than twice what it should be ).
>>>>
>>>> server 1:
>>>> -rw-r--r-- 1 root root 3506446057 Dec 26 12:02 freeNumbers-g-264-Data.db
>>>> -rw-r--r-- 1 root root  814699666 Dec 26 12:24 freeNumbers-g-281-Data.db
>>>> -rw-r--r-- 1 root root  198432466 Dec 26 12:27 freeNumbers-g-284-Data.db
>>>> -rw-r--r-- 1 root root   35883918 Apr 12 20:07 freeNumbers-g-336-Data.db
>>>>
>>>> server 2:
>>>> -rw-r--r-- 1 root root 3448432307 Dec 26 11:57 freeNumbers-g-285-Data.db
>>>> -rw-r--r-- 1 root root  762399716 Dec 26 12:22 freeNumbers-g-301-Data.db
>>>> -rw-r--r-- 1 root root  220887062 Dec 26 12:23 freeNumbers-g-304-Data.db
>>>> -rw-r--r-- 1 root root   54914466 Dec 26 12:26 freeNumbers-g-306-Data.db
>>>> -rw-r--r-- 1 root root   53639516 Dec 26 12:26 freeNumbers-g-305-Data.db
>>>> -rw-r--r-- 1 root root   53007967 Jan  8 15:45 freeNumbers-g-314-Data.db
>>>> -rw-r--r-- 1 root root     413717 Apr 12 18:33 freeNumbers-g-359-Data.db
>>>>
>>>>
>>>> server 3:
>>>> -rw-r--r-- 1 root root 4490657264 Apr 11 18:20 freeNumbers-g-358-Data.db
>>>> -rw-r--r-- 1 root root     389171 Apr 12 20:58 freeNumbers-g-360-Data.db
>>>> -rw-r--r-- 1 root root       4276 Apr 11 18:20
>>>> freeNumbers-g-358-Statistics.db
>>>> -rw-r--r-- 1 root root       4276 Apr 11 18:24
>>>> freeNumbers-g-359-Statistics.db
>>>> -rw-r--r-- 1 root root       4276 Apr 12 20:58
>>>> freeNumbers-g-360-Statistics.db
>>>> -rw-r--r-- 1 root root        976 Apr 11 18:20
>>>> freeNumbers-g-358-Filter.db
>>>> -rw-r--r-- 1 root root        208 Apr 11 18:24 freeNumbers-g-359-Data.db
>>>> -rw-r--r-- 1 root root         78 Apr 11 18:20
>>>> freeNumbers-g-358-Index.db
>>>> -rw-r--r-- 1 root root         52 Apr 11 18:24
>>>> freeNumbers-g-359-Index.db
>>>> -rw-r--r-- 1 root root         52 Apr 12 20:58
>>>> freeNumbers-g-360-Index.db
>>>> -rw-r--r-- 1 root root         16 Apr 11 18:24
>>>> freeNumbers-g-359-Filter.db
>>>> -rw-r--r-- 1 root root         16 Apr 12 20:58
>>>> freeNumbers-g-360-Filter.db
>>>>
>>>> When I try to compact, I get the following notification from the first
>>>> server:
>>>> INFO [CompactionExecutor:1604] 2014-04-13 18:23:07,260
>>>> CompactionController.java (line 146) Compacting large row
>>>> USER_DATA/freeNumbers:8bdf9678-6d70-11e3-85ab-80e385abf85d (4555076689
>>>> bytes) incrementally
>>>>
>>>> This confirms that there is around 4.5GB of data on that server alone.
>>>> Why does Cassandra take up so much data???
>>>>
>>>> Best regards
>>>> Yulian Oifa
>>>>
>>>>
>>>
>>
>
