incubator-cassandra-user mailing list archives

From Rustam Aliyev <rustam.li...@code.az>
Subject Re: Problems with shuffle
Date Sat, 13 Apr 2013 23:56:36 GMT
Just a follow-up on this issue. Due to the cost of the shuffle, we decided 
not to do it. Recently we added a new node and ended up with a poorly 
balanced cluster:

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load      Tokens  Owns   Host ID                               Rack
UN  10.0.1.8   52.28 GB  260     18.3%  d28df6a6-c888-4658-9be1-f9e286368dce  rack1
UN  10.0.1.11  55.21 GB  256     9.4%   7b0cf3c8-0c42-4443-9b0c-68f794299443  rack1
UN  10.0.1.2   49.03 GB  259     17.9%  2d308bc3-1fd7-4fa4-b33f-cbbbdc557b2f  rack1
UN  10.0.1.4   48.51 GB  255     18.4%  c253dcdf-3e93-495c-baf1-e4d2a033bce3  rack1
UN  10.0.1.1   67.14 GB  253     17.9%  4f77fd70-b134-486b-9c25-cfea96b6d412  rack1
UN  10.0.1.3   47.65 GB  253     18.0%  4d03690d-5363-42c1-85c2-5084596e09fc  rack1

It looks like the new node took an equal number of vnodes from each of the 
other nodes - which is good. However, it's not clear why it ended up owning 
roughly half as much as the other nodes.

How exactly does Cassandra with vnodes decide how many vnodes to move?
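
To sanity-check that 9.4% figure I put together the rough simulation below. 
It rests on two assumptions of mine (not verified against the code): that a 
joining 1.2 node with num_tokens set and no initial_token simply picks 256 
random tokens, and that after our vnode upgrade without a shuffle the 
existing ~1280 tokens are still roughly evenly spaced around the ring:

    import random

    RING = 2 ** 64                  # size of the Murmur3 token space
    OLD_NODES, VNODES = 5, 256

    def new_node_ownership():
        # Assumption: the 1280 pre-existing tokens are evenly spaced
        # (the vnode upgrade without shuffle just split each node's range).
        old = [i * RING // (OLD_NODES * VNODES)
               for i in range(OLD_NODES * VNODES)]
        # Assumption: the joining node picks num_tokens random tokens.
        new = set(random.randrange(RING) for _ in range(VNODES))
        ring = sorted(set(old) | new)
        # A token owns the range back to its predecessor (with wrap-around),
        # so the new node's share is the sum of those ranges over its tokens.
        owned = sum(cur - prev
                    for prev, cur in zip([ring[-1] - RING] + ring, ring)
                    if cur in new)
        return 100.0 * owned / RING

    print("new node owns ~%.1f%% of the ring"
          % (sum(new_node_ownership() for _ in range(20)) / 20))

For me this comes out at a little over 9%, suspiciously close to what 
nodetool reports, so perhaps this is just what random token selection gives 
you against evenly spaced tokens, and ownership only evens out after a 
shuffle or as more nodes join. I'd appreciate it if someone could confirm 
or correct that reasoning.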

Btw, during JOINING the nodetool status command does not show any 
information about the joining node; it appears only once the join has 
finished (on v1.2.3).

-- Rustam


On 08/04/2013 22:33, Rustam Aliyev wrote:
> After 2 days of endless compactions and streaming I had to stop and 
> cancel the shuffle. One of the nodes even complained that there was no 
> free disk space left (usage grew from 30GB to 400GB). After all these 
> problems, the number of moved tokens was less than 40 (out of 1280!).
>
> Now, when the nodes start they report duplicate ranges. I wonder how bad 
> that is and how I can get rid of it?
>
>  INFO [GossipStage:1] 2013-04-09 02:16:37,920 StorageService.java 
> (line 1386) Nodes /10.0.1.2 and /10.0.1.1 have the same token 
> 99027485685976232531333625990885670910.  Ignoring /10.0.1.2
>  INFO [GossipStage:1] 2013-04-09 02:16:37,921 StorageService.java 
> (line 1386) Nodes /10.0.1.2 and /10.0.1.4 have the same token 
> 43199909863009765869373729459111198718.  Ignoring /10.0.1.2
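>
> In the meantime I hacked together the small script below just to see how 
> widespread this is. It simply scans the log for the messages quoted above 
> and summarises which tokens are claimed by more than one node and which 
> endpoints end up being ignored (the log path and the regex are my own 
> guesses based on those lines):
>
>     import re, sys
>     from collections import defaultdict
>
>     # Matches the "have the same token" lines quoted above.
>     PAT = re.compile(r"Nodes (\S+) and (\S+) have the same token (\d+)\."
>                      r"\s+Ignoring (\S+)")
>
>     dupes = defaultdict(set)       # token -> endpoints claiming it
>     ignored = set()                # endpoints Cassandra says it ignores
>     with open(sys.argv[1]) as log:     # e.g. path to system.log
>         for line in log:
>             m = PAT.search(line)
>             if m:
>                 a, b, token, loser = m.groups()
>                 dupes[token].update([a, b])
>                 ignored.add(loser)
>
>     for token, nodes in sorted(dupes.items()):
>         print("%s claimed by %s" % (token, ", ".join(sorted(nodes))))
>     print("ignored endpoints: %s" % (", ".join(sorted(ignored)) or "none"))
>
> It doesn't fix anything, of course, but it does show how many tokens are 
> affected.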
>
> Overall, I'm not sure how bad it is to leave the data unshuffled (I read 
> the DataStax blog post, but it wasn't clear on this). When adding a new 
> node, wouldn't it be assigned ranges randomly from all nodes anyway?
>
> Some other notes inline below:
>
> On 08/04/2013 15:00, Eric Evans wrote:
>> [ Rustam Aliyev ]
>>> Hi,
>>>
>>> After upgrading to vnodes I created and enabled a shuffle operation
>>> as suggested. After it had run for a couple of hours I had to
>>> disable it because the nodes were not keeping up with compactions. I
>>> repeated this process 3 times (enable/disable).
>>>
>>> I have 5 nodes and each of them had ~35GB. After the shuffle operations
>>> described above some nodes are now reaching ~170GB. In the log files
>>> I can see the same files transferred 2-4 times to the same host within
>>> the same shuffle session. Worst of all, after all of this only 20
>>> vnodes out of 1280 had been transferred. If it continues at the same
>>> speed it will take about a month or two to complete the shuffle.
>> As Edward says, you'll need to issue a cleanup post-shuffle if you expect
>> to see disk usage match your expectations.
>>
>>> I had few question to better understand shuffle:
>>>
>>> 1. Does disabling and re-enabling shuffle start the shuffle process from
>>>     scratch, or does it resume from the last point?
>> It resumes.
>>
>>> 2. Will vnode reallocation speed up as the shuffle proceeds, or will it
>>>     remain at the same rate?
>> The shuffle proceeds synchronously, 1 range at a time; it's not going to
>> speed up as it progresses.
>>
>>> 3. Why do I see multiple transfers of the same file to the same host?
>>> e.g.:
>>>
>>>     INFO [Streaming to /10.0.1.8:6] 2013-04-07 14:27:10,038
>>>     StreamReplyVerbHandler.java (line 44) Successfully sent
>>> /u01/cassandra/data/Keyspace/Metadata/Keyspace-Metadata-ib-111-Data.db
>>>     to /10.0.1.8
>>>     INFO [Streaming to /10.0.1.8:7] 2013-04-07 16:27:07,427
>>>     StreamReplyVerbHandler.java (line 44) Successfully sent
>>> /u01/cassandra/data/Keyspace/Metadata/Keyspace-Metadata-ib-111-Data.db
>>>     to /10.0.1.8
>> I'm not sure, but perhaps that file contained data for two different
>> ranges?
> Does that mean that if I have a huge file (e.g. 20GB) which contains a lot 
> of ranges (let's say 100), it will be transferred once per range (20GB*100)?
>>
>>> 4. When I enable/disable shuffle I receive warning messages such as
>>>     the ones below. Do I need to worry about them?
>>>
>>>     cassandra-shuffle -h localhost disable
>>>     Failed to enable shuffling on 10.0.1.1!
>>>     Failed to enable shuffling on 10.0.1.3!
>> Is that the verbatim output?  Did it report failing to enable when you
>> tried to disable?
> Yes, this is the verbatim output. It reports a failure to enable as well 
> as to disable. Nodes .1.1 and .1.3 did not go into RELOCATING unless I ran 
> the cassandra-shuffle enable command on them locally.
>>
>> As a rule of thumb though, you don't want a disable/enable to result in
>> only a subset of nodes shuffling.  Are there no other errors? What do
>> the logs say?
> No errors in logs. Only INFO about streams and WARN about relocation.
>>
>>> I couldn't find much documentation on shuffle; I only read through JIRA
>>> and Eric's original proposal.
>

