cassandra-user mailing list archives

From Anishek Agarwal <anis...@gmail.com>
Subject Re: Cassandra nodes reduce disks per node
Date Thu, 25 Feb 2016 11:28:54 GMT
Nice thanks !

On Thu, Feb 25, 2016 at 1:51 PM, Alain RODRIGUEZ <arodrime@gmail.com> wrote:

> For what it is worth, I finally wrote a blog post about this -->
> http://thelastpickle.com/blog/2016/02/25/removing-a-disk-mapping-from-cassandra.html
>
> If you are not done yet, every step is detailed in there.
>
> C*heers,
> -----------------------
> Alain Rodriguez - alain@thelastpickle.com
> France
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> 2016-02-19 10:04 GMT+01:00 Alain RODRIGUEZ <arodrime@gmail.com>:
>
>>> Alain, thanks for sharing!  I'm confused why you do so many repetitive
>>> rsyncs.  Just being cautious or is there another reason?  Also, why do you
>>> have --delete-before when you're copying data to a temp (assumed empty)
>>> directory?
>>
>>
>>> Since they are immutable, I do a first sync while everything is up and
>>> running to the new location, which runs really long. Meanwhile new ones
>>> are created and I sync them again online, with far fewer files to copy
>>> now. After that I shut down the node, and my last rsync now has to copy
>>> only a few files, which is quite fast, so the downtime for that node is
>>> within minutes.
>>
>>
>> Jan's guess is right, except for the "immutable" part. Compaction can
>> make big files go away, replaced by bigger ones you'll have to copy again.
>>
>> Here is a detailed explanation of why I did it this way.
>>
>> More precisely, let's say we have 10 files of 100 GB on the disk to
>> remove (let's call it 'old-dir').
>>
>> I run a first rsync to an empty folder (let's call it 'tmp-dir') on the
>> disk that will remain after the operation. Let's say this takes about 10
>> hours. This can be run on all nodes in parallel though.
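>>
>> For illustration, that first pass is just an rsync from old-dir to
>> tmp-dir. The mount points below are made up; the real paths depend on
>> your data_file_directories:
>>
>>     rsync -av /mnt/disk-to-remove/cassandra/data/ /mnt/remaining-disk/cassandra/tmp-dir/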
>>
>> So I now have 10 files of 100 GB in tmp-dir. But meanwhile one compaction
>> was triggered, and old-dir now holds 6 files of 100 GB and 1 of 350 GB.
>>
>> At this point I disable compaction and stop any running ones.
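>>
>> For example (nodetool subcommands vary slightly between versions, so
>> check yours):
>>
>>     nodetool disableautocompaction
>>     nodetool stop COMPACTION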
>>
>> My second rsync has to remove from tmp-dir the 4 files that were
>> compacted away; that's why I use '--delete-before'. As tmp-dir needs to
>> mirror old-dir, this is fine. This new pass takes 3.5 hours, also
>> runnable in parallel. (Keep in mind C* won't compact anything for those
>> 3.5 hours; that's why I did not stop compaction before the first rsync.
>> In my case the dataset was 2 TB big.)
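>>
>> That second pass is the same rsync plus the delete option, something
>> like this (same made-up paths as above); the 3rd pass later on is simply
>> this command run once more:
>>
>>     rsync -av --delete-before /mnt/disk-to-remove/cassandra/data/ /mnt/remaining-disk/cassandra/tmp-dir/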
>>
>> At this point I have 950 GB in tmp-dir, but meanwhile clients have kept
>> writing to the disk, let's say 50 GB more.
>>
>> The 3rd rsync will take 0.5 hours; no compaction ran, so it only has to
>> add the diff to tmp-dir. Still runnable in parallel.
>>
>> Then the script stops the node, so this part should be run sequentially,
>> and performs 2 more rsyncs. The first one takes the diff between the end
>> of the 3rd rsync and the moment you stop the node; it should be a few
>> seconds, minutes maybe, depending on how soon you run the script after
>> the 3rd rsync ended. The second rsync in the script is a 'useless' one; I
>> just like to control things. I run it and expect it to report no diff. It
>> is just a way to stop the script if, for some reason, data is still being
>> appended to old-dir.
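>>
>> The sequential part then boils down to something like this (service name
>> and paths are illustrative, not the exact commands from my script):
>>
>>     sudo service cassandra stop
>>     # catch whatever was written between the 3rd pass and the stop
>>     rsync -av --delete-before /mnt/disk-to-remove/cassandra/data/ /mnt/remaining-disk/cassandra/tmp-dir/
>>     # 'useless' control pass: expect it to report nothing left to transfer
>>     rsync -av --delete-before /mnt/disk-to-remove/cassandra/data/ /mnt/remaining-disk/cassandra/tmp-dir/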
>>
>> Then I just move all the files from tmp-dir to new-dir (the proper data
>> dir remaining after the operation). This is an instant operation: the
>> files are not really moved, as tmp-dir is already on that disk, so the
>> move is only a filesystem rename.
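>>
>> Conceptually, for each keyspace/table directory (the names below are
>> placeholders), that move is just:
>>
>>     mv /mnt/remaining-disk/cassandra/tmp-dir/my_ks/my_table/* \
>>        /mnt/remaining-disk/cassandra/data/my_ks/my_table/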
>>
>> I finally unmount and rm -rf old-dir.
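>>
>> That cleanup, plus dropping the old mount point from cassandra.yaml
>> before restarting the node, looks roughly like this (mount point still
>> illustrative):
>>
>>     sudo umount /mnt/disk-to-remove
>>     sudo rm -rf /mnt/disk-to-remove
>>
>>     # cassandra.yaml: data_file_directories keeps only the remaining disk
>>     data_file_directories:
>>         - /mnt/remaining-disk/cassandra/data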
>>
>> So the full op takes 10 h + 3.5 h + 0.5 h + (number of nodes * 0.1 h),
>> and each node is down for about 5-10 minutes.
>>
>> VS
>>
>> The straightforward way (stop node, move, start node): 10 h * the number
>> of nodes, as this needs to be sequential. Plus each node is down for 10
>> hours, so you have to repair them, as that is longer than the hinted
>> handoff window...
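>>
>> To put rough numbers on it, for a hypothetical 10-node cluster that is
>> 10 + 3.5 + 0.5 + (10 * 0.1) = 15 hours of total operation time with each
>> node down 5-10 minutes, versus 10 * 10 = 100 hours sequentially with each
>> node down 10 hours and needing a repair.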
>>
>> Branton, I did not go through your process, but I guess you will be able
>> to review it yourself after reading the above (typically, repair is not
>> needed if you use the strategy I describe above, as the node is only down
>> for 5-10 minutes). Also, I am not sure how "rsync -azvuiP
>> /var/data/cassandra/data2/ /var/data/cassandra/data/" will behave; my
>> guess is that it will do a copy, so this might be very long. My script
>> performs an instant move, and as the next command is 'rm -Rf
>> /var/data/cassandra/data2', I see no reason to copy rather than move the
>> files.
>>
>> Your solution would probably work, but with big constraints from an
>> operational point of view (very long operation + repair needed).
>>
>> Hope this long email is useful; maybe I should blog about this. Let me
>> know if the process above makes sense or if some things could be
>> improved.
>>
>> C*heers,
>> -----------------
>> Alain Rodriguez
>> France
>>
>> The Last Pickle
>> http://www.thelastpickle.com
>>
>> 2016-02-19 7:19 GMT+01:00 Branton Davis <branton.davis@spanning.com>:
>>
>>> Jan, thanks!  It makes perfect sense to run it a second time before
>>> stopping Cassandra.  I'll add that in when I do the production cluster.
>>>
>>> On Fri, Feb 19, 2016 at 12:16 AM, Jan Kesten <j.kesten@enercast.de>
>>> wrote:
>>>
>>>> Hi Branton,
>>>>
>>>> two cents from me - I didn't look through the script, but for the
>>>> rsyncs I do pretty much the same when moving them. Since they are
>>>> immutable, I do a first sync while everything is up and running to the
>>>> new location, which runs really long. Meanwhile new ones are created
>>>> and I sync them again online, with far fewer files to copy now. After
>>>> that I shut down the node, and my last rsync now has to copy only a few
>>>> files, which is quite fast, so the downtime for that node is within
>>>> minutes.
>>>>
>>>> Jan
>>>>
>>>>
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On 18.02.2016 at 22:12, Branton Davis <branton.davis@spanning.com> wrote:
>>>>
>>>> Alain, thanks for sharing!  I'm confused why you do so many repetitive
>>>> rsyncs.  Just being cautious or is there another reason?  Also, why do you
>>>> have --delete-before when you're copying data to a temp (assumed empty)
>>>> directory?
>>>>
>>>> On Thu, Feb 18, 2016 at 4:12 AM, Alain RODRIGUEZ <arodrime@gmail.com>
>>>> wrote:
>>>>
>>>>> I did the process a few weeks ago and ended up writing a runbook and
>>>>> a script. I have anonymised them and shared them, FWIW.
>>>>>
>>>>> https://github.com/arodrime/cassandra-tools/tree/master/remove_disk
>>>>>
>>>>> It is basic bash. I tried to have the shortest downtime possible,
>>>>> which makes this a bit more complex, but it allows you to do a lot in
>>>>> parallel and to run only a fast operation sequentially, reducing the
>>>>> overall operation time.
>>>>>
>>>>> This worked fine for me, yet I might have made some errors while
>>>>> making it configurable through variables. Be sure to be around if you
>>>>> decide to run this. Also, I automated this further by using knife
>>>>> (Chef), as I hate to repeat ops; this is something you might want to
>>>>> consider.
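>>>>>
>>>>> With knife that can be as simple as something like this (the role name
>>>>> and script path are purely illustrative):
>>>>>
>>>>>     knife ssh 'roles:cassandra' 'sudo bash /tmp/remove_disk.sh' -x myuser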
>>>>>
>>>>> Hope this is useful,
>>>>>
>>>>> C*heers,
>>>>> -----------------
>>>>> Alain Rodriguez
>>>>> France
>>>>>
>>>>> The Last Pickle
>>>>> http://www.thelastpickle.com
>>>>>
>>>>> 2016-02-18 8:28 GMT+01:00 Anishek Agarwal <anishek@gmail.com>:
>>>>>
>>>>>> Hey Branton,
>>>>>>
>>>>>> Please do let us know if you face any problems  doing this.
>>>>>>
>>>>>> Thanks
>>>>>> anishek
>>>>>>
>>>>>> On Thu, Feb 18, 2016 at 3:33 AM, Branton Davis <
>>>>>> branton.davis@spanning.com> wrote:
>>>>>>
>>>>>>> We're about to do the same thing.  It shouldn't be necessary to shut
>>>>>>> down the entire cluster, right?
>>>>>>>
>>>>>>> On Wed, Feb 17, 2016 at 12:45 PM, Robert Coli <rcoli@eventbrite.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Feb 16, 2016 at 11:29 PM, Anishek Agarwal <
>>>>>>>> anishek@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> To accomplish this, can I just copy the data from disk1 to disk2
>>>>>>>>> within the relevant cassandra home location folders, change the
>>>>>>>>> cassandra.yaml configuration and restart the node? Before starting
>>>>>>>>> I will shut down the cluster.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Yes.
>>>>>>>>
>>>>>>>> =Rob
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
