Tuesday, July 10, 2018

The Proper Way to Take a Storage Spaces Direct Server Offline for Maintenance?

Way back in September 2017, a Microsoft update changed what happens when you suspend (pause) or resume a node in an S2D cluster. As far as I can tell, the article "Taking a Storage Spaces Direct Server offline for maintenance" hasn't been updated to reflect this change.

Prior to the September 2017 update, suspending a node, either via the Suspend-ClusterNode PowerShell cmdlet or via the Pause option in the Failover Cluster Manager GUI, would put all of that node's disks into maintenance mode. Resuming the node would then take the disks back out of maintenance mode.

The current suspend/resume logic does nothing with the disks: suspending a node no longer puts its disks into maintenance mode, and resuming the node doesn't touch the disks either.
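For reference, the suspend/resume pair looks like this in PowerShell (a minimal sketch; "node1" is a placeholder and these are the standard FailoverClusters cmdlets, nothing S2D-specific):

# Drain the roles off the node and pause it (the disks are no longer touched)
Suspend-ClusterNode -Name "node1" -Drain

# Later, bring the node back into the cluster (again, the disks are untouched)
Resume-ClusterNode -Name "node1" -Failback Immediate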

My working theory is that after you suspend/drain the node, and before shutting it down or restarting it, you need to put that node's disks into maintenance mode yourself. This can be done with the following PowerShell command:

Get-StorageFaultDomain -Type StorageScaleUnit | Where-Object {$_.FriendlyName -eq "node1"} | Enable-StorageMaintenanceMode

Be sure to change "node1" in the snippet above to the name of the node you've suspended.
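If you want to sanity-check that the disks actually went into maintenance mode, something like this should show them (I'm assuming the OperationalStatus text contains "Maintenance"; adjust the filter if your output differs):

Get-PhysicalDisk | Where-Object {$_.OperationalStatus -like "*Maintenance*"} | Select-Object FriendlyName, OperationalStatus, HealthStatus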

Once the node is powered back on or has finished rebooting, and before resuming it, you need to take that node's disks out of maintenance mode, which can be done with the following command:

Get-StorageFaultDomain -Type StorageScaleUnit | Where-Object {$_.FriendlyName -eq "node1"} | Disable-StorageMaintenanceMode
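Once the node has been resumed, I'd also let the storage repair/rebalance jobs finish before moving on to the next node. This is just a convenience sketch using the standard Get-StorageJob and Get-VirtualDisk cmdlets; the polling interval is arbitrary:

# Wait for all storage jobs to finish before touching the next node
while (Get-StorageJob | Where-Object {$_.JobState -ne "Completed"}) { Start-Sleep -Seconds 30 }

# Everything should report Healthy before the next node gets paused
Get-VirtualDisk | Select-Object FriendlyName, HealthStatus, OperationalStatus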

The reason I think this should be done: if you just reboot the node without putting its disks into maintenance mode, the cluster behaves as if it lost that node. Things will recover eventually, but timeouts may occur, and if your system is extremely busy with IO, bad things could happen (VMs rebooting, CSVs moving, etc.). Putting the node's disks into maintenance mode first tells all the other nodes what's going on, so the recovery logic doesn't need to kick in. Think of it this way: it's better to tell all the nodes what's happening than to make them figure it out on their own... I need to test this theory some more...

Update: It looks like the May 8th, 2018 update (KB4103723) "introduced SMB Resilient Handles for the S2D intra-cluster network to improve resiliency to transient network failures. This had some side effects in increased timeouts when a node is rebooted, which can effect a system under load. Symptoms include event ID 5120's with a status code of STATUS_IO_TIMEOUT or STATUS_CONNECTION_DISCONNECTED when a node is rebooted." The procedure above is the workaround until a fix is available.
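If you want to check whether you're hitting this, the 5120 events show up in the System event log from the FailoverClustering provider. Something like this should surface them (just a query sketch):

Get-WinEvent -FilterHashtable @{LogName='System'; ProviderName='Microsoft-Windows-FailoverClustering'; Id=5120} | Select-Object TimeCreated, Message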

Another symptom, at least for me, was connections to a guest SQL cluster timing out. When a node was rebooted prior to the May update, everything was fine. After applying the May update and rebooting a node, SQL timeouts would occur.

Update: Microsoft released an article on this, and for the time being it would seem you should put the disks into maintenance mode prior to rebooting. It also appears you might want to disable live dumps.

Update: Instead of specifying the node name, you can use $Env:ComputerName and run the commands directly on the node you're performing maintenance on:

Get-StorageFaultDomain -Type StorageScaleUnit | Where-Object {$_.FriendlyName -eq "$($Env:ComputerName)"} | Enable-StorageMaintenanceMode

Get-StorageFaultDomain -Type StorageScaleUnit | Where-Object {$_.FriendlyName -eq "$($Env:ComputerName)"} | Disable-StorageMaintenanceMode


2 comments:

  1. I'm curious if this would have prevented me from being impacted recently. When I added a couple of new disks to each host, a rebalance job was kicked off as expected, and I paused and rebooted a host following the MS doc you referenced. Then, after the repair jobs from that host coming back up were finished, I paused and rebooted another (all virtual disks were healthy), and when it came back up the cluster freaked out. I'm assuming it was due to the rebalance not finishing, but it didn't have any issues with rebooting the first host. If the disks had been in maintenance mode as you suggest, I wonder if the rebalance job could have waited and recovered without causing an outage.

  2. I had a problem putting disks into maintenance mode because it would time out. After a February 2019 update to Server 2016 (KB4487006) it now works properly. It doesn't seem to cause a problem whether or not I put the disks into maintenance mode before rebooting a host, but I typically do, based on my logic matching yours.
