Tuesday, June 20, 2017

Storage Spaces Direct (S2D) Storage Jobs Suspended and Degraded Disks

Storage Spaces Direct is great, but every once in a while an S2D storage job will get stuck and just sit there in a suspended state. This usually happens after a reboot of one of the nodes in the cluster.

What you don't want to do is take a different node out of the cluster while a storage job is stuck and while there are degraded virtual disks.

You should make a habit of checking the storage jobs and the virtual disk status before changing node membership. You can do this easily with the Get-StorageJob and Get-VirtualDisk cmdlets. Alternatively, you could use the script I wrote to continually update the status of both the S2D storage jobs and the virtual disks.
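
If you don't have that script handy, here is a minimal sketch of the same idea (the 10-second refresh and the columns shown are just illustrative choices, not necessarily what my script does):

# Continuously refresh the S2D storage job and virtual disk status.
while ($true) {
    Clear-Host
    "=== Storage jobs ==="
    Get-StorageJob | Format-Table Name, JobState, PercentComplete, BytesProcessed, BytesTotal -AutoSize
    "=== Virtual disks ==="
    Get-VirtualDisk | Format-Table FriendlyName, OperationalStatus, HealthStatus -AutoSize
    Start-Sleep -Seconds 10
}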

So what does one do if a storage job is stuck? There are two cmdlets that I've found will fix this. The first is Optimize-StoragePool. The second is Repair-VirtualDisk. Start with Optimize-StoragePool, and if that doesn't work then move on to Repair-VirtualDisk. Here is how you use them:

Get-StoragePool <storage pool friendly name> | Optimize-StoragePool

Example: Get-StoragePool s2d* | Optimize-StoragePool

Get-VirtualDisk <virtual disk friendly name> | Repair-VirtualDisk

Example: Get-VirtualDisk vd01 | Repair-VirtualDisk

Usually optimizing the storage pool takes care of the hung storage job and fixes the degraded virtual disks, but if not, target the virtual disk directly with Repair-VirtualDisk.
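
Either way, give the cluster a minute and confirm that the job actually resumes and the virtual disks report healthy again, for example:

Get-StorageJob | Format-Table Name, JobState, PercentComplete -AutoSize
Get-VirtualDisk | Format-Table FriendlyName, OperationalStatus, HealthStatus -AutoSize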

If neither of those work, give Repair-ClusterStorageSpacesDirect / Repair-ClusterS2D a try. I haven't tried this one yet but it looks like it could help.

Update: I tried Repair-ClusterS2D. It does not appear to help with this scenario. There is limited documentation on it, but it looks like it's something you use if a virtual disk gets disconnected.

Update: Run Get-PhysicalDisk. If any of the disks say they're in maintenance mode, this could be the cause of your degraded disks and your stuck jobs. This seems to happen when you pause and resume a node too close together. To take the disks out of maintenance mode, run the following:

Get-PhysicalDisk | Where-Object { $_.OperationalStatus -eq "In Maintenance Mode" } | Disable-StorageMaintenanceMode
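
Afterwards, double-check that no disks are still reported as in maintenance mode and that the repair jobs pick back up, for example:

Get-PhysicalDisk | Format-Table FriendlyName, OperationalStatus, HealthStatus, Usage -AutoSize
Get-StorageJob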

Another Update: If a disk becomes detached, try this.
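
For what it's worth, for a detached virtual disk the usual starting point is Connect-VirtualDisk, then bringing the cluster resource back online (vd01 and the resource name below are placeholders; your names will differ):

# Reattach the detached virtual disk.
Get-VirtualDisk -FriendlyName vd01 | Connect-VirtualDisk
# If the volume is a CSV, bring its cluster resource back online afterwards.
Start-ClusterResource "Cluster Virtual Disk (vd01)"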

10 comments:

  1. I just want to thank you. After installing KB4041688 I had my physical disks in maintenance mode on two of my clusters but had not noticed it. While looking for a reason why my repair jobs were not running I found your post and it pointed me to the correct solution. Thank you

  2. Ditto! Two-node S2D cluster. I drained and paused NODE01 to install KB4041688. Restarted, then resumed the node. Did a Get-StorageJob (I always do this now! I've broken our cluster before because I didn't realise it hadn't synced the disks across the two nodes). Noticed it was stuck on suspended. I tried the usual trick of Get-VirtualDisk | Repair-VirtualDisk with no joy.

    I googled it and your blog came up. I've used your post in the past to fix this issue. I was pleased to see that you've updated it.

    I'm beginning to think the suspended jobs are a result of manual intervention. Am I supposed to let Windows handle the draining of roles for updates? I'm just scared to let Windows manage it, as it's broken our cluster in the past, probably as a result of hung storage jobs.

    Anyway, thank you for your awesome research on this! Thanks for sharing.



    Replies
    1. I haven't verified it, but I think MS changed what they do with physical disks when draining nodes in a cluster with S2D. I believe previously when you drained a node they would put that node's disks in maintenance mode and you would get degraded virtual disks and pending storage jobs. Lately when I've been draining nodes I've noticed that I don't get the storage jobs and the degraded disks until I reboot the node (which obviously makes sense). Hence I think they might have changed what they do with the physical disks when you pause a node in the cluster... As I said though, I haven't confirmed this or tested the hypothesis. Just something I think I have observed.

    2. Hi Scott. About this maintenance mode change: do you have a correct working procedure for taking a node down? I have used the one Microsoft has published (which states that S2D will be in a degraded state), but it is not, and I have lost a volume doing a drain role and afterwards a reboot. I have not found a better procedure.

    3. I always pause the node through cluster administrator prior to rebooting. This moves the VMs off and it will remember where they were. After reboot when you unpause the node, since it remembers what VMs were running on it, it will move the VMs back onto the node. I always always always make sure that the cluster is healthy and all volumes are healthy prior to taking a node out of the cluster. Even if your cluster is designed to withstand a two node failure you ideally want things in a good state before performing maintenance. Hope that helps.
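
      A minimal sketch of that flow in PowerShell (FailoverClusters cmdlets; NODE01 and the health checks are placeholders):

      # Make sure no storage jobs are running and all virtual disks are healthy.
      Get-StorageJob
      Get-VirtualDisk | Format-Table FriendlyName, OperationalStatus, HealthStatus -AutoSize

      # Pause the node and drain its roles before rebooting it.
      Suspend-ClusterNode -Name "NODE01" -Drain -Wait

      # ...reboot or patch the node...

      # Resume the node and move its roles back.
      Resume-ClusterNode -Name "NODE01" -Failback Immediate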

  3. This comment has been removed by a blog administrator.

  4. This comment has been removed by a blog administrator.

  5. This comment has been removed by a blog administrator.

  6. Does Repair-ClusterStorageSpacesDirect -Verbose take vdisks offline?

    Replies
    1. Not that I am aware of. You can use it to bring disks out of maintenance mode. So instead of the script I have above, you could do the following: Repair-ClusterStorageSpacesDirect -Node [hostname] -DisableStorageMaintenanceMode

      https://bcthomas.com/2017/09/bug-when-applying-kb4038782-september-cu-to-storage-spaces-direct-clusters/
