Thursday, April 19, 2018

S2D Recovering a Detached Virtual Disk with No Redundancy

So you've got your S2D cluster and you've lost redundancy. Now what? How can this happen? Restart too many nodes too quickly and you can end up here. Let's walk through a sample scenario and see what happens.


Jump to the bottom of the article if you want to skip all the fluff and get right to the fix.

First the setup:
For this test I set up a 2-node cluster and created 2 mirrored virtual disks, vd01 and vd02. I created volumes on them, imported a couple of VMs onto each volume, and set up the VMs so that the system would be generating IO. Check the Operational Status and the Health Status of the virtual disks by running Get-VirtualDisk; they should all be healthy. I use this script to constantly refresh the virtual disk health and the storage jobs.
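If you don't have a script like that handy, a minimal refresh loop along these lines works (this is my own sketch, not the exact script referenced above):

# Re-query the storage jobs and virtual disk health every few seconds.
while ($true) {
    Clear-Host
    Get-StorageJob | Format-Table Name, IsBackgroundTask, ElapsedTime, JobState, PercentComplete, BytesProcessed, BytesTotal
    Get-VirtualDisk | Format-Table FriendlyName, ResiliencySettingName, OperationalStatus, HealthStatus, IsManualAttach, Size
    Start-Sleep -Seconds 5
}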

Next the failure:
Restart one of the nodes. The virtual disks should show a warning and their operational status should change to Degraded or to Degraded, Incomplete. You should also see some repair jobs in the Suspended state.

Name   IsBackgroundTask ElapsedTime JobState  PercentComplete BytesProcessed BytesTotal
----   ---------------- ----------- --------  --------------- -------------- ----------
Repair True             00:00:12    Suspended 0               0              29527900160
Repair True             00:04:06    Suspended 0               0              36775657472



FriendlyName ResiliencySettingName OperationalStatus      HealthStatus IsManualAttach Size
------------ --------------------- -----------------      ------------ -------------- ----
vd02                               {Degraded, Incomplete} Warning      True           1 TB
vd01                               {Degraded, Incomplete} Warning      True           1 TB

Once the node comes back online the storage jobs should start running and the disks will show InService. You may also see a Degraded status while the storage jobs run. The data that changed while the node was down is being rebuilt.

Name   IsBackgroundTask ElapsedTime JobState  PercentComplete BytesProcessed BytesTotal
----   ---------------- ----------- --------  --------------- -------------- ----------
Repair True             00:00:04    Suspended 0               0              19058917376
Repair True             00:03:03    Running   16              6276775936     37580963840



FriendlyName ResiliencySettingName OperationalStatus HealthStatus IsManualAttach Size
------------ --------------------- ----------------- ------------ -------------- ----
vd02                               InService         Warning      True           1 TB
vd01                               InService         Warning      True           1 TB

Now, before the storage jobs finish rebuilding redundancy, restart the other node. The disks will likely go into a Detached operational status.

Name   IsBackgroundTask ElapsedTime JobState PercentComplete BytesProcessed BytesTotal
----   ---------------- ----------- -------- --------------- -------------- ----------
Repair False            00:00:00    Killed   0



FriendlyName ResiliencySettingName OperationalStatus HealthStatus IsManualAttach Size
------------ --------------------- ----------------- ------------ -------------- ----
vd02                               Detached          Unknown      True           1 TB
vd01                               Detached          Unknown      True           1 TB

You may see a status that says No Redundancy.

Name   IsBackgroundTask ElapsedTime JobState  PercentComplete BytesProcessed BytesTotal
----   ---------------- ----------- --------  --------------- -------------- ----------
Repair True             00:03:12    Suspended 0               0              5368709120



FriendlyName ResiliencySettingName OperationalStatus HealthStatus IsManualAttach Size
------------ --------------------- ----------------- ------------ -------------- ----
vd02                               No Redundancy     Unhealthy    True           1 TB
vd01                               Detached          Unknown      True           1 TB

In this case vd02 is still online but not accessible from C:\ClusterStorage\. If you try to run a repair on the virtual disk with "No Redundancy" you get the following:

PS C:\Users\administrator.SHORELAND_NT> Get-VirtualDisk vd02 | Repair-VirtualDisk
Repair-VirtualDisk : There is not enough redundancy remaining to repair the virtual disk.
Activity ID: {64afdbc9-9ce4-4108-9aac-f4da6d277585}
At line:1 char:24
+ Get-VirtualDisk vd02 | Repair-VirtualDisk
+                        ~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (StorageWMI:ROOT/Microsoft/...SFT_VirtualDisk) [Repair-VirtualDisk], CimEx
   ception
    + FullyQualifiedErrorId : StorageWMI 50001,Repair-VirtualDisk

This test system is a clean build with terabytes of free space, so it's not a space issue. If your virtual disk only says "No Redundancy" you may want to wait a bit and/or try taking the disk offline and bringing it back online. That has fixed it for me before.
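If you'd rather do the offline/online from PowerShell than from Failover Cluster Manager, something like the following works. The resource name here is an assumption; check yours with Get-ClusterResource.

# Take the clustered virtual disk offline and bring it back online.
Stop-ClusterResource -Name "Cluster Virtual Disk (vd02)"
Start-ClusterResource -Name "Cluster Virtual Disk (vd02)"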

The same process for recreating the failure applies to a 3-node or larger setup, though depending on your resiliency settings you may have to fail multiple nodes at a time.

For the detached disks, Failover Cluster Manager will show the virtual disks in a Failed state. If you try to bring a virtual disk online through FCM you'll get an error that says "The system cannot find the drive specified." (error code 0x8007000F).

If you try to connect the virtual disk through PowerShell you'll get:

Get-VirtualDisk | Where-Object -Filter { $_.OperationalStatus -eq "Detached" } | Connect-VirtualDisk

Connect-VirtualDisk : Access denied

Extended information:
Access is denied.

Recommended Actions:
- Check if you have the necessary privileges to perform the operation.
- Perform the operation from Failover Cluster Manager if the resource is clustered.

Activity ID: {583d3820-dacb-4246-93cf-b52d05d17911}
At line:1 char:82
+ ... -Filter { $_.OperationalStatus -eq "Detached" } | Connect-VirtualDisk
+                                                       ~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : PermissionDenied: (StorageWMI:ROOT/Microsoft/...SFT_VirtualDisk) [Connect-VirtualDisk],
   CimException

    + FullyQualifiedErrorId : StorageWMI 40001,Connect-VirtualDisk

Finally the solution:
Here is what you need to do.

1. Remove all the virtual disks and the pool from the cluster (a hedged PowerShell approximation follows these bullets)
  • In Failover Cluster Manager select "Pools" on the left under Storage
  • Select your storage pool in the top pane and then the Virtual Disks tab in the bottom pane
  • Right-click each virtual disk and choose "Remove from Cluster Shared Volumes"
  • Right-click each virtual disk and choose "Remove"
  • Right-click the storage pool in the top pane and choose "Remove"
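If you prefer PowerShell for step 1, the removals look roughly like this. This is a sketch that assumes the default resource types the cluster uses for S2D virtual disks ("Physical Disk") and the pool ("Storage Pool"); double-check with Get-ClusterResource before removing anything.

# Remove every CSV, then the virtual disk resources, then the pool resource.
Get-ClusterSharedVolume | Remove-ClusterSharedVolume
Get-ClusterResource | Where-Object { $_.ResourceType -like "Physical Disk" } | Remove-ClusterResource -Force
Get-ClusterResource | Where-Object { $_.ResourceType -like "Storage Pool" } | Remove-ClusterResource -Force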
2. Go to Server Manager > File and Storage Services, locate the storage pool, right-click it and choose "Set Read-Write Access". Choose one of the nodes; I would choose the server you have Server Manager open on. This gives the single node you selected control over the storage pool. This is the key step. A rough PowerShell equivalent is sketched below.
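This is what that looks like from PowerShell, run on the node that should own the pool. It's a sketch that assumes you only have the one S2D pool; if not, filter by FriendlyName instead.

# Clear the read-only flag on the (non-primordial) pool from the node that should own it.
Get-StoragePool -IsPrimordial $false | Set-StoragePool -IsReadOnly $false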

3. Select the failed virtual disks, right-click and try to attach them. It will likely fail, but it will start the repair process automatically. You can watch the status of the repair with Get-VirtualDisk and Get-StorageJob, or again you can use the script I created.

Name   IsBackgroundTask ElapsedTime JobState PercentComplete BytesProcessed BytesTotal
----   ---------------- ----------- -------- --------------- -------------- ----------
Repair True             00:00:09    Running  0               0              20937965568
Repair True             00:01:46    Running  16              8020033536     47513075712



FriendlyName ResiliencySettingName OperationalStatus          HealthStatus IsManualAttach Size
------------ --------------------- -----------------          ------------ -------------- ----
vd02                               {No Redundancy, InService} Unhealthy    True           1 TB
vd01                               {No Redundancy, InService} Unhealthy    True           1 TB

The Server Manager should show a warning icon next to the virtual disk while it repairs. You may have to refresh.

4. Once the repair is done, the virtual disk's operational status should go to OK and its health status to Healthy. The jobs should complete and there should be no more running jobs. The failed and warning icons should go away in Server Manager (you may have to refresh). You will be able to attach the virtual disk now; a quick PowerShell check is sketched below.
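A quick check of the same thing from PowerShell (my own addition):

# Everything should show OK/Healthy and no repair jobs should be left running.
Get-VirtualDisk | Select-Object FriendlyName, OperationalStatus, HealthStatus
Get-StorageJob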

5. Re-add the pool and the virtual disks to the cluster.
  • In Failover Cluster Manager select "Pools" on the left under Storage
  • Right-click "Pools" and select "Add Storage Pool". Select your pool and hit OK.
  • Select your storage pool in the top pane, right-click and choose "Add Virtual Disk". Select all of your virtual disks and hit OK.
  • Right-click each virtual disk and "Add to Cluster Shared Volumes"
6. Start your VMs back up. You may have to bring the VMs' resources online if they're not already; a hedged PowerShell sketch follows.
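For example, something along these lines brings the clustered VM groups back online (my own sketch; it assumes your VMs are clustered roles):

# Bring every clustered VM group online, which starts its resources.
Get-ClusterGroup | Where-Object { $_.GroupType -like "VirtualMachine" } | Start-ClusterGroup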

The downside to this process: even if only one virtual disk is in a failed/detached/no-redundancy state and all the others are fine, you still have to take all the virtual disks and the pool out of the cluster to perform the recovery. That means healthy virtual disks end up offline (out of the cluster and not exposed as CSVs), so you may have VMs that have to be down even though their underlying storage is healthy. You could move those VMs to a different location before performing the above procedure (one option is sketched below). Just something to be aware of.
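If you do want to relocate VMs off the healthy volumes first, a storage migration along these lines is one option. The VM name and destination path here are placeholders.

# Move a VM's storage off the CSV before taking the pool out of the cluster.
Move-VMStorage -VMName "vm01" -DestinationStoragePath "D:\TempVMStore\vm01"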

4.20.2018 Updated and Better Solution:
After I wrote this post, I found the following blog post and gave it a shot. This solution seems to do the trick, and it doesn't require taking the storage pool offline or touching any of your virtual disks that are in perfectly working shape! The gist of it is this:

Remove the disk from CSV. Set the diskrunchkdsk and diskrecoveryaction cluster parameters on the disk. Start the disk and start the recovery. Let the recovery job finish. Then stop the disk, revert the cluster parameter settings, add it back to CSV, and start the disk.

Remove-ClusterSharedVolume -Name "Cluster Disk 1"

Get-ClusterResource -Name "Cluster Disk 1" | Set-ClusterParameter -Name diskrunchkdsk -Value 7
Get-ClusterResource -Name "Cluster Disk 1" | Set-ClusterParameter -Name diskrecoveryaction -Value 1
Start-ClusterResource -Name "Cluster Disk 1"

Get-ScheduledTask -TaskName "Data Integrity Scan for Crash Recovery" | Start-ScheduledTask
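If you'd rather not eyeball the repair jobs, a simple polling loop like this (my own addition, not part of the original procedure) waits for them to finish:

# Poll until no storage jobs are still running or suspended.
while (Get-StorageJob | Where-Object { $_.JobState -like "Running" -or $_.JobState -like "Suspended" }) {
    Get-StorageJob | Format-Table Name, JobState, PercentComplete, BytesProcessed, BytesTotal
    Start-Sleep -Seconds 30
}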

The storage jobs will start; once they finish, run:

Stop-ClusterResource -Name "Cluster Disk 1"

Get-ClusterResource -Name "Cluster Disk 1" | Set-ClusterParameter -Name diskrecoveryaction -Value 0
Get-ClusterResource -Name "Cluster Disk 1" | Set-ClusterParameter -Name diskrunchkdsk -Value 0

Add-ClusterSharedVolume -Name "Cluster Disk 1"
Start-ClusterResource -Name "Cluster Disk 1"



Removing the CSV, adding the CSV, and starting/stopping the resource can also be done via FCM.
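Before putting the disk back into production you can also confirm that both parameters really are back to 0 (this check is my own addition):

# Confirm the recovery parameters were reverted on the disk resource.
Get-ClusterResource -Name "Cluster Disk 1" | Get-ClusterParameter -Name diskrunchkdsk
Get-ClusterResource -Name "Cluster Disk 1" | Get-ClusterParameter -Name diskrecoveryaction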

5.11.2018 Update
If you have a 2-node cluster or if you're using 2-way mirroring, read this. You might have a disk that is failing and you might not be able to recover the virtual disk/volume. Check for unrecoverable read errors.
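One quick way to check for those errors from PowerShell (my own sketch; the counters are only as good as what the drives report):

# List uncorrected read errors per physical disk.
Get-PhysicalDisk | ForEach-Object {
    $counters = $_ | Get-StorageReliabilityCounter
    [pscustomobject]@{
        Disk                  = $_.FriendlyName
        SerialNumber          = $_.SerialNumber
        ReadErrorsUncorrected = $counters.ReadErrorsUncorrected
    }
} | Format-Table -AutoSize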


5 comments:

  1. thank you VERY MUCH! You save us!

  2. Thanks a lot as well. You just saved us 3 days of work. The method described on 4.20.2018 worked like a charm

  3. Thank you for sharing...I don't know how anyone finds out about these attributes "diskrunchkdsk" or "diskrecoveryaction" and what values are acceptable...but this is amazing stuff.

    This saved us a lot of additional downtime, and very possibly some data loss...we're not overly thrilled with S2D and Hyper-V right now, but I imagine the alternative isn't all sunshine all the time either.

  4. Excellent work. I hope to blog about my adventures in using S2D for FCI in the near future.
