Thursday, April 19, 2018

S2D Recovering a Detached Virtual Disk with No Redundancy

So you've got your S2D cluster and you've lost redundancy. Now what? How can this happen? Restart too many nodes too quickly and you can end up here. Let's walk through a sample scenario and see what happens.


Jump to the bottom of the article if you want to skip all the fluff and get right to the fix.

First the setup:
For this test I set up a 2-node cluster and created 2 mirrored virtual disks, vd01 and vd02. I created volumes on them, imported a couple of VMs onto each volume, and set up the VMs so that the system would be generating IO. Check the Operational Status and the Health Status of the virtual disks by running Get-VirtualDisk; they should all be healthy. I use this script to constantly refresh the virtual disk health and the storage jobs.
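If you don't have a script like that handy, a minimal refresh loop along these lines works (this is my own sketch, not the exact script referenced above):

# Re-query the storage jobs and virtual disk health every few seconds.
while ($true) {
    Clear-Host
    Get-StorageJob | Format-Table Name, IsBackgroundTask, ElapsedTime, JobState, PercentComplete, BytesProcessed, BytesTotal
    Get-VirtualDisk | Format-Table FriendlyName, ResiliencySettingName, OperationalStatus, HealthStatus, IsManualAttach, Size
    Start-Sleep -Seconds 5
}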

Next the failure:
Restart one of the nodes. The virtual disks should show a warning and their operational status should change to Degraded or to Degraded, Incomplete. You should also see some repair jobs in the Suspended state.

Name   IsBackgroundTask ElapsedTime JobState  PercentComplete BytesProcessed BytesTotal
----   ---------------- ----------- --------  --------------- -------------- ----------
Repair True             00:00:12    Suspended 0               0              29527900160
Repair True             00:04:06    Suspended 0               0              36775657472



FriendlyName ResiliencySettingName OperationalStatus      HealthStatus IsManualAttach Size
------------ --------------------- -----------------      ------------ -------------- ----
vd02                               {Degraded, Incomplete} Warning      True           1 TB
vd01                               {Degraded, Incomplete} Warning      True           1 TB

Once the node comes back online the storage jobs should start running and the disks will show InService. You may also see a Degraded status while the storage jobs run. The data that changed while the node was down is being rebuilt.

Name   IsBackgroundTask ElapsedTime JobState  PercentComplete BytesProcessed BytesTotal
----   ---------------- ----------- --------  --------------- -------------- ----------
Repair True             00:00:04    Suspended 0               0              19058917376
Repair True             00:03:03    Running   16              6276775936     37580963840



FriendlyName ResiliencySettingName OperationalStatus HealthStatus IsManualAttach Size
------------ --------------------- ----------------- ------------ -------------- ----
vd02                               InService         Warning      True           1 TB
vd01                               InService         Warning      True           1 TB

Now, before the storage jobs finish rebuilding redundancy, restart the other node. The disks will likely go into a Detached operational status.

Name   IsBackgroundTask ElapsedTime JobState PercentComplete BytesProcessed BytesTotal
----   ---------------- ----------- -------- --------------- -------------- ----------
Repair False            00:00:00    Killed   0



FriendlyName ResiliencySettingName OperationalStatus HealthStatus IsManualAttach Size
------------ --------------------- ----------------- ------------ -------------- ----
vd02                               Detached          Unknown      True           1 TB
vd01                               Detached          Unknown      True           1 TB

You may see a status that says No Redundancy.

Name   IsBackgroundTask ElapsedTime JobState  PercentComplete BytesProcessed BytesTotal
----   ---------------- ----------- --------  --------------- -------------- ----------
Repair True             00:03:12    Suspended 0               0              5368709120



FriendlyName ResiliencySettingName OperationalStatus HealthStatus IsManualAttach Size
------------ --------------------- ----------------- ------------ -------------- ----
vd02                               No Redundancy     Unhealthy    True           1 TB
vd01                               Detached          Unknown      True           1 TB

In this case vd02 is still online but not accessible from C:\ClusterStorage\. If you try to run a repair on the virtual disk with "No Redundancy" you get the following:

PS C:\Users\administrator.SHORELAND_NT> Get-VirtualDisk vd02 | Repair-VirtualDisk
Repair-VirtualDisk : There is not enough redundancy remaining to repair the virtual disk.
Activity ID: {64afdbc9-9ce4-4108-9aac-f4da6d277585}
At line:1 char:24
+ Get-VirtualDisk vd02 | Repair-VirtualDisk
+                        ~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (StorageWMI:ROOT/Microsoft/...SFT_VirtualDisk) [Repair-VirtualDisk], CimEx
   ception
    + FullyQualifiedErrorId : StorageWMI 50001,Repair-VirtualDisk

This test system is a clean build with terabytes of free space, so it's not a space issue. If your virtual disk only says "No Redundancy" you may want to wait a bit and/or try taking the disk offline and bringing it back online. That has fixed it for me before.
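If you'd rather do the offline/online from PowerShell than from Failover Cluster Manager, something like the following works. The resource name here is an assumption; check yours with Get-ClusterResource.

# Take the clustered virtual disk offline and bring it back online.
Stop-ClusterResource -Name "Cluster Virtual Disk (vd02)"
Start-ClusterResource -Name "Cluster Virtual Disk (vd02)"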

The same process for recreating the failure applies to a 3-node or larger setup, though depending on your resiliency settings you may have to fail multiple nodes at a time.

For the detached disks, Failover Cluster Manager will show the virtual disks in a Failed state. If you try to bring a virtual disk online through FCM you'll get an error that says "The system cannot find the drive specified." (error code 0x8007000F).

If you try to connect the virtual disk through PowerShell you'll get:

Get-VirtualDisk | Where-Object -Filter { $_.OperationalStatus -eq "Detached" } | Connect-VirtualDisk

Connect-VirtualDisk : Access denied

Extended information:
Access is denied.

Recommended Actions:
- Check if you have the necessary privileges to perform the operation.
- Perform the operation from Failover Cluster Manager if the resource is clustered.

Activity ID: {583d3820-dacb-4246-93cf-b52d05d17911}
At line:1 char:82
+ ... -Filter { $_.OperationalStatus -eq "Detached" } | Connect-VirtualDisk
+                                                       ~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : PermissionDenied: (StorageWMI:ROOT/Microsoft/...SFT_VirtualDisk) [Connect-VirtualDisk],
   CimException

    + FullyQualifiedErrorId : StorageWMI 40001,Connect-VirtualDisk

Finally the solution:
Here is what you need to do.

1. Remove all the virtual disks and the pool from the cluster (a hedged PowerShell approximation follows these bullets)
  • In Failover Cluster Manager select "Pools" on the left under Storage
  • Select your storage pool in the top pane and then the Virtual Disks tab in the bottom pane
  • Right-click each virtual disk and choose "Remove from Cluster Shared Volumes"
  • Right-click each virtual disk and choose "Remove"
  • Right-click the storage pool in the top pane and choose "Remove"
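If you prefer PowerShell for step 1, the removals look roughly like this. This is a sketch that assumes the default resource types the cluster uses for S2D virtual disks ("Physical Disk") and the pool ("Storage Pool"); double-check with Get-ClusterResource before removing anything.

# Remove every CSV, then the virtual disk resources, then the pool resource.
Get-ClusterSharedVolume | Remove-ClusterSharedVolume
Get-ClusterResource | Where-Object { $_.ResourceType -like "Physical Disk" } | Remove-ClusterResource -Force
Get-ClusterResource | Where-Object { $_.ResourceType -like "Storage Pool" } | Remove-ClusterResource -Force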
2. Go to Server Manager > File and Storage Services, locate the storage pool, right-click it and choose "Set Read-Write Access". Choose one of the nodes; I would choose the server you have Server Manager open on. This gives the single node you selected control over the storage pool. This is the key step. A rough PowerShell equivalent is sketched below.
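This is what that looks like from PowerShell, run on the node that should own the pool. It's a sketch that assumes you only have the one S2D pool; if not, filter by FriendlyName instead.

# Clear the read-only flag on the (non-primordial) pool from the node that should own it.
Get-StoragePool -IsPrimordial $false | Set-StoragePool -IsReadOnly $false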

3. Select the failed virtual disks, right-click and try to attach them. It will likely fail, but it will start the repair process automatically. You can watch the status of the repair with Get-VirtualDisk and Get-StorageJob, or again you can use the script I created.

Name   IsBackgroundTask ElapsedTime JobState PercentComplete BytesProcessed BytesTotal
----   ---------------- ----------- -------- --------------- -------------- ----------
Repair True             00:00:09    Running  0               0              20937965568
Repair True             00:01:46    Running  16              8020033536     47513075712



FriendlyName ResiliencySettingName OperationalStatus          HealthStatus IsManualAttach Size
------------ --------------------- -----------------          ------------ -------------- ----
vd02                               {No Redundancy, InService} Unhealthy    True           1 TB
vd01                               {No Redundancy, InService} Unhealthy    True           1 TB

The Server Manager should show a warning icon next to the virtual disk while it repairs. You may have to refresh.

4. Once the repair is done, the virtual disk's operational status should go to OK and its health status to Healthy. The jobs should complete and there should be no more running jobs. The failed and warning icons should go away in Server Manager (you may have to refresh). You will be able to attach the virtual disk now; a quick PowerShell check is sketched below.
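A quick check of the same thing from PowerShell (my own addition):

# Everything should show OK/Healthy and no repair jobs should be left running.
Get-VirtualDisk | Select-Object FriendlyName, OperationalStatus, HealthStatus
Get-StorageJob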

5. Re-add the pool and the virtual disks to the cluster.
  • In Failover Cluster Manager select "Pools" on the left under Storage
  • Right-click "Pools" and select "Add Storage Pool". Select your pool and hit OK.
  • Select your storage pool in the top pane, right-click and choose "Add Virtual Disk". Select all of your virtual disks and hit OK.
  • Right-click each virtual disk and "Add to Cluster Shared Volumes"
6. Start your VMs back up. You may have to bring the VMs' resources online if they're not already; a hedged PowerShell sketch follows.
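For example, something along these lines brings the clustered VM groups back online (my own sketch; it assumes your VMs are clustered roles):

# Bring every clustered VM group online, which starts its resources.
Get-ClusterGroup | Where-Object { $_.GroupType -like "VirtualMachine" } | Start-ClusterGroup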

The downside to this process: even if only one virtual disk is in a failed/detached/no-redundancy state and all the others are fine, you still have to take all the virtual disks and the pool out of the cluster to perform the recovery. That means healthy virtual disks end up offline (out of the cluster and not exposed as CSVs), so you may have VMs that have to be down even though their underlying storage is healthy. You could move those VMs to a different location before performing the above procedure (one option is sketched below). Just something to be aware of.
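If you do want to relocate VMs off the healthy volumes first, a storage migration along these lines is one option. The VM name and destination path here are placeholders.

# Move a VM's storage off the CSV before taking the pool out of the cluster.
Move-VMStorage -VMName "vm01" -DestinationStoragePath "D:\TempVMStore\vm01"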

4.20.2018 Updated and Better Solution:
After I wrote this post, I found the following blog post and gave it a shot. This solution seems to do the trick, and it doesn't require taking the storage pool offline or touching any of your virtual disks that are in perfectly working shape! The gist of it is this:

Remove the disk from CSV. Set the diskrunchkdsk and diskrecoveryaction cluster parameters on the disk. Start the disk and start the recovery. Let the recovery job finish. Then stop the disk, revert the cluster parameter settings, add it back to CSV, and start the disk.

Remove-ClusterSharedVolume -Name "Cluster Disk 1"

Get-ClusterResource -Name "Cluster Disk 1" | Set-ClusterParameter -Name diskrunchkdsk -Value 7
Get-ClusterResource -Name "Cluster Disk 1" | Set-ClusterParameter -Name diskrecoveryaction -Value 1
Start-ClusterResource -Name "Cluster Disk 1"

Get-ScheduledTask -TaskName "Data Integrity Scan for Crash Recovery" | Start-ScheduledTask
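If you'd rather not eyeball the repair jobs, a simple polling loop like this (my own addition, not part of the original procedure) waits for them to finish:

# Poll until no storage jobs are still running or suspended.
while (Get-StorageJob | Where-Object { $_.JobState -like "Running" -or $_.JobState -like "Suspended" }) {
    Get-StorageJob | Format-Table Name, JobState, PercentComplete, BytesProcessed, BytesTotal
    Start-Sleep -Seconds 30
}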

The storage jobs will start; once they finish, run:

Stop-ClusterResource -Name "Cluster Disk 1"

Get-ClusterResource -Name "Cluster Disk 1" | Set-ClusterParameter -Name diskrecoveryaction -Value 0
Get-ClusterResource -Name "Cluster Disk 1" | Set-ClusterParameter -Name diskrunchkdsk -Value 0

Add-ClusterSharedVolume -Name "Cluster Disk 1"
Start-ClusterResource -Name "Cluster Disk 1"



Removing the CSV, adding the CSV, and starting/stopping the resource can also be done via FCM.
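Before putting the disk back into production you can also confirm that both parameters really are back to 0 (this check is my own addition):

# Confirm the recovery parameters were reverted on the disk resource.
Get-ClusterResource -Name "Cluster Disk 1" | Get-ClusterParameter -Name diskrunchkdsk
Get-ClusterResource -Name "Cluster Disk 1" | Get-ClusterParameter -Name diskrecoveryaction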

5.11.2018 Update
If you have a 2-node cluster or if you're using 2-way mirroring, read this. You might have a disk that is failing and you might not be able to recover the virtual disk/volume. Check for unrecoverable read errors.
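One quick way to check for those errors from PowerShell (my own sketch; the counters are only as good as what the drives report):

# List uncorrected read errors per physical disk.
Get-PhysicalDisk | ForEach-Object {
    $counters = $_ | Get-StorageReliabilityCounter
    [pscustomobject]@{
        Disk                  = $_.FriendlyName
        SerialNumber          = $_.SerialNumber
        ReadErrorsUncorrected = $counters.ReadErrorsUncorrected
    }
} | Format-Table -AutoSize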


5 comments:

  1. thank you VERY MUCH! You save us!

  2. Thanks a lot as well. You just saved us 3 days of work. The method described on 4.20.2018 worked like a charm

  3. Thank you for sharing...I don't know how anyone finds out about these attributes "diskrunchkdsk" or "diskrecoveryaction" and what values are acceptable...but this is amazing stuff.

    This saved us a lot of additional downtime, and very possibly some data loss...we're not overly thrilled with S2D and Hyper-V right now, but I imagine the alternative isn't all sunshine all the time either.

  4. Excellent work. I hope to blog about my adventures in using S2D for FCI in the near future.
