Tuesday, October 9, 2018

Insightful WMI Snippets


Get all the namespaces
function Get-CimNamespaces ($ns="root") {
   Get-CimInstance -ClassName __NAMESPACE -Namespace $ns |
   foreach {
       Get-CimNamespaces $("$ns\" + $_.Name)
   }
   Get-CimInstance -ClassName __NAMESPACE -Namespace $ns
}
Get-CimNamespaces 

Get-CimNamespaces | select Name, @{n='NameSpace';e={$_.CimSystemProperties.NameSpace}}, CimClass

View all the namespaces in a hierarchy
function View-CimNamespacesHierarchy ($ns="root", $c="|") {
   Get-CimInstance -ClassName __NAMESPACE -Namespace $ns |
   foreach {
       Write-Host $c $_.Name
       View-CimNamespacesHierarchy $("$ns\" + $_.Name) "$c-"
   }
}
View-CimNamespacesHierarchy


Get all the WMI providers on the system
function Get-CimProvider ($ns="root") {
   Get-CimInstance -ClassName __NAMESPACE -Namespace $ns |
   foreach {
       Get-CimProvider $("$ns\" + $_.Name)
   }
   Get-CimInstance -NameSpace $ns -Class __Win32Provider
}
Get-CimProvider

Get-CimProvider | select Name, @{n='NameSpace';e={$_.CimSystemProperties.NameSpace}}, CimClass

Get-CimProvider | Measure-Object
=> 188

Get all the provider registrations. A provider can be registered multiple times with different registration types.
function Get-CimProviderRegistration ($ns="root") {
   Get-CimInstance -ClassName __NAMESPACE -Namespace $ns |
   foreach {
       Get-CimProviderRegistration $("$ns\" + $_.Name)
   }
   Get-CimInstance -NameSpace $ns -Class __ProviderRegistration
}
Get-CimProviderRegistration

Get-CimProviderRegistration | Measure-Object
=> 303

Get-CimProviderRegistration | Group-Object {$_.CimSystemProperties.ClassName} | select count, name

=>
Count Name                              
----- ----                              
   11 __EventConsumerProviderRegistration
  135 __InstanceProviderRegistration    
    2 __PropertyProviderRegistration    
  115 __MethodProviderRegistration      
   34 __EventProviderRegistration       
    6 __ClassProviderRegistration       

Wednesday, September 19, 2018

WMI root\cimv2 Hierarchy Visualization

Update: I made a mobile-friendly version. You can find it at http://www.kreel.com/wmi. If you add it to your home screen and then open it, it should open full screen and give you more screen real estate (at least on the iPhone).

While digging through the root\cimv2 WMI namespace I wanted to visually see all of the parent-child relationships. So I put together a quick visualization.

You can see it here: http://www.kreel.com/wmi_hierarchy.html



It's just a tree starting with all the classes that do not inherit from any parent. I created a "no parent" node that all the classes without a parent fall under. This just made it simpler/quicker to get the visualization done.
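The data behind a tree like this can be sketched in PowerShell. This is not the code that powers the page, just a minimal illustration of the "no parent" grouping. The function name Get-ClassTree is made up for this example; it relies only on the CimClassName and CimSuperClassName properties that Get-CimClass returns.

```powershell
# Hypothetical helper: builds a parent -> children lookup from class objects.
# Classes with no superclass are filed under a synthetic 'no parent' key.
function Get-ClassTree {
    param([Parameter(Mandatory)] [object[]] $Classes)

    $tree = @{}
    foreach ($c in $Classes) {
        # Root classes have an empty CimSuperClassName.
        $parent = if ($c.CimSuperClassName) { $c.CimSuperClassName } else { 'no parent' }
        if (-not $tree.ContainsKey($parent)) {
            $tree[$parent] = [System.Collections.Generic.List[string]]::new()
        }
        $tree[$parent].Add($c.CimClassName)
    }
    $tree
}

# Usage on a Windows machine (commented out so the function can be loaded anywhere):
# $tree = Get-ClassTree -Classes (Get-CimClass -Namespace root\cimv2)
# $tree['no parent']          # all root classes
# $tree['CIM_LogicalDevice']  # direct children of CIM_LogicalDevice
```

Each key is a parent class name and each value is the list of its direct children, which is all a tree renderer needs to walk down from the "no parent" node.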

The page is not mobile optimized.

Pan around with the mouse.

Zoom in and out with the mouse's scroll wheel.

Search for a specific WMI class in the root\cimv2 namespace. The exact name is needed for the search to work. Enter the name, click "go" and it will take you to the class in the visualization. The search does not work with partial names. It is case-insensitive though and there is an autocomplete that should get you the class you're looking for. (The auto-complete population does do partial names.)

The code isn't the prettiest; I put it together really quickly. There is much to be improved. I thought about modifying it to show all the associations between the classes...

Tuesday, July 31, 2018

Hierarchical View of S2D Storage Fault Domains

You can view your storage fault domains by running the following command:
Get-StorageFaultDomain

The problem is that this is just a flat view of everything in your S2D cluster.

If you want to view specific fault domains you can use the -Type parameter on Get-StorageFaultDomain.

For example, to view all the nodes you can run:
Get-StorageFaultDomain -Type StorageScaleUnit

The options for the type parameter are as follows:

StorageSite
StorageRack
StorageChassis
StorageScaleUnit
StorageEnclosure
PhysicalDisk

This is in order from broadest to most specific. Most are self-explanatory.

  • Site represents different physical locations/datacenters.
  • Rack represents different racks in the datacenter.
  • Chassis isn't obvious at first. It's only used if you have blade servers and it represents the chassis that all the blade servers go into.
  • ScaleUnit is your node or your server.
  • Enclosure is if you have multiple backplanes or storage daisy chained to your server.
  • Disk is each physical disk.

The default fault domain awareness is the storage scale unit. So data is distributed and made resilient across nodes. If you have multiple blade enclosures, racks or datacenters you can change this so that you can withstand the failure of any one of those things.

I'm not sure if you can or would want to change the fault domain awareness to StorageEnclosure or PhysicalDisk?

These fault domains are hierarchical. A disk belongs to an enclosure, an enclosure belongs to a node (StorageScaleUnit), a node belongs to a chassis, a chassis belongs to a rack and a rack belongs to a site.

Since most people are just using node fault domains I made the following script to show your fault domains in a hierarchical layout beneath StorageScaleUnit. The operational status for each fault domain is included.

$Tab = "`t"  # indent unit; define it first, otherwise $Tab is empty
Get-StorageFaultDomain -Type StorageScaleUnit | %{
    # Node (scale unit)
    Write-Host $Tab $Tab $_.FriendlyName - $_.OperationalStatus
    $_ | Get-StorageFaultDomain -Type StorageEnclosure | %{
        # Enclosures attached to this node
        Write-Host $Tab $Tab $Tab $Tab $_.UniqueID - $_.FriendlyName - $_.OperationalStatus
        $_ | Get-StorageFaultDomain -Type PhysicalDisk | %{
            # Disks in this enclosure
            Write-Host $Tab $Tab $Tab $Tab $Tab $Tab $_.SerialNumber - $_.FriendlyName - $_.PhysicalLocation - $_.OperationalStatus
        }
    }
}

You could easily modify this to add another level if you utilize chassis, rack or site fault domain awareness.
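One way to add levels without nesting more loops is a small recursive helper. This is a rough sketch, not something I've run against every topology: the function name Format-FaultDomainTree and its parameters are made up for illustration, and it assumes (as the script above does) that a fault domain object can be piped into Get-StorageFaultDomain -Type to get its children.

```powershell
# Hypothetical helper: prints a fault domain and, recursively, its children.
# -ChildTypes lists the levels to descend through, broadest first.
function Format-FaultDomainTree {
    param(
        [Parameter(Mandatory)] $Node,
        [string[]] $ChildTypes = @(),
        # How to look up children; replaceable so the function can be tried off-cluster.
        [scriptblock] $GetChildren = {
            param($parent, $type)
            $parent | Get-StorageFaultDomain -Type $type
        },
        [int] $Depth = 0
    )

    # One line per fault domain, indented two spaces per level.
    ('  ' * $Depth) + "$($Node.FriendlyName) - $($Node.OperationalStatus)"

    if ($ChildTypes.Count -eq 0) { return }

    $nextType = $ChildTypes[0]
    $rest     = @($ChildTypes | Select-Object -Skip 1)
    $children = @(& $GetChildren $Node $nextType)

    foreach ($child in $children) {
        Format-FaultDomainTree -Node $child -ChildTypes $rest -GetChildren $GetChildren -Depth ($Depth + 1)
    }
}

# Usage on an S2D cluster node (commented out so the function can be loaded anywhere):
# Get-StorageFaultDomain -Type StorageRack | ForEach-Object {
#     Format-FaultDomainTree -Node $_ -ChildTypes StorageScaleUnit, StorageEnclosure, PhysicalDisk
# }
```

Starting from a different type (StorageSite, StorageChassis) only means changing the first Get-StorageFaultDomain call and the -ChildTypes list.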

Tuesday, July 10, 2018

The Proper Way to Take a Storage Spaces Direct Server Offline for Maintenance?

Way back in September 2017 a Microsoft update changed what happens when you suspend (pause) or resume a node in an S2D cluster. I don't think the article "Taking a Storage Spaces Direct Server offline for maintenance" has been updated to reflect this change.

Prior to the September 2017 update, when you suspended a node, either via the PowerShell Suspend-ClusterNode cmdlet or via the Failover Cluster Manager GUI Pause option, the operation would put all the disks on that node in maintenance mode. Then when you resumed the node, the resume operation would take the disks out of maintenance mode.

The current suspend/resume logic does nothing with the disks. If you suspend a node its disks don't go into maintenance mode, and if you resume the node nothing is done to the disks.

I postulate that after you suspend/drain the node, and prior to shutting it down or restarting it, you need to put the disks for that node into maintenance mode. This can be done with the following PowerShell command:

Get-StorageFaultDomain -Type StorageScaleUnit | Where-Object {$_.FriendlyName -eq "node1"} | Enable-StorageMaintenanceMode

Be sure to change "node1" to the name of the node you've suspended in the above PowerShell snippet.

When the node is powered on/rebooted, prior to resuming the node, you need to take the disks for that node out of maintenance mode. This can be done with the following command:

Get-StorageFaultDomain -Type StorageScaleUnit | Where-Object {$_.FriendlyName -eq "node1"} | Disable-StorageMaintenanceMode
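Putting the two halves together, the sequence could be wrapped in a pair of helper functions. Treat this as a sketch of the procedure described above, not a tested runbook: the function names Enter-S2DNodeMaintenance and Exit-S2DNodeMaintenance are made up for illustration, and it assumes the FailoverClusters and Storage modules are available on the machine where it runs.

```powershell
# Hypothetical helper: run BEFORE shutting down or restarting the node.
function Enter-S2DNodeMaintenance {
    param([Parameter(Mandatory)] [string] $NodeName)

    # Pause the node and drain its roles; -Wait blocks until the drain finishes.
    Suspend-ClusterNode -Name $NodeName -Drain -Wait

    # Put the disks behind that node into storage maintenance mode.
    Get-StorageFaultDomain -Type StorageScaleUnit |
        Where-Object { $_.FriendlyName -eq $NodeName } |
        Enable-StorageMaintenanceMode
}

# Hypothetical helper: run AFTER the node is back up, before it takes roles again.
function Exit-S2DNodeMaintenance {
    param([Parameter(Mandatory)] [string] $NodeName)

    # Take the disks out of maintenance mode first...
    Get-StorageFaultDomain -Type StorageScaleUnit |
        Where-Object { $_.FriendlyName -eq $NodeName } |
        Disable-StorageMaintenanceMode

    # ...then resume the node in the cluster.
    Resume-ClusterNode -Name $NodeName
}

# Usage (commented out; replace node1 with your node name):
# Enter-S2DNodeMaintenance -NodeName node1
# ... reboot / patch the node ...
# Exit-S2DNodeMaintenance -NodeName node1
```

The ordering is the whole point: drain first, then maintenance mode, and the reverse on the way back.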

The reason I'm thinking this should be done is that if you just reboot the node without putting the disks in maintenance mode then the cluster will behave as if it lost that node. Things will recover eventually, but timeouts may occur, and if your system is extremely busy with IO bad things could happen (VMs rebooting, CSVs moving, etc.). I'm thinking it's better to put the disks for the node in maintenance mode so all the other nodes know what's going on and the recovery logic doesn't need to kick in. Think of it this way: it's better to tell all the nodes what's going on than to make them figure it out for themselves... I need to test this theory some more...

Update: It looks like the May 8th 2018 update (KB4103723) "introduced SMB Resilient Handles for the S2D intra-cluster network to improve resiliency to transient network failures. This had some side effects in increased timeouts when a node is rebooted, which can effect a system under load. Symptoms include event ID 5120’s with a status code of STATUS_IO_TIMEOUT or STATUS_CONNECTION_DISCONNECTED when a node is rebooted." The above procedure is the workaround until a fix is available.

Another symptom, at least for me, was connections to a guest SQL cluster timing out. When a node was rebooted prior to the May update everything was fine. After applying the May update and rebooting a node, SQL timeouts would occur.

Update: MS released an article on this. For the time being, it would seem you should put the disks in maintenance mode prior to rebooting a node. It also appears you might want to disable live dumps.


Friday, June 29, 2018

View Physical Disks by Node in S2D Cluster

Quick snippets to view physical disks by node in an S2D cluster:

Get-StorageNode |%{$_.Name;$_ | Get-PhysicalDisk -PhysicallyConnected}

The following is useful if you're looking at performance monitor and you're trying to figure out which device number is which:

gwmi -Namespace root\wmi ClusPortDeviceInformation | sort ConnectedNode,ConnectedNodeDeviceNumber,ProductId | ft ConnectedNode,ConnectedNodeDeviceNumber,ProductId,SerialNumber

Thursday, May 17, 2018

Troubleshooting S2D / Clusters

So you have a problem with S2D or you just fixed a problem and you want to try and figure out why it happened so it doesn't happen again. Execute the Get-ClusterDiagnosticInfo PowerShell script. It will create a zip file in c:\users\<username>\ that contains the cluster logs and all relevant events and settings. Do this right away so you have the data. You can always analyze it at a later date.

Sunday, May 13, 2018

S2D Replacing PhysicalDisk Quick Reference

List physical disks to find the failed disk. Note its serial number.
Get-PhysicalDisk
Get-PhysicalDisk | Get-StorageReliabilityCounter

List virtual disks that use the drive, remember them for later.
Get-PhysicalDisk -SerialNumber A1B2C3D4 | Get-VirtualDisk

"Retire" the Physical Disk to mark the drive as inactive, so that no further data will be written to it.
$Disk = Get-PhysicalDisk -SerialNumber A1B2C3D4
$Disk | Set-PhysicalDisk -Usage Retired

S2D should start to rebuild the virtual disks that utilized the drive.
Get-StorageJob

To be extra safe, run the following on each of the virtual disks that was listed above.
Repair-VirtualDisk -FriendlyName 'VirtualDiskX'

These storage jobs will likely take some time.
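If you'd rather not keep re-running Get-StorageJob by hand, a small polling loop can wait for the jobs to finish. This is a sketch only: the function name Wait-StorageJobCompletion and its parameters are made up for illustration, and it assumes JobState compares as the same text shown in the Get-StorageJob output (Running, Suspended, Completed, ...).

```powershell
# Hypothetical helper: polls storage jobs until none are left unfinished.
# Returns $true when everything is Completed (or no jobs exist),
# $false if -MaxPolls is reached first.
function Wait-StorageJobCompletion {
    param(
        # How to fetch jobs; replaceable so the loop can be tried off-cluster.
        [scriptblock] $GetJobs = { Get-StorageJob },
        [int] $PollSeconds = 30,
        [int] $MaxPolls = 0   # 0 = keep polling with no limit
    )

    $polls = 0
    while ($true) {
        # Anything not yet Completed counts as still active.
        $active = @(& $GetJobs | Where-Object { $_.JobState -ne 'Completed' })
        if ($active.Count -eq 0) { return $true }

        $polls++
        if ($MaxPolls -gt 0 -and $polls -ge $MaxPolls) { return $false }

        Start-Sleep -Seconds $PollSeconds
    }
}

# Usage on a cluster node (commented out):
# if (Wait-StorageJobCompletion -PollSeconds 60 -MaxPolls 120) { 'Storage jobs finished' }
```

Setting -MaxPolls is worth doing, because a job stuck in a state such as Suspended or Killed would otherwise keep the loop waiting indefinitely.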

Remove the retired drive from the storage pool.
Get-StoragePool *S2D* | Remove-PhysicalDisk -PhysicalDisks $Disk


Physically remove the bad disk.
Physically add a new disk (you could perform this first if you have empty drive bays) and check to see if it was added to the storage pool.
Get-PhysicalDisk | ? CanPool -eq True

If nothing is returned it should have been added to the pool, this is what you want as S2D should claim all disks.

If it wasn't added to the pool try the following:
$newDisk = Get-PhysicalDisk | ? CanPool -eq True
Get-StoragePool *S2D* | Add-PhysicalDisk -PhysicalDisks $newDisk -Verbose

Find the new disk's serial number and then see if any virtual disks are using it. None should be yet.
Get-PhysicalDisk -SerialNumber NEWSNBR | Get-VirtualDisk

Rebalance storage pool
Get-StoragePool *S2D* | Optimize-StoragePool
Get-VirtualDisk | Repair-VirtualDisk

Now the virtual disks should be using it.
Get-PhysicalDisk -SerialNumber NEWSNBR | Get-VirtualDisk


Friday, May 11, 2018

Resize S2D Volume Quick Reference

Get-VirtualDisk vd01
Get-VirtualDisk vd01 | Get-StorageTier | Resize-StorageTier -Size 1.5TB
Get-VirtualDisk vd01


Get-VirtualDisk vd01| Get-Disk | Get-Partition | Get-Volume
$VirtualDisk = Get-VirtualDisk vd01
$Partition = $VirtualDisk | Get-Disk | Get-Partition | Where PartitionNumber -Eq 2
$Partition | Resize-Partition -Size ($Partition | Get-PartitionSupportedSize).SizeMax
Get-VirtualDisk vd01| Get-Disk | Get-Partition | Get-Volume

Thursday, May 10, 2018

The Case Against 2-Node S2D Solutions and 2-Way Mirroring

Update: 11/8/2018 Documentation is out for Windows Server 2019 that shows how MS solved the problem. This doesn't solve the problem though if you still decide to utilize 2-way mirroring. Just don't do it. Read on if you want to see what the problem was.


So I've got a two-node S2D cluster cooking. The last two times I patched it one of the volumes lost redundancy (just one volume). The first time it happened I couldn't figure out how to fix it. I ended up blowing the volume away, creating a new one and restoring from backups. This led me down the path of trying to figure out how to fix this issue in the future, which led to this blog post.

The second time my volume lost redundancy after rebooting a server I thought I was ready for it, since I had figured out how to resolve the no redundancy state. Preparing for the worst though, I copied all the VMs off of it so I would have a more recent state than the one from backup. All of the VMs copied except for one. I don't recall the error message it gave me but I think it said something about being unable to read from the disk. This should have been my first clue as to the root cause.

In any case, I had a volume with no redundancy and I attempted the steps I discovered to recover the volume. It didn't work. No matter what I tried. I ended up blowing away the volume again, recreating a new volume and restoring the VMs.

After further investigation it would appear that one of the disks is going bad. I determined this by running the following:

> Get-PhysicalDisk | Get-StorageReliabilityCounter

DeviceId Temperature ReadErrorsUncorrected Wear PowerOnHours
-------- ----------- --------------------- ---- ------------
2                    0                     0    114
5000                                       0    1430
5012                 648                   0    1064
5004                                       0    1417
5006                 0                     0    1051
5010                 0                     0    1064
5009                 0                     0    1051
5003                                       0    1417
5011                 0                     0    1064
5008                 0                     0    1050
5013                 0                     0    1064
5007                 0                     0    1050
5001                                       0    1430

When I run Get-PhysicalDisk all the disks return healthy though. So, the disk is starting to have issues but not enough for the system to think the disk is total garbage yet?

Turns out when I restarted the server without the failing disk, the server WITH the failing disk was the only source of data and it couldn't read from all portions of the failing drive. Hence the no redundancy state. I'm thinking it couldn't read a sector and it couldn't find a redundant copy of the data. Now if this was a three-node cluster with 3-way mirroring it could have read from the tertiary copy.

I'm not sure why S2D doesn't take a more proactive approach to resolve the failing disk or at least highlight it more. I'm also not sure why it wouldn't allow me to attach the disk after both nodes were back online. Perhaps the "There is not enough redundancy remaining to repair the virtual disk" warning was because S2D wanted to try and move the data but I needed to add another disk? I was only using 4 TB out of 24TB though, you'd think S2D could move everything off the failing disk to the available space... Perhaps it couldn't attach the disk because changed data could not be replayed from the failing disk to restore the mirror?

I would rather S2D evict the disk right away and make the issue at hand obvious. Or create another health state that indicates a physical disk is in a failing state and trickle that status all the way up to the virtual disks and volumes as unhealthy. Or give us an option to set a threshold for the number of unrecoverable read errors (UREs) at which a disk is failed.

Long story short, check your disks and make sure none of them are going bad before you reboot servers if you have a two-node S2D cluster or if you implement two-way mirroring. Also, if you do encounter the no redundancy state it's best to copy as much data off of it as you can before trying to fix it.
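A pre-reboot check along those lines can be a simple filter over the reliability counters. This is a minimal sketch: the function name Select-SuspectDisk and its -Threshold parameter are made up for illustration, and it only looks at the ReadErrorsUncorrected column shown in the output above.

```powershell
# Hypothetical filter: passes through only the reliability counters whose
# uncorrected read error count is above the threshold. Blank (null) values
# are treated as "nothing reported" and are not flagged.
function Select-SuspectDisk {
    param(
        [Parameter(ValueFromPipeline)] $Counter,
        [int] $Threshold = 0
    )
    process {
        if ($Counter.ReadErrorsUncorrected -gt $Threshold) { $Counter }
    }
}

# Usage on a cluster node (commented out):
# Get-PhysicalDisk | Get-StorageReliabilityCounter | Select-SuspectDisk |
#     Format-Table DeviceId, ReadErrorsUncorrected, PowerOnHours -AutoSize
```

With the counters shown earlier, only DeviceId 5012 (648 uncorrected read errors) would be returned. An empty result is what you want to see before rebooting a node.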

Another takeaway from this is that it would seem wiser to create more, smaller volumes instead of fewer, larger volumes, of course keeping the number of volumes a multiple of your node count.


Update: I got the following response from Microsoft
"The challenge here is that you had a misbehaving drive… and that’s kind of a gray area.  We handle very well when drives work great… and we handle very well when they fail completely.  But when does bad… become bad enough?  And how do we balance not generating false positives that makes you go replacing drives unnecessarily, and pointlessly wasting money.   With that said, this is an area we are working on.  In Windows Server 2019 we are making enhancements to our Health Service to add what we term marginal drive handling right now (we’ll come up with a better name by ship).

We also hear the feedback that some customers may want higher resiliency out of a 2-node solution, that is another problem we are looking at.  Be mindful that it will come at a cost of reduced efficiency…  but we want to offer customers the choice to do what makes sense for their deployment scenario."
So Microsoft is working on improving the experience with 2-way mirrors and 2-node S2D deployments. I commend the S2D team; they're very responsive to emails and they listen to what customers have to say. I'm excited to see the improvements with Windows Server 2019.

I just wish there was a way to tweak the algorithm that decides when a drive is bad. I'd personally fail it sooner rather than later.

Update: To see total read/write errors and maximum latencies per disk, run:
Get-PhysicalDisk | Get-StorageReliabilityCounter | ft DeviceId,ReadErrorsTotal,ReadLatencyMax,WriteErrorsTotal,WriteLatencyMax -AutoSize

Thursday, May 3, 2018

Hyper-V SET and NLB in Guest, Duplicate Frames/Packets

Just a quick blurb about switch embedded teaming and Microsoft network load balancing in guest virtual machines. If you've got NLB set up in multicast operational mode running in a VM on top of a Hyper-V host that has SET configured, your VM will receive duplicate frames/packets. It would appear that all the NICs in the SET receive the data and pass it up the network stack. If your upper layer network protocols handle duplicate packets you don't necessarily have to worry about it. ICMP ping, for example, does not, and you will receive multiple responses from your VM. For NLB I would abandon SET and use active/passive NIC teaming.

Thursday, April 19, 2018

S2D Recovering a Detached Virtual Disk with No Redundancy

So you've got your S2D cluster and you lost redundancy. Now what? How can this happen? Well, if you restart too many nodes this can happen. Let's walk through a sample scenario and see what happens.


Jump to the bottom of the article if you want to skip all the fluff and get right to the fix.

First the setup:
For this test I set up a 2-node cluster and created two mirrored virtual disks, vd01 and vd02. I set up volumes on them and imported a couple of VMs onto each volume so that the system would be generating IO. Check the Operational Status and the Health Status of the virtual disks by running Get-VirtualDisk. They should all be healthy. I use this script to constantly refresh the virtual disk health and the storage jobs.
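For readers without that script handy, a refresh loop of this kind might look like the following. To be clear, this is not the script referred to above, just a minimal stand-in: the function name Watch-StorageHealth and its parameters are made up for illustration, and the column selection simply mirrors the output shown in this post.

```powershell
# Hypothetical helper: redraws virtual disk health and storage jobs on a timer.
function Watch-StorageHealth {
    param(
        [int] $IntervalSeconds = 5,
        [int] $Iterations = 0   # 0 = run until interrupted with Ctrl+C
    )

    $i = 0
    while ($Iterations -eq 0 -or $i -lt $Iterations) {
        Clear-Host

        # Current state of every virtual disk.
        Get-VirtualDisk |
            Format-Table FriendlyName, OperationalStatus, HealthStatus, IsManualAttach, Size -AutoSize |
            Out-Host

        # Repair/rebuild jobs and their progress.
        Get-StorageJob |
            Format-Table Name, JobState, PercentComplete, BytesProcessed, BytesTotal -AutoSize |
            Out-Host

        Start-Sleep -Seconds $IntervalSeconds
        $i++
    }
}

# Usage on a cluster node (commented out):
# Watch-StorageHealth -IntervalSeconds 2
```

Leaving it running in a separate window makes the state changes during the failure and recovery steps much easier to follow.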

Next the failure:
Restart one of the nodes. The virtual disks should show a warning and their operational status should change to Degraded or {Degraded, Incomplete}. You should also see some repair jobs in the Suspended state.

Name   IsBackgroundTask ElapsedTime JobState  PercentComplete BytesProcessed BytesTotal
----   ---------------- ----------- --------  --------------- -------------- ----------
Repair True             00:00:12    Suspended 0               0              29527900160
Repair True             00:04:06    Suspended 0               0              36775657472



FriendlyName ResiliencySettingName OperationalStatus      HealthStatus IsManualAttach Size
------------ --------------------- -----------------      ------------ -------------- ----
vd02                               {Degraded, Incomplete} Warning      True           1 TB
vd01                               {Degraded, Incomplete} Warning      True           1 TB

Once the node comes back online the storage jobs should start running and the disks will be in service. You may see a state of degraded too while storage jobs run. The data that has changed while the node was down is being rebuilt.

Name   IsBackgroundTask ElapsedTime JobState  PercentComplete BytesProcessed BytesTotal
----   ---------------- ----------- --------  --------------- -------------- ----------
Repair True             00:00:04    Suspended 0               0              19058917376
Repair True             00:03:03    Running   16              6276775936     37580963840



FriendlyName ResiliencySettingName OperationalStatus HealthStatus IsManualAttach Size
------------ --------------------- ----------------- ------------ -------------- ----
vd02                               InService         Warning      True           1 TB
vd01                               InService         Warning      True           1 TB

Now before the storage jobs finish rebuilding the redundancy, restart the other node. The disks will likely go into a detached operational status.

Name   IsBackgroundTask ElapsedTime JobState PercentComplete BytesProcessed BytesTotal
----   ---------------- ----------- -------- --------------- -------------- ----------
Repair False            00:00:00    Killed   0



FriendlyName ResiliencySettingName OperationalStatus HealthStatus IsManualAttach Size
------------ --------------------- ----------------- ------------ -------------- ----
vd02                               Detached          Unknown      True           1 TB
vd01                               Detached          Unknown      True           1 TB

You may see a status that says No Redundancy.

Name   IsBackgroundTask ElapsedTime JobState  PercentComplete BytesProcessed BytesTotal
----   ---------------- ----------- --------  --------------- -------------- ----------
Repair True             00:03:12    Suspended 0               0              5368709120



FriendlyName ResiliencySettingName OperationalStatus HealthStatus IsManualAttach Size
------------ --------------------- ----------------- ------------ -------------- ----
vd02                               No Redundancy     Unhealthy    True           1 TB
vd01                               Detached          Unknown      True           1 TB

In this case vd02 is still online but not accessible from C:\ClusterStorage\. If you try to run a repair on the virtual disk with "No Redundancy" you get the following:

PS C:\Users\administrator.SHORELAND_NT> Get-VirtualDisk vd02 | Repair-VirtualDisk
Repair-VirtualDisk : There is not enough redundancy remaining to repair the virtual disk.
Activity ID: {64afdbc9-9ce4-4108-9aac-f4da6d277585}
At line:1 char:24
+ Get-VirtualDisk vd02 | Repair-VirtualDisk
+                        ~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (StorageWMI:ROOT/Microsoft/...SFT_VirtualDisk) [Repair-VirtualDisk], CimEx
   ception
    + FullyQualifiedErrorId : StorageWMI 50001,Repair-VirtualDisk

This test system is a clean build with terabytes of storage so it's not a space issue. If your virtual disk only says "No Redundancy" you may want to wait a bit and/or try to offline and then online the disk. This has fixed it before for me.

The same process for recreating the failure will apply for a 3 node or greater setup. Depending on your redundancy though you may have to fail multiple nodes at a time.

For the detached disks, in the Failover Cluster Manager (FCM) you will see that the virtual disks are in a failed state. If you try to bring the virtual disk online through the FCM you'll get an error that says "The system cannot find the drive specified." Error Code 0x8007000F

If you try to connect the virtual disk through powershell you'll get:

Get-VirtualDisk | Where-Object -Filter { $_.OperationalStatus -eq "Detached" } | Connect-VirtualDisk

Connect-VirtualDisk : Access denied

Extended information:
Access is denied.

Recommended Actions:
- Check if you have the necessary privileges to perform the operation.
- Perform the operation from Failover Cluster Manager if the resource is clustered.

Activity ID: {583d3820-dacb-4246-93cf-b52d05d17911}
At line:1 char:82
+ ... -Filter { $_.OperationalStatus -eq "Detached" } | Connect-VirtualDisk
+                                                       ~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : PermissionDenied: (StorageWMI:ROOT/Microsoft/...SFT_VirtualDisk) [Connect-VirtualDisk],
   CimException

    + FullyQualifiedErrorId : StorageWMI 40001,Connect-VirtualDisk

Finally the solution:
Here is what you need to do.

1. Remove all the disks and the pool from the cluster
  • In the failover cluster manager select "Pools" on the left under storage
  • Select your storage pool in the top pane and then the virtual disks tab on the bottom pane
  • Right click each virtual disk and "Remove from Cluster Shared Volumes"
  • Right click each virtual disk and "Remove"
  • Right click the storage pool in the top pane and "Remove"
2. Go to the Server Manager, File and Storage Services, locate the storage pool, right click it and choose the option “Set Read-Write Access”. Choose one of the nodes. I would choose the server that you have the Server Manager pulled up on. This should allow the single node you selected to have control over the storage pool. This is the key.

3. Select the failed virtual disks, right click and try to attach. It will likely fail but it's going to start the repair process automatically. You can watch the status of the repair with Get-VirtualDisk and Get-StorageJob. Or again you can use the script I created.

Name   IsBackgroundTask ElapsedTime JobState PercentComplete BytesProcessed BytesTotal
----   ---------------- ----------- -------- --------------- -------------- ----------
Repair True             00:00:09    Running  0               0              20937965568
Repair True             00:01:46    Running  16              8020033536     47513075712



FriendlyName ResiliencySettingName OperationalStatus          HealthStatus IsManualAttach Size
------------ --------------------- -----------------          ------------ -------------- ----
vd02                               {No Redundancy, InService} Unhealthy    True           1 TB
vd01                               {No Redundancy, InService} Unhealthy    True           1 TB

The Server Manager should show a warning icon next to the virtual disk while it repairs. You may have to refresh.

4. Once the repair is done the virtual disk's operational status should go to OK and its Health Status to Healthy. The jobs should complete and there should be no more running jobs. The failed and warning icons should go away in the Server Manager. You may have to refresh. You will be able to attach the virtual disk now.

5. Re-add the pool and the virtual disks to the cluster again.
  • In the failover cluster manager select "Pools" on the left under storage
  • Right click "Pools" and select "Add Storage Pool". Select your pool and hit Ok.
  • Select your storage pool in the top pane, right click "Add Virtual Disk". Select all of your virtual disks and hit Ok.
  • Right click each virtual disk and "Add to Cluster Shared Volumes"
6. Start your VMs back up. You may have to online the VM's resources if they're not online.

The downside to this process shows up when only one virtual disk is in a failed/detached/no redundancy state and all the others are fine. You have to take all the virtual disks and the pool out of the cluster to perform the recovery. Healthy virtual disks will be down too (not in the cluster and not exposed as a CSV), so you may have VMs that have to be down even though their underlying storage is healthy. You could move these to a different location prior to performing the above procedure. Just something to be aware of.

4.20.2018 Updated and Better Solution:
After I wrote this post, I found the following blog post and gave it a shot. This solution seems to do the trick and it doesn't require taking the storage pool offline or any of your virtual disks that are in perfectly working shape! The gist of it is this:

Remove the disk from CSV. Set the diskrunchkdsk and diskrecoveryaction cluster parameters on the disk. Start the disk and start the recovery. Let the recovery job finish. Then stop the disk, revert the cluster parameter settings, add it back to CSV and start the disk.

Remove-Clustersharedvolume -name "Cluster Disk 1"

Get-ClusterResource -Name "Cluster Disk 1" | Set-ClusterParameter -Name diskrunchkdsk -Value 7
Get-ClusterResource -Name "Cluster Disk 1" | Set-ClusterParameter -Name diskrecoveryaction -Value 1
Start-clusterresource -Name "Cluster Disk 1"

Get-ScheduledTask -TaskName "Data Integrity Scan for Crash Recovery" | Start-ScheduledTask

Storage jobs will start, once they finish run:

Stop-clusterresource -Name "Cluster Disk 1"

Get-ClusterResource -Name "Cluster Disk 1" | Set-ClusterParameter -Name diskrecoveryaction -Value 0
Get-Clusterresource -Name "Cluster Disk 1" | set-clusterparameter -name diskrunchkdsk -value 0

Add-clustersharedvolume -Name "Cluster Disk 1"
Start-clusterresource -Name "Cluster Disk 1"



Removing the CSV, adding the CSV and starting/stopping the resource can also be done via the FCM.

5.11.2018 Update
If you have a 2-node cluster or if you're using 2-way mirroring, read this. You might have a disk that is failing and you might not be able to recover the virtual disk/volume. Check for unrecoverable read errors.