Thursday, May 10, 2018

The Case Against 2-Node S2D Solutions and 2-Way Mirroring

Update (11/8/2018): Documentation is out for Windows Server 2019 that shows how Microsoft addressed the problem. It doesn't help, though, if you still decide to use 2-way mirroring. Just don't do it. Read on if you want to see what the problem was.

So I've got a two-node S2D cluster cooking. The last two times I patched it, one of the volumes lost redundancy (just one volume). The first time it happened I couldn't figure out how to fix it. I ended up blowing the volume away, creating a new one and restoring from backups. This led me down the path of trying to figure out how to fix this issue in the future, which led to this blog post.

The second time my volume lost redundancy after rebooting a server, I thought I was ready for it, since I had figured out how to resolve the no redundancy state. Preparing for the worst though, I copied all the VMs off of it so I would have a more recent state than the one in backup. All of the VMs copied except for one. I don't recall the error message it gave me, but I think it said something about being unable to read from the disk. This should have been my first clue as to the root cause.

In any case, I had a volume with no redundancy and I attempted the steps I discovered to recover the volume. It didn't work. No matter what I tried. I ended up blowing away the volume again, recreating a new volume and restoring the VMs.

After further investigation it would appear that one of the disks is going bad. I determined this by running the following:

> Get-PhysicalDisk | Get-StorageReliabilityCounter

DeviceId Temperature ReadErrorsUncorrected Wear PowerOnHours
-------- ----------- --------------------- ---- ------------
2                    0                     0    114
5000                                       0    1430
5012                 648                   0    1064
5004                                       0    1417
5006                 0                     0    1051
5010                 0                     0    1064
5009                 0                     0    1051
5003                                       0    1417
5011                 0                     0    1064
5008                 0                     0    1050
5013                 0                     0    1064
5007                 0                     0    1050
5001                                       0    1430

When I run Get-PhysicalDisk all the disks return healthy though. So, the disk is starting to have issues but not enough for the system to think the disk is total garbage yet?
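
To make that mismatch visible, the health status and the reliability counters can be put side by side. What follows is only a rough sketch, not an official health check: Find-SuspectDisk is a hypothetical helper name I made up, and treating any nonzero ReadErrorsUncorrected as suspect is my own assumption.

```powershell
# Hypothetical helper (not part of the Storage module): returns the records whose
# uncorrected read errors exceed a threshold. It only reads properties, so it
# also works on hand-built objects for a dry run.
function Find-SuspectDisk {
    param(
        [object[]] $Disk = @(),
        [int] $MaxReadErrorsUncorrected = 0   # assumption: any error at all is suspect
    )
    foreach ($d in $Disk) {
        if ($d.ReadErrorsUncorrected -gt $MaxReadErrorsUncorrected) { $d }
    }
}

# Only runs where the Storage cmdlets exist (for example, on a cluster node).
if (Get-Command Get-PhysicalDisk -ErrorAction SilentlyContinue) {
    $rows = foreach ($disk in Get-PhysicalDisk) {
        $counter = $disk | Get-StorageReliabilityCounter
        [pscustomobject]@{
            FriendlyName          = $disk.FriendlyName
            SerialNumber          = $disk.SerialNumber
            HealthStatus          = $disk.HealthStatus
            ReadErrorsUncorrected = $counter.ReadErrorsUncorrected
        }
    }
    Find-SuspectDisk -Disk $rows | Format-Table -AutoSize
}
```

A disk that shows up here with HealthStatus still reading Healthy is the kind of drive this post is about. Disks that don't report the counter at all (the blank cells in the table above) are simply skipped.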

Turns out when I restarted the server without the failing disk, the server WITH the failing disk was the only source of data, and it couldn't read from all portions of the failing drive. Hence the no redundancy state. I'm thinking it couldn't read a sector and it couldn't find a redundant copy of the data. Now if this was a three-node cluster with 3-way mirroring, it could have read from the tertiary copy.

I'm not sure why S2D doesn't take a more proactive approach to resolve the failing disk, or at least highlight it more. I'm also not sure why it wouldn't allow me to attach the disk after both nodes were back online. Perhaps the "There is not enough redundancy remaining to repair the virtual disk" warning was because S2D wanted to try to move the data but I needed to add another disk? I was only using 4 TB out of 24 TB though; you'd think S2D could move everything off the failing disk to the available space... Perhaps it couldn't attach the disk because changed data could not be replayed from the failing disk to restore the mirror?

I would rather S2D evict the disk right away and make the issue at hand obvious. Or it could add another health state that indicates a physical disk is failing and propagate that status all the way up to the virtual disks and volumes as unhealthy. Or give us an option to set a threshold on the number of UREs (unrecoverable read errors) before a disk is failed.

Long story short, check your disks and make sure none of them are going bad before you reboot servers if you have a two-node S2D cluster or if you implement two-way mirroring. Also, if you do encounter the no redundancy state, it's best to copy as much data off of the volume as you can before trying to fix it.
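
That pre-reboot check could be scripted along these lines. This is a sketch under my own assumptions, not something Microsoft ships: Get-RebootBlocker is a hypothetical name, and the three conditions it looks at (unhealthy virtual disks, unfinished storage jobs, nonzero ReadErrorsUncorrected) are just the ones that would have saved me here.

```powershell
# Hypothetical pre-reboot gate for a two-node S2D cluster.
# Returns a list of reasons NOT to reboot; an empty list means no blocker was found.
function Get-RebootBlocker {
    param(
        [object[]] $VirtualDisk = @(),
        [object[]] $StorageJob = @(),
        [object[]] $Counter = @()
    )
    $reasons = @()
    foreach ($vd in $VirtualDisk) {
        if ($vd.HealthStatus -ne 'Healthy') {
            $reasons += "Virtual disk '$($vd.FriendlyName)' is $($vd.HealthStatus)"
        }
    }
    foreach ($job in $StorageJob) {
        # Assumption: anything not yet completed (e.g. a running repair) should finish first.
        if ($job.JobState -ne 'Completed') {
            $reasons += "Storage job '$($job.Name)' is still $($job.JobState)"
        }
    }
    foreach ($c in $Counter) {
        if ($c.ReadErrorsUncorrected -gt 0) {
            $reasons += "Disk $($c.DeviceId) has $($c.ReadErrorsUncorrected) uncorrected read errors"
        }
    }
    return ,$reasons
}

# Only runs where the Storage cmdlets exist (for example, on a cluster node).
if (Get-Command Get-VirtualDisk -ErrorAction SilentlyContinue) {
    $blockers = Get-RebootBlocker -VirtualDisk (Get-VirtualDisk) `
                                  -StorageJob (Get-StorageJob) `
                                  -Counter (Get-PhysicalDisk | Get-StorageReliabilityCounter)
    if ($blockers.Count -gt 0) { $blockers } else { 'No blockers found.' }
}
```

An empty result only means none of these particular checks tripped; it is not a guarantee that the surviving node can read every sector it needs.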

Another takeaway from this is that it would seem wiser to create a larger number of small volumes instead of a few large ones, while of course keeping the number of volumes a multiple of your node count. That way a single bad disk puts less data at risk in any one volume.

Update: I got the following response from Microsoft
"The challenge here is that you had a misbehaving drive… and that’s kind of a gray area.  We handle very well when drives work great… and we handle very well when they fail completely.  But when does bad… become bad enough?  And how do we balance not generating false positives that makes you go replacing drives unnecessarily, and pointlessly wasting money.   With that said, this is an area we are working on.  In Windows Server 2019 we are making enhancements to our Health Service to add what we term marginal drive handling right now (we’ll come up with a better name by ship).

We also hear the feedback that some customers may want higher resiliency out of a 2-node solution, that is another problem we are looking at.  Be mindful that it will come at a cost of reduced efficiency…  but we want to offer customers the choice to do what makes sense for their deployment scenario."
So Microsoft is working on improving the experience with 2-way mirrors and 2-node S2D deployments. I commend the S2D team; they're very responsive to emails and they listen to what customers have to say. I'm excited to see the improvements with Windows Server 2019.

I just wish there was a way to tweak the algorithm that decides when a drive is bad. I'd personally fail it sooner rather than later.

Update: This shows the read/write error totals and maximum latencies per disk:

> Get-PhysicalDisk | Get-StorageReliabilityCounter | ft DeviceId,ReadErrorsTotal,ReadLatencyMax,WriteErrorsTotal,WriteLatencyMax -AutoSize


  1. Hi, do you have any document to do s2d in good manner, can you please share it

  2. I don't really have anything but check out these resources:

    People are very helpful on the slack channels:

  3. Wouldn't the marginal disk problems be resolved by using a hardware array controller in each node? That way the OS would always think the disk is good until enough drives fail to compromise the raid.