Thursday, February 23, 2017

S2D Continually Refresh Job and Disk Status

In Storage Spaces Direct you can run Get-StorageJob to see the progress of rebuild/resync operations. The following PowerShell function continually refreshes the status of the repair jobs and the virtual disks so that you know when things are back to normal.

function RefreshStorageJobStatus {
    while ($true) {
        Get-VirtualDisk | Format-Table        # health of each virtual disk
        Write-Host "-----------"
        Get-StorageJob                        # progress of the repair/rebuild jobs
        Start-Sleep -Seconds 1
        Clear-Host
    }
}

Paste the function into PowerShell, then enter "RefreshStorageJobStatus" to start it. The output should look similar to the following and will refresh every second:

Name   IsBackgroundTask ElapsedTime JobState  PercentComplete BytesProcessed BytesTotal
----   ---------------- ----------- --------  --------------- -------------- ----------
Repair True             00:00:13    Suspended 0               0              7784628224
Repair True             00:00:06    Suspended 0               0              7784628224



FriendlyName ResiliencySettingName OperationalStatus HealthStatus IsManualAttach Size
------------ --------------------- ----------------- ------------ -------------- ----
vd01                               OK                Healthy      True           1 TB
vd03                               Degraded          Warning      True           1 TB
vd02                               Degraded          Warning      True           1 TB
vd04                               OK                Healthy      True           1 TB


You can press ctrl-c to stop the execution.

Monday, February 13, 2017

AD-less S2D cluster bootstrapping

I posted the following question, titled "AD-less S2D cluster bootstrapping - Domain Controller VM on Hyper-converged Storage Spaces Direct", in the Microsoft forums:


Is it a supported scenario to run an AD domain controller in a VM on a hyper-converged S2D cluster? We're looking to deploy a 4-node hyper-converged S2D cluster at a remote site, and we would like to run the site's domain controller on the cluster so we don't need to purchase a fifth server. Will the S2D cluster be able to boot if the network links to the site are down (meaning no other domain controllers are accessible)? I know WS2012 introduced AD-less cluster bootstrapping, but will the underlying mechanics used for storage access in S2D on WS2016 work without AD? Is AD-less S2D cluster bootstrapping a supported scenario?

I did not get a definitive answer from anyone, so I set it up and tested it myself, and it appears to work. I don't know whether it's officially supported, but it does work: the S2D virtual disks and volumes come up without a domain controller, at which point you can start the domain controller VM if it didn't start automatically. I didn't dig into the details, but I have a feeling it's using NTLM authentication and would likely fail if your domain requires Kerberos.
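If you want to sanity-check this yourself after a cold boot with the site links down (no reachable DC), something along these lines from one of the nodes is enough. This is just a sketch; the "DC01" VM name is a placeholder:

# With no domain controller reachable, the cluster and S2D storage should still come up
Get-ClusterNode                                   # all nodes should show as Up
Get-VirtualDisk | Format-Table FriendlyName, OperationalStatus, HealthStatus
Get-ClusterSharedVolume                           # CSVs should be Online

# Once storage is up, start the domain controller VM if it didn't auto-start
Start-VM -Name "DC01"                             # placeholder name for the DC VM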

Friday, January 29, 2016

3 Node Storage Spaces Direct Cluster Works!!!

I went through the following URL https://technet.microsoft.com/en-us/library/mt126109.aspx, but instead of creating a 4-node Storage Spaces Direct cluster, I decided to see whether a 3-node cluster would work. Microsoft's documentation says they will only support Storage Spaces Direct with 4 servers, but I figured it couldn't hurt to try 3 nodes... and it worked!!

I did this all with virtual machines and CTP4, so I skipped the RDMA part and just set up two virtual switches: one for internal traffic and one for external. I had to add some steps so the guests would treat the virtual hard drives as either SSDs or HDDs. I also held off on multi-resilient disks until after testing plain virtual disks.

So I have 3 nodes. Each node has one 400GB "SSD" and one 1TB "HDD".
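For reference, hot-adding the data disks to the guests can be done from the Hyper-V host with something like the following. This is only a sketch: I actually added the 1TB disks and the 400GB disks in two separate batches (as noted in the steps below), and the VHDX paths are placeholders.

# Create and hot-add one 1TB "HDD" and one 400GB "SSD" per guest (the SCSI controller allows hot-add)
foreach ($vm in "s2dtest01","s2dtest02","s2dtest03") {
    New-VHD -Path "D:\VHDs\$vm-hdd01.vhdx" -SizeBytes 1TB -Dynamic
    New-VHD -Path "D:\VHDs\$vm-ssd01.vhdx" -SizeBytes 400GB -Dynamic
    Add-VMHardDiskDrive -VMName $vm -ControllerType SCSI -Path "D:\VHDs\$vm-hdd01.vhdx"
    Add-VMHardDiskDrive -VMName $vm -ControllerType SCSI -Path "D:\VHDs\$vm-ssd01.vhdx"
}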


  1. Install-WindowsFeature -Name File-Services, Failover-Clustering -IncludeManagementTools
    1. #at this point, I hot added the 3 1TB disks to the VMs
  2. Test-Cluster -Node s2dtest01,s2dtest02,s2dtest03 -Include "Storage Spaces Direct",Inventory,Network,"System Configuration"
  3. New-Cluster -Name s2dtest -Node s2dtest01,s2dtest02,s2dtest03 -NoStorage -StaticAddress 192.168.1.213
    1. #ignore warnings
    2. #if disaggregated deployment, ensure ClusterAndClient access w/ Get-ClusterNetwork & Get-ClusterNetworkInterface. Not needed for hyper-converged deployments.
  4. Enable-ClusterS2D
    1. #this is just for SSD and HDD configs
    2. #optional parameters are required for all-flash or NVMe deployments
  5. New-StoragePool -StorageSubSystemName s2dtest.test.local -FriendlyName pool01 -WriteCacheSizeDefault 0 -ProvisioningTypeDefault Fixed -ResiliencySettingNameDefault Mirror -PhysicalDisk (Get-StorageSubSystem -Name s2dtest.test.local | Get-PhysicalDisk)
    1. Get-StoragePool -FriendlyName pool01 | Get-PhysicalDisk  #should see the 3 1TB disks
    2. Get-PhysicalDisk | Where Size -EQ  1097632579584 | Set-PhysicalDisk -MediaType HDD #set the 1TB disks to HDD type
    3. #I hot added the 3 400GB disks to the VMs at this point
    4. Get-StoragePool -IsPrimordial $False | Add-PhysicalDisk -PhysicalDisks (Get-PhysicalDisk -CanPool $True) #add new disks to pool
    5. Get-StoragePool -FriendlyName pool01 | Get-PhysicalDisk  #should see 3 1TB disks and 3 400GB disks, for a total of 6
    6. Get-PhysicalDisk | Where Size -EQ  427617681408 | Set-PhysicalDisk -MediaType SSD #set the 400GB disks to SSD type
  6. Get-StoragePool pool01 | Get-PhysicalDisk |? MediaType -eq SSD | Set-PhysicalDisk -Usage Journal
  7. New-Volume -StoragePoolFriendlyName pool01 -FriendlyName vd01 -FileSystem CSVFS_ReFS -Size 1000GB -ResiliencySettingName Mirror -PhysicalDiskRedundancy 1 -NumberOfColumns 1

#scale out file server…
  1. New-StorageFileServer  -StorageSubSystemName s2dtest.test.local -FriendlyName sofstest -HostName sofstest -Protocols SMB
  2. New-SmbShare -Name share -Path C:\ClusterStorage\Volume1\share\ -FullAccess s2dtest01$, s2dtest02$,s2dtest03$,test\administrator,s2dtest$,sofstest$
  3. Set-SmbPathAcl -ShareName share

Now I tested. Everything continued to work when any one node died! I killed each node one at a time, and the virtual disk, the volume and the SOFS share all stayed up and accessible. Two-way mirroring works with a 3-node S2D setup.
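Checking the state after killing a node takes just a few commands on a surviving node (a sketch; the share path uses the SOFS name and share created above):

Get-ClusterNode                                   # the killed node shows as Down
Get-VirtualDisk vd01 | Format-Table FriendlyName, OperationalStatus, HealthStatus
Get-StorageJob                                    # any repair/regeneration activity shows up here
Test-Path \\sofstest\share                        # the SOFS share should still answer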

Then I created a single-parity space and set up an SOFS share with the following:

  1. New-Volume -StoragePoolFriendlyName pool01 -FriendlyName vd02 -FileSystem CSVFS_ReFS -Size 500GB -ResiliencySettingName Parity -PhysicalDiskRedundancy 1 -NumberOfColumns 3
  2. New-SmbShare -Name share2 -Path C:\ClusterStorage\Volume2\share\ -FullAccess s2dtest01$, s2dtest02$,s2dtest03$,test\administrator,s2dtest$,sofstest$
  3. Set-SmbPathAcl -ShareName share2

It also continued to work when any one node died! I killed each node one at a time, and the virtual disk, the volume and the SOFS share all stayed up and accessible. Single parity works with a 3-node S2D setup.

I then added another 1TB disk to each of the 3 nodes and tried to create a 2-way mirror with 2 columns, a 3-way mirror with 1 column, a 3-way mirror with 2 columns, and a parity space with 6 columns.

  1. Get-StoragePool -IsPrimordial $False | Add-PhysicalDisk -PhysicalDisks (Get-PhysicalDisk -CanPool $True)
  2. Get-PhysicalDisk | Where Size -EQ  1097632579584 | Set-PhysicalDisk -MediaType HDD
  3. Optimize-StoragePool pool01
  4. New-Volume -StoragePoolFriendlyName pool01 -FriendlyName vd03 -FileSystem CSVFS_ReFS -Size 500GB -ResiliencySettingName Mirror -PhysicalDiskRedundancy 2 -NumberOfColumns 1
  5. New-Volume -StoragePoolFriendlyName pool01 -FriendlyName vd04 -FileSystem CSVFS_ReFS -Size 500GB -ResiliencySettingName Mirror -PhysicalDiskRedundancy 1 -NumberOfColumns 2
  6. New-Volume -StoragePoolFriendlyName pool01 -FriendlyName vd05 -FileSystem CSVFS_ReFS -Size 500GB -ResiliencySettingName Mirror -PhysicalDiskRedundancy 2 -NumberOfColumns 2
  7. New-Volume -StoragePoolFriendlyName pool01 -FriendlyName vd06 -FileSystem CSVFS_ReFS -Size 500GB -ResiliencySettingName Parity -PhysicalDiskRedundancy 1 -NumberOfColumns 6
  8. New-SmbShare -Name share3 -Path C:\ClusterStorage\Volume3\share\ -FullAccess s2dtest01$, s2dtest02$,s2dtest03$,test\administrator,s2dtest$,sofstest$
  9. New-SmbShare -Name share4 -Path C:\ClusterStorage\Volume4\share\ -FullAccess s2dtest01$, s2dtest02$,s2dtest03$,test\administrator,s2dtest$,sofstest$
  10. Set-SmbPathAcl -ShareName share3
  11. Set-SmbPathAcl -ShareName share4
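A quick way to see which of these virtual disks actually got created, and with what layout:

Get-VirtualDisk | Format-Table FriendlyName, ResiliencySettingName, NumberOfColumns, PhysicalDiskRedundancy, Size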

Well, the 6-column parity space did not work, and neither did the 3-way mirror with 2 columns; the New-Volume commands simply wouldn't take. That was somewhat expected. It appears the allowable resiliency and column counts depend on the number of fault domains (nodes). The 2-way mirror with 2 columns and the 3-way mirror with 1 column were created, though, and they both continued to work through any single node failure. The 3-way mirror could not withstand a two-node failure though. Perhaps it could withstand losing two disks? Something to try another day. I wanted to see if multi-resilient disks would work in a 3-node S2D cluster with a single-parity space, so I wiped away all the virtual disks and started over:

  1. Remove-SmbShare share
  2. Remove-SmbShare share2
  3. Remove-SmbShare share3
  4. Remove-SmbShare share4
  5. Remove-VirtualDisk vd01
  6. Remove-VirtualDisk vd02
  7. Remove-VirtualDisk vd03
  8. Remove-VirtualDisk vd04
  9. New-StorageTier -StoragePoolFriendlyName pool01 -FriendlyName MT -MediaType HDD -ResiliencySettingName Mirror -NumberOfColumns 2 -PhysicalDiskRedundancy 1
  10. New-StorageTier -StoragePoolFriendlyName pool01 -FriendlyName PT -MediaType HDD -ResiliencySettingName Parity -NumberOfColumns 3 -PhysicalDiskRedundancy 1
  11. $mt = Get-StorageTier MT
  12. $pt = Get-StorageTier PT
  13. New-Volume -StoragePoolFriendlyName pool01 -FriendlyName vd01_multiresil -FileSystem CSVFS_ReFS -StorageTiers $mt,$pt -StorageTierSizes 100GB, 900GB
  14. New-SmbShare -Name share -Path C:\ClusterStorage\Volume1\share\ -FullAccess s2dtest01$, s2dtest02$,s2dtest03$,test\administrator,s2dtest$,sofstest$
  15. Set-SmbPathAcl -ShareName share



The volume appeared to be created successfully. I tested failing each node individually and everything kept working. So in conclusion, it looks like you can build a 3-node Storage Spaces Direct cluster and use multi-resilient disks!!! Granted, you can only sustain one node failure, but that's fine by me.
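To double-check what got built, you can look at the tier templates and the resulting virtual disk with something like this (the tier properties shown are as the Storage cmdlets in the 2016 preview expose them):

# The two tier templates referenced by the multi-resilient volume
Get-StorageTier MT, PT | Format-Table FriendlyName, MediaType, ResiliencySettingName, NumberOfColumns, PhysicalDiskRedundancy

# The multi-resilient virtual disk itself
Get-VirtualDisk vd01_multiresil | Format-Table FriendlyName, OperationalStatus, HealthStatus, Size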


I emailed Microsoft and asked them about supporting 3-node S2D. They said to stay tuned regarding support for 3-node deployments… It sounds and looks like it will be coming!

Storage Spaces and Latent Sector Errors / Unrecoverable Read Errors

I emailed S2D_Feedback@microsoft.com to ask about storage spaces direct and how it handles Latent Sector Errors (LSE), otherwise known as Unrecoverable Read Errors. Here is the email I sent:

My company is in the process of evaluating different options for upgrading our production server environment. I’m tasked with finding a solution that meets our needs and is within our budget.

I'm trying to compare and contrast Storage Spaces Direct with Storage Spaces utilizing JBOD enclosures. Data resiliency, integrity and availability are paramount, so I'm primarily looking at both of these technologies from that perspective. Thus, if we go the JBOD route, we're looking at implementing 3 enclosures and utilizing the enclosure awareness of Storage Spaces. This solution has existed longer than Storage Spaces Direct and I would think has been tested more thoroughly. I like the scalability and elegance of Storage Spaces Direct though. From a conceptual overview and a hardware setup perspective it just seems easier to grasp, and it seems like a better solution.

My question is, how do both of these setups handle unrecoverable read errors/latent sector errors? Does one solution handle them better than the other?


There are horror stories about hardware RAID controllers evicting drives because of URE/LSE and then during RAID rebuilds encountering additional UREs/LSEs and bricking the storage. This is more worrisome when SATA disks are used (due to UREs/LSEs occurring more often and sooner with SATA disks compared to SAS disks.) How does storage spaces/S2D differ in this regard? I know one of the selling points of S2D is the use of SATA disks. I’m curious as to how this problem has been addressed since SATA disks are being promoted. What happens if there is a URE/LSE in end user data? What happens if there is a URE/LSE in the metadata used by storage spaces/S2D or the underlying file system?

Here is the response I received:

Both Spaces direct and Shared Spaces (with JBOD)  both rely on the same software raid implementation, difference is in the connectivity.  Software raid implementation does not throw away the entire drive on failure, we trigger activity to move the data out of the drive while keeping the copy till data is moved (if we have copies available).  On Write failure we try to move the impacted range right away while background activity is moving the untouched data out of the disk,  some of the disks fail to write but they can continue to support reads in which case the data on those drives can still be used to serve user requests.  Until the data on the failed drive is rebuilt on spare capacity the drive is not removed, user can still force but not automated.  On URE - we trigger rebuilt to recover lost copy, this is triggered both when reads errors detected while satisfying user error or by back ground scrub process.  Back ground scrub process detects URE by validating sector level checksum across copies and  validating. 

So it would appear that if you utilize Storage Spaces you don't have to worry about an LSE/URE taking out a drive and then a subsequent LSE/URE taking out another drive, thus taking down your array.

Tuesday, July 14, 2015

IIS Application Initialization Quick Reference


  1. Install the Application Initialization module. On IIS 7.5 it's a separate download (or via the Web Platform Installer); it's included in IIS 8.
  2. Set startMode to AlwaysRunning on the application pool
    1. Open Configuration Editor in IIS Manager at the server level
    2. Select system.applicationHost/applicationPools
    3. Click edit items
    4. Find the app pool, select it, and change startMode in the lower pane to AlwaysRunning
    5. Hit Apply. This changes C:\Windows\System32\inetsrv\config\applicationHost.config

  3. Set applicationDefaults preloadEnabled on the site
    1. Open Configuration Editor in IIS Manager at the server level
    2. Select system.applicationHost/sites
    3. Click edit items
    4. Find the site and select it
    5. Expand applicationDefaults in the lower pane
    6. Change preloadEnabled to True
    7. Hit Apply. This changes C:\Windows\System32\inetsrv\config\applicationHost.config
    8. Note: this only changes the default for new applications, so you may still need to change existing ones (next step). You may not need this step if the applications already exist…

  4. Set preloadEnabled on the existing applications
    1. Open C:\Windows\System32\inetsrv\config\applicationHost.config
    2. Find the site you're looking for: <site name="www.domain.com" id=…
    3. Add preloadEnabled="true" to the first <application> element
    4. You should see an <applicationDefaults preloadEnabled="true" /> element under the site element if you performed step 3
    5. Restart IIS

  5. Set initializationPage and doAppInitAfterRestart on the site
    1. Open Configuration Editor in IIS Manager on the desired site
    2. Select system.webServer/applicationInitialization
    3. Change the "From" dropdown to ApplicationHost.config if you want the settings written as a location element in C:\Windows\System32\inetsrv\config\applicationHost.config; otherwise the change will go into the site's web.config
    4. Set doAppInitAfterRestart to True
    5. Click edit items on the collection
    6. Click Add
    7. Enter the path for initializationPage ( /folder/page?param=something ) and leave hostName blank (I think…)
    8. Click Apply


Finally, change the Idle Time-out in the application pool's Advanced Settings to 0 so the worker process never idles out.
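If you'd rather script it, the same settings can be applied with the WebAdministration module. This is only a sketch: "MyAppPool", "MySite" and the /warmup page are placeholder names, not anything from the steps above.

Import-Module WebAdministration

# Start the app pool automatically and never let it idle out
Set-ItemProperty "IIS:\AppPools\MyAppPool" -Name startMode -Value AlwaysRunning
Set-ItemProperty "IIS:\AppPools\MyAppPool" -Name processModel.idleTimeout -Value ([TimeSpan]::Zero)

# Preload the site's applications when the worker process starts
Set-ItemProperty "IIS:\Sites\MySite" -Name applicationDefaults.preloadEnabled -Value $true

# Warm-up settings (system.webServer/applicationInitialization), stored in applicationHost.config
Set-WebConfigurationProperty -PSPath "MACHINE/WEBROOT/APPHOST" -Location "MySite" -Filter "system.webServer/applicationInitialization" -Name doAppInitAfterRestart -Value $true
Add-WebConfigurationProperty -PSPath "MACHINE/WEBROOT/APPHOST" -Location "MySite" -Filter "system.webServer/applicationInitialization" -Name "." -Value @{initializationPage="/warmup"}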

Monday, July 13, 2015

DPM 2010 Slow when Selecting Recovery Point to Recover

It took almost an hour to select a date and time to recover in DPM 2010. It appeared that this was because of high CPU usage by SQL Server. The query that seemed to be responsible was:

SELECT Path, FileSpec, IsRecursive
    FROM tbl_RM_RecoverableObjectFileSpec
    WHERE RecoverableObjectId = @RecoverableObjectId AND 
          DatasetId = @DatasetId and
          IsGCed = 0

I didn't dig into things too much, but it appeared as though it was running this query for every single recovery point for the item selected and it was doing a clustered index scan for each recovery point. I created the following statistic and covering nonclustered index in the DPM db:

CREATE STATISTICS [_dta_stat_1042102753_9_2_3] ON [dbo].[tbl_RM_RecoverableObjectFileSpec]([IsGCed], [RecoverableObjectId], [DatasetId])

CREATE NONCLUSTERED INDEX [_dta_index_tbl_RM_RecoverableObjectFileSpec_7_1042102753__K2_K3_K9_5_6] ON [dbo].[tbl_RM_RecoverableObjectFileSpec] 
(
[RecoverableObjectId] ASC,
[DatasetId] ASC,
[IsGCed] ASC
)
INCLUDE ( [FileSpec],
[IsRecursive]) WITH (SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF) ON [PRIMARY]

Now the above query changed from a clustered index scan to an index seek with a key lookup. The time it took to select a recovery point went from about an hour down to about a minute.

Wednesday, January 28, 2015

Email When Scheduled Task Fails

1. Create a new scheduled task
2. On the actions tab, click new
3. Change the action to send an e-mail
4. Enter the to, from, subject, text and smtp server information
    For the subject and text, enter something along the lines of "A scheduled task failed on server such-and-such"
5. Click OK

6. On the triggers tab click new
7. Change the "Begin the task" drop down to "On an event"
8. Click "Custom" under settings
9. Click "New Event Filter"
10. Click the XML tab
11. Check Edit query manually
12. Enter the following XML for the query:

<QueryList>
  <Query Id="0" Path="Microsoft-Windows-TaskScheduler/Operational">
    <Select Path="Microsoft-Windows-TaskScheduler/Operational">*[System[Provider[@Name='Microsoft-Windows-TaskScheduler'] and (EventID=201) ]]
and
*[EventData[Data[@Name='ResultCode'] and (Data='1')]] </Select>
  </Query>
</QueryList>

Now anytime a scheduled task completes with a result code of 1, an email will go out letting you know. In our case 1 indicates an error; whatever task you have may return other codes to indicate errors, so you could change the filter to match Data > 0 instead.
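To test the filter before attaching it to the trigger, you can run essentially the same XPath query against the Task Scheduler operational log from PowerShell (assuming the operational log is enabled and already contains at least one failed task):

# Find task-completion events (ID 201) with ResultCode 1, the same condition the trigger uses
Get-WinEvent -LogName 'Microsoft-Windows-TaskScheduler/Operational' -FilterXPath "*[System[Provider[@Name='Microsoft-Windows-TaskScheduler'] and (EventID=201)]] and *[EventData[Data[@Name='ResultCode'] and (Data='1')]]" | Select-Object TimeCreated, Id, Message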