Saturday, January 9, 2021

Disabling Legacy Authentication in Exchange Server 2019 with a DAG

If you have a DAG and you're using a frontend load balancer or DNS round robin, disabling legacy auth at the organization level will not work and will cause everything to go haywire. Be careful disabling legacy auth at the org level if you have a cluster. 


The article Disabling Legacy Authentication in Exchange Server 2019 walks you through disabling legacy authentication in your on premise exchange environment. Similar steps can be found on docs.microsoft.com.


It would appear that the steps are all well and good if you're only running only one on premise exchange server. If you're running a DAG cluster with multiple servers handling requests then things don't go as planned. If you disabled legacy authentication at the organization level it is my experience that authentication will only work if you're hitting the node in the cluster that contains the database that the user's mailbox resides on. If you hit the other node you'll get 401 authentication errors if you're watching the traffic in fiddler. You will also get some duplicate key exceptions in the event logs.  Existing outlook clients my flash the authentication dialog on the screen randomly, so quick you can't even really see it. If you try to create new profiles and you're hitting the wrong node then it won't work.


The issue seems to be that when requests are proxied between nodes then legacy authentication needs to be enabled. When you disable legacy auth at the org level it also disables legacy auth for the proxying connection. The solution is to enable legacy auth on the computer accounts of the exchange servers.

The way I did this was I created a authentication policy called "Allow Legacy Auth" which is the exact opposite of "Block Legacy Auth" in the articles above.

New-AuthenticationPolicy -Name "Allow Legacy Auth"

I then manually set msExchAuthPolicyLink on the exchange server computer objects in AD to use this new "Allow Legacy Auth" authentication policy. The way I did this was I set the authentication policy on a test user to the new "Allow Legacy Auth" policy and then I copied the value of the msExchAuthPolicyLink attribute from that user; pasting the value into the msExchAuthPolicyLink attribute on the exchange server computer object. It should look something like this:

CN=Allow Legacy Auth,CN=Auth Policies,CN=DOMAINNAME,CN=Microsoft Exchange,CN=Services,CN=Configuration,DC=DOMAINNAME,DC=com

You can do this with ADSIEDIT or AD users and computers snap in. Be sure to restart IIS on all your nodes after doing this.

Now when I disable legacy auth at the org level things actually work.

It doesn't appear as though Microsoft fully tested disabling legacy auth with clusters. The feature needs more work, so implement this solution at your own risk. Even after enabling legacy auth on the exchange computer accounts you may still get a bunch of unhandled exceptions in the event log about duplicate keys like the following:

Log Name:      Application
Source:        MSExchange Front End HTTP Proxy
Event ID:      1003
Task Category: Core
Level:         Error
Keywords:      Classic
User:          N/A
Description:
[Ews] An internal server error occurred. The unhandled exception was: Microsoft.Exchange.Collections.TimeoutCache.DuplicateKeyException: Cannot add a duplicate key.  Use Insert instead
   at Microsoft.Exchange.Security.Authentication.FederatedAuthService.CacheReader.AddEntry(String userKey, Int32 userPolicy, ConfigWrapper config)
   at Microsoft.Exchange.Security.Authentication.FederatedAuthService.BasicAuthPolicyRepo.GetUserPolicy(String userKey, Int32 traceId, Int32& userPolicy, HttpApplication httpApplication, IRecipientSession recipientSession, IConfigurationSession configSession, ConfigWrapper config)
   at Microsoft.Exchange.Security.Authentication.FederatedAuthService.BasicAuthPolicyEvaluator.IsBasicAuthAllowed(String userKey, String protocolName, Int32 traceId, HttpApplication httpApplication, IRecipientSession recipientSession, IConfigurationSession configSession, ConfigWrapper config)
   at Microsoft.Exchange.HttpProxy.ProxyModule.IsLegacyAuthAllowed(HttpApplication httpApplication)
   at Microsoft.Exchange.HttpProxy.ProxyModule.OnPostAuthenticateInternal(HttpApplication httpApplication)
   at Microsoft.Exchange.Common.IL.ILUtil.DoTryFilterCatch(Action tryDelegate, Func`2 filterDelegate, Action`1 catchDelegate)
Log Name:      Application
Source:        ASP.NET 4.0.30319.0
Event ID:      1309
Task Category: Web Event
Level:         Warning
Keywords:      Classic
User:          N/A
Description:
Event code: 3005 
Event message: An unhandled exception has occurred. 
Event ID: f227e48f4d474111873957cee43f41ff 
Event sequence: 2 
Event occurrence: 1 
Event detail code: 0 
 
Application information: 
    Application domain: /LM/W3SVC/2/ROOT/EWS-2-132518667113847979 
    Trust level: Full 
    Application Virtual Path: /EWS 
    Application Path: C:\Program Files\Microsoft\Exchange Server\V15\ClientAccess\exchweb\EWS\ 
 
Process information: 
    Process ID: 12016 
    Process name: w3wp.exe 
    Account name: NT AUTHORITY\SYSTEM 
 
Exception information: 
    Exception type: DuplicateKeyException 
    Exception message: Cannot add a duplicate key.  Use Insert instead
Log Name:      Application
Source:        MSExchange Autodiscover
Event ID:      1
Task Category: Web
Level:         Error
Keywords:      Classic
User:          N/A
Description:
Unhandled Exception "Cannot add a duplicate key.  Use Insert instead"
Stack trace:    at Microsoft.Exchange.Security.Authentication.FederatedAuthService.CacheReader.AddEntry(String userKey, Int32 userPolicy, ConfigWrapper config)
   at Microsoft.Exchange.Security.Authentication.FederatedAuthService.BasicAuthPolicyRepo.GetUserPolicy(String userKey, Int32 traceId, Int32& userPolicy, HttpApplication httpApplication, IRecipientSession recipientSession, IConfigurationSession configSession, ConfigWrapper config)
   at Microsoft.Exchange.Security.Authentication.FederatedAuthService.BasicAuthPolicyEvaluator.IsBasicAuthAllowed(String userKey, String protocolName, Int32 traceId, HttpApplication httpApplication, IRecipientSession recipientSession, IConfigurationSession configSession, ConfigWrapper config)
   at Microsoft.Exchange.Security.Authentication.BackendRehydrationModule.IsLegacyAuthAllowed(HttpContext httpContext)
   at Microsoft.Exchange.Security.Authentication.BackendRehydrationModule.OnAuthenticateRequest(Object source, EventArgs args)
   at System.Web.HttpApplication.SyncEventExecutionStep.System.Web.HttpApplication.IExecutionStep.Execute()
   at System.Web.HttpApplication.ExecuteStepImpl(IExecutionStep step)
   at System.Web.HttpApplication.ExecuteStep(IExecutionStep step, Boolean& completedSynchronously)

Microsoft needs to fix this. Unhandled exceptions are not good. They will cause the app pools to recycle. Microsoft support's recommended fix is to move your mailboxes to the cloud 😒


Thursday, December 3, 2020

Azure Application Proxy Woes / Searching Azure AD for SPNs

I was trying to setup Azure Application Proxy so that it could I could publish my onprem Exchange OWA access through it and I was running into problems. Every time I tried to create the AAP on-premise application it would error and not create it. The error was not descriptive; it just said "Failed to create on premises application". This status would go into the notification but there was not more info. 

I messed around and determined that it didn't like the External URL that I was using. If I used a different External URL it would work. Just not the one I wanted to use. I was confused as I did not think I was using that URL anywhere.

After repeatedly clicking "Add" (and failing) a bunch of times, with the External URL I wanted, I checked the notifications and one of them had a hyperlink that provided a little more info as to what was going on:

{"errorCode":"Request_BadRequest","localizedErrorDetails":{"errorDetail":"A conflicting object with one or more of the specified property values is present in the directory."},"operationResults":null,"timeStampUtc":"","clientRequestId":"","internalTransactionId":"","tenantId":"","userObjectId":"","exceptionType":"AADGraphException"}

This provided a little more insight. I didn't think I had anything using that URL though.

Since this looked to be directory related, I checked the Azure Active Directory Audit Logs. Under there I was seeing Update Service Principal Failure and Update Application Failure. 

So I decided to search Azure AD's Service Principal Names for the URL that I was trying to use and sure enough, that URL was registered. I removed that SPN and I was then able to add the Azure Application Proxy app with the External URL I wanted. Turns out that when I was setting up SPNs for Hybrid Modern Authentication (HMA) I added an SPN that I didn't mean to.

If you're having a similar issue and you want to search your SPNs here is the PowerShell to do so:

First connect:
Connect-AzureAD

If you want to see all the data in a grid view:
Get-AzureADServicePrincipal | select * | Out-GridView

If you want to get a list of SPNs with the corresponding ObjectId and DisplayName:
Get-AzureADServicePrincipal | select -Property ObjectID, DisplayName -ExpandProperty ServicePrincipalNames | select @{Name='SPN';Expression={$_}}, ObjectID, DisplayName

You can then search this output for the URL you're looking for or you can search the SPNs with the following and return only the SPNs you're looking for:
Get-AzureADServicePrincipal | select -Property ObjectID, DisplayName -ExpandProperty ServicePrincipalNames | select @{Name='SPN';Expression={$_}}, ObjectID, DisplayName | where SPN -Like '*mail*'

Tuesday, October 9, 2018

Insightful WMI Snippets


Get all the namespaces
function Get-CimNamespaces ($ns="root") {
   Get-CimInstance -ClassName __NAMESPACE -Namespace $ns |
   foreach {
       Get-CimNamespaces $("$ns\" + $_.Name)
   }
   Get-CimInstance -ClassName __NAMESPACE -Namespace $ns
}
Get-CimNamespaces 

Get-CimNamespaces | select Name, @{n='NameSpace';e={$_.CimSystemProperties.NameSpace}}, CimClass

View all the namespaces in a hierachy
function View-CimNamespacesHierarchy ($ns="root", $c="|") {
   Get-CimInstance -ClassName __NAMESPACE -Namespace $ns |
   foreach {
       Write-Host $c $_.Name
       View-CimNamespacesHierarchy $("$ns\" + $_.Name) $c"-"
   }
}
View-CimNamespacesHierarchy


Get all the WMI providers on the system
function Get-CimProvider ($ns="root") {
   Get-CimInstance -ClassName __NAMESPACE -Namespace $ns |
   foreach {
       Get-CimProvider $("$ns\" + $_.Name)
   }
   Get-CimInstance -NameSpace $ns -Class __Win32Provider
}
Get-CimProvider

Get-CimProvider | select Name, @{n='NameSpace';e={$_.CimSystemProperties.NameSpace}}, CimClass

Get-CimProvider | Measure-Object
=> 188

Get all the provider registrations. A provider can be registered multiple times w/ different registration types.
function Get-CimProviderRegistration ($ns="root") {
   Get-CimInstance -ClassName __NAMESPACE -Namespace $ns |
   foreach {
       Get-CimProviderRegistration $("$ns\" + $_.Name)
   }
   Get-CimInstance -NameSpace $ns -Class __ProviderRegistration
}
Get-CimProviderRegistration

Get-CimProviderRegistration | Measure-Object
=> 303

Get-CimProviderRegistration | Group-Object {$_.CimSystemProperties.ClassName} | select count, name

=>
Count Name                              
----- ----                              
   11 __EventConsumerProviderRegistration
  135 __InstanceProviderRegistration    
    2 __PropertyProviderRegistration    
  115 __MethodProviderRegistration      
   34 __EventProviderRegistration       
    6 __ClassProviderRegistration       

Wednesday, September 19, 2018

WMI root\cimv2 Hierarchy Visualization

Update: I made a mobile friendly version. You can find it at http://www.kreel.com/wmi. If you add it to your home screen and then open it, it should open full screen and give you more real-estate (at least on the iphone.)

While digging through the room\cimv2 WMI namespace I wanted to visually see of all the parent child relationships. So I put together a quick visualization.

You can see it here: http://www.kreel.com/wmi_hierarchy.html



It's just a tree starting with all the classes that do not inherent from any parent. I created a "no parent" node that all the classes without a parent fall under. This just made it simpler/quicker to get the visualization done.

The page is not mobile optimized.

Pan around with the mouse.

Zoom in and out with the mouse's scroll wheel

Search for a specific WMI class in the root\cimv2 namespace. The exact name is needed for the search to work. Enter the name, click "go" and it will take you to the class in the visualization. The search does not work with partial names. It is case-insensitive though and there is an autocomplete that should get you the class you're looking for. (The auto-complete population does do partial names.)

The code isn't the prettiest, I put it together really quick. There is much to be improved. I thought about modifying it to show all the associations between the classes...

Tuesday, July 31, 2018

Hierarchical View of S2D Storage Fault Domains

You can view your storage fault domains by running the following command:
Get-StorageFaultDomain

The problem is that this is just a flat view of everything in your S2D cluster.

If you want to view specific fault domains you can use the -Type parameter on Get-StorageFaultDomain.

For example, to view all the nodes you can run:
Get-StorageFaultDomain -Type StorageScaleUnit

The options for the type parameter are as follows:

StorageSite
StorageRack
StorageChassis
StorageScaleUnit
StorageEnclosure
PhysicalDisk

This is in order from broadest to most specific. Most are self-explanatory.

  • Site represents different physical locations/datacenters.
  • Rack represents different racks in the datacenter.
  • Chassis isn't obvious at first. It's only used if you have blade servers and it represents the chassis that all the blade servers go into.
  • ScaleUnit is your node or your server.
  • Enclosure is if you have multiple backplanes or storage daisy chained to your server.
  • Disk is each physical the disk.

The default fault domain awareness is the storage scale unit. So data is distributed and made resilient across nodes. If you have multiple blade enclosures, racks or datacenters you can change this so that you can withstand the failure of any one of those things.

I'm not sure if you can or would want to change the fault domain awareness to StorageEnclosure or PhysicalDisk?

These fault domains are hierarchical. Disks belongs to an enclosure, an enclosure belongs to node (StorageScaleUnit), a node belongs to a chassis, a chassis belongs to rack and a racks belongs to site.

Since most people are just using node fault domains I made the following script to show your fault domains in a hierarchical layout beneath StorageScaleUnit. The operational status for each fault domain is included.

Get-StorageFaultDomain -Type StorageScaleUnit | %{Write-Host $Tab $Tab $_.FriendlyName - $_.OperationalStatus;$_ | Get-StorageFaultDomain -Type StorageEnclosure | %{Write-Host $Tab $Tab $Tab $Tab  $_.UniqueID - $_.FriendlyName  - $_.OperationalStatus;$_ | Get-StorageFaultDomain -Type PhysicalDisk | %{ Write-Host $Tab $Tab $Tab $Tab $Tab $Tab $_.SerialNumber - $_.FriendlyName - $_.PhysicalLocation - $_.OperationalStatus} } }

You could easily modify this to add another level if you utilize chassis, rack or site fault domain awareness.

Tuesday, July 10, 2018

The Proper Way to Take a Storage Spaces Direct Server Offline for Maintenance?

Way back in September 2017 a Microsoft update changed the behavior of what happened when you suspended(paused)/resumed a node from a S2D cluster. I don't think the article "Taking a Storage Spaces Direct Server offline for maintenance" has been updated to reflect this change?

Previous to the September 2017 update, when you suspended a node, either view powershell Suspend-ClusterNode or via the Failover Cluster Manager GUI Pause option, the operation would put all the disks on that node in maintenance mode. Then when you resumed the node, the resume operation would take the disks out of maintenance mode.

The current suspend/resume logic does nothing w/ the disks. If you suspend a node it's disks don't go into maintenance mode and if you resume the node nothing is done to the disks.

I postulate what you need to do after you suspend/drain the node and prior to shutting it down or restarting it is put the disks for that node into maintenance mode. This can be done with the following powershell command:

Get-StorageFaultDomain -Type StorageScaleUnit | Where-Object {$_.FriendlyName -eq "node1"} | Enable-StorageMaintenanceMode

Be sure to change "node1" to the name of the node you've suspended in the above powershell snippet.

When the node is powered on/rebooted, prior to resuming the node, you need to take the disks for that node out of maintenance mode. Which can be done with the following command:

Get-StorageFaultDomain -Type StorageScaleUnit | Where-Object {$_.FriendlyName -eq "node1"} | Disable-StorageMaintenanceMode

The reason I'm thinking this should be done is that if you just reboot the node without putting the disk in maintenance mode then the cluster will behave as if it lost that node. Things will recover eventually but timeouts may occur and if you're system is extremely busy with IO bad things could happen (VMs rebooting, CSVs moving, etc.) I'm thinking it's better to put the disks for the node in maintenance mode so all the other nodes know what's going on and the recovery logic doesn't need to kick in. Think of it this way, it's better to tell all the nodes what's going on then to make them have to figure out what's going on... I need to test this theory some more...

Update: It looks like the May 8th 2018 update (KB103723) "introduced SMB Resilient Handles for the S2D intra-cluster network to improve resiliency to transient network failures. This had some side effects in increased timeouts when a node is rebooted, which can effect a system under load. Symptoms include event ID 5120’s with a status code of STATUS_IO_TIMEOUT or STATUS_CONNECTION_DISCONNECTED when a node is rebooted." The above procedure is the work around until a fix is available.

Another symptom, at least for me, was connections to a guest SQL cluster timing out. When a node was rebooted prior to the May update everything was fine. After applying the May update and rebooting a node, SQL time outs would occur.

Update: MS released an article and for the time being you should put the disks in maintenance mode prior to rebooting it would seem. Also you might want to disable live dumps it would appear.

Update: Use $Env:ComputerName and run on the computer you want to perform maintenance on instead of specifying the node name:

Get-StorageFaultDomain -type StorageScaleUnit | Where-Object {$_.FriendlyName -eq "$($Env:ComputerName)"}
| Enable-StorageMaintenanceMode
| Disable-StorageMaintenanceMode


Friday, June 29, 2018

View Physical Disks by Node in S2D Cluster

Quick snippets to view physical disks by node in S2D cluster:

Get-StorageNode |%{$_.Name;$_ | Get-PhysicalDisk -PhysicallyConnected}

The following is useful if you're looking at performance monitor and you're trying to figure out which device number is which:

gwmi -Namespace root\wmi ClusPortDeviceInformation | sort ConnectedNode,ConnectedNodeDeviceNumber,ProductId | ft ConnectedNode,ConnectedNodeDeviceNumber,ProductId,SerialNumber