Tuesday, September 27, 2011

Shadow Redundancy and Server Outages

Exchange Server 2010 has a feature that tries to ensure that emails in transport cannot be lost. This feature is called Shadow Redundancy and lots of information on how it works can be found on the Internet.

But what happens if a mailbox server or site is unavailable? Items will queue in a single location, and now this location is a single point of failure. So whilst you have an outage (planned or otherwise), you increase the risk of loss of mail due to a second outage in transport that causes mail.que database corruption.

Let us examine the details by considering one type of outage; other types of problems can occur and generate the same potential results. If I dismount a database in an Active Directory site and then send an email to an Exchange 2010 mailbox in that site, the email will queue on an Exchange 2010 Hub Transport server in that site. The queue will be visible with Get-Queue and will go into the Retry state. Here is a picture showing the Exchange Management Shell output for one such site:

The first cmdlet shows one email queuing for a mailbox database (the one that is offline). The DeliveryType is MapiDelivery and the NextHopDomain (the next target) is the offline database.

image

The second cmdlet in the above picture shows the effect of a second email being sent. The queue now holds two messages, and both of these are on FAB-RED-HUB1. Should FAB-RED-HUB1 fail at this point and the mail.que database become corrupt due to this failure, these emails would be lost.

What you cannot see from the screenshot is the effect of Delayed Acknowledgement. When a Hub Transport server receives an email from an SMTP server that does not support Shadow Redundancy, it delays acknowledging the message for long enough to ensure that the message exists on two servers (that is, it has a shadow). In the above example this is not possible, as the inbound email comes from the internet directly into this Active Directory site, so there is nowhere else to send the email. Delayed Acknowledgement is set to 30 seconds by default, so on the arrival of the first message the sending server has its acknowledgement delayed by the full 30 seconds. On the arrival of the second message the delivery queue is already in Retry, and because DelayedAckSkippingEnabled is set to True by default, no delay is applied (if it would take over 30 seconds to deliver, or the target queue is in Retry, then don’t implement a delay, as the problem is likely to be present even after 30 seconds).

The problem here is that the first message was only protected for 30 seconds. Had the mailbox database (or other failure) been resolved within those 30 seconds, the delayed acknowledgement would have protected the message, because the previous hop would not yet have been told that it was queued. The second message (and all subsequent messages) are immediately added to the outbound queue, which is a single point of failure.
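The decision described above can be sketched as a small Python model. This is only my reading of the behaviour in this post, not Exchange code: the function name ack_delay_seconds and its parameters are hypothetical, and the real transport logic will have more inputs than this.

```python
def ack_delay_seconds(sender_supports_shadow, queue_in_retry, delivery_seconds,
                      max_delay=30, skipping_enabled=True):
    """Toy model: how long the SMTP acknowledgement is held back.

    delivery_seconds is the time taken to hand the message to the next hop,
    or None if it has not been delivered (for example, database offline).
    max_delay and skipping_enabled mirror the defaults quoted in the post.
    """
    if sender_supports_shadow:
        # The sending server keeps its own shadow copy, so no delay is needed.
        return 0
    if skipping_enabled and queue_in_retry:
        # DelayedAckSkippingEnabled: the queue is already known to be stuck.
        return 0
    if delivery_seconds is None or delivery_seconds > max_delay:
        # Nothing delivered in time: acknowledge after the full timeout.
        return max_delay
    return delivery_seconds


# First message: queue not yet in Retry, database offline, never delivered.
first = ack_delay_seconds(False, False, None)
# Second message: queue is now in Retry, so the delay is skipped.
second = ack_delay_seconds(False, True, None)
print(first, second)
```

The first message is held for the full 30 seconds and the second is acknowledged at once, which is the window of exposure discussed above.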

Service Pack 1 for Exchange 2010 adds Shadow Redundancy Promotion. This will ensure that the message lives on two transport servers within a site if the NextHopDomain is unavailable. But this is disabled by default.

To enable Shadow Redundancy Promotion, edit the EdgeTransport.exe.config file on all hub transport servers to read True for the ShadowRedundancyPromotionEnabled setting. Once the EdgeTransport.exe.config file is saved then restart the Microsoft Exchange Transport service on all servers. EdgeTransport.exe.config is found in \Program Files\Microsoft\Exchange Server\V14\bin.
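If you have many hub transport servers you may want to script the edit. The following Python sketch shows the idea on a sample string. It assumes the setting is stored as an add key/value entry under appSettings (the usual layout for .NET config files); the function name enable_promotion and the sample XML are hypothetical, and note that this parser drops XML comments, so work on a copy of the real file.

```python
import xml.etree.ElementTree as ET

SETTING = "ShadowRedundancyPromotionEnabled"


def enable_promotion(config_xml):
    """Return config_xml with the promotion setting set to True.

    Assumes <configuration><appSettings><add key=... value=... /> layout.
    """
    root = ET.fromstring(config_xml)
    app_settings = root.find("appSettings")
    if app_settings is None:
        app_settings = ET.SubElement(root, "appSettings")
    for entry in app_settings.findall("add"):
        if entry.get("key") == SETTING:
            entry.set("value", "True")
            break
    else:
        # The key was not present, so add it.
        ET.SubElement(app_settings, "add", {"key": SETTING, "value": "True"})
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample, not a copy of the real EdgeTransport.exe.config.
sample = """<configuration>
  <appSettings>
    <add key="ShadowRedundancyPromotionEnabled" value="False" />
  </appSettings>
</configuration>"""

updated = enable_promotion(sample)
print(updated)
```

The transport service restart described above is still needed after the file is changed.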

This second screenshot shows the effect of enabling Shadow Redundancy Promotion on all my hub transport servers and restarting the transport service on each machine. The screenshot follows on immediately from the above example.

image

In the above you can now see that the queue that did contain the messages to the mailbox database (FAB-RED-HUB1\216) is now empty and that there is a shadow queue containing the two messages on FAB-RED-HUB1 instead. FAB-RED-HUB2 (also in the same site) now holds the queue to the offline database. In the event of a transport server failure whilst the database is offline, there will not be a loss of email as the email can be redelivered from the other transport server.

Monday, September 19, 2011

How to Speed Up Hub Transport Server Selection

Install Exchange 2010 SP1!

Installing the service pack fixes the round-robin selection process for remote hub transport servers in other sites (see Hub Transport Load Balancing) so that only the IP addresses of operational servers are used.

Exchange 2010 runs on Windows 2008 (or 2008 R2) and this operating system supports IPv4 and IPv6. In fact it will provide an IPv6 address even if you don’t have an IPv6 router or infrastructure. But if a remote hub registers an IPv6 address in DNS then you might attempt to use the IPv6 address before any IPv4 address and fail to connect. Exchange 2010 SP1 will now remove the IPv6 address from the round-robin hub selection list, and so speed up transport in Exchange.

Formatting Get-ExchangeDiagnosticInfo

The last blog post for today looks at formatting the output of Get-ExchangeDiagnosticInfo, as the XML that this cmdlet returns can be quite long. For example, if you want to see whether your server is in backpressure you need to view the output of the ResourceManager component, but as this contains historical information on backpressure events it's more than enough data to overfill the Exchange Management Shell screen!

So rather than running Get-ExchangeDiagnosticInfo -Process EdgeTransport -Component ResourceManager -Argument verbose you can do the following:

[xml]$diag = Get-ExchangeDiagnosticInfo -Server xxx -Process EdgeTransport -Component ResourceManager -Argument verbose
$diag.Diagnostics.Components.ResourceManager.CurrentComponentStates
$diag.Diagnostics.Components.ResourceManager.ResourceMonitors.ResourceMonitor | FT -a Type, Resourc*,*Pressure*

This collects the ResourceManager component from Get-ExchangeDiagnosticInfo and places it in a variable called $diag. The second line returns the state of each backpressure monitor and whether it is enabled. The third line displays the current backpressure on your server (currentpressure) along with the values at which the state changes from low to medium (the mediumPressureLimit), from medium to high (the highPressureLimit) or from medium to low (the lowPressureLimit). For an example see the picture below, from my lab environment, which is currently busy doing nothing!

image
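To make the three limits concrete, here is a minimal Python sketch of the state changes as I describe them above. It is an illustration only: the function name next_state, the sample numbers and the exact comparisons (at or above, at or below) are assumptions, and since this post does not cover how a monitor leaves the high state the sketch simply leaves it unchanged.

```python
def next_state(state, pressure, low_limit, medium_limit, high_limit):
    """Return the new backpressure state for one resource monitor."""
    if state == "Low" and pressure >= medium_limit:
        return "Medium"   # low to medium at the mediumPressureLimit
    if state == "Medium" and pressure >= high_limit:
        return "High"     # medium to high at the highPressureLimit
    if state == "Medium" and pressure <= low_limit:
        return "Low"      # medium to low at the lowPressureLimit
    return state          # leaving "High" is not modelled here


# Hypothetical limits and readings, chosen only to exercise each transition.
LOW, MEDIUM, HIGH = 80, 90, 95
state = "Low"
history = []
for reading in [50, 92, 85, 78, 93, 97]:
    state = next_state(state, reading, LOW, MEDIUM, HIGH)
    history.append(state)
print(history)
```

Note the gap between the low and medium limits: a reading of 85 keeps the monitor in medium, so the state does not flap when the value hovers around a single threshold.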

Getting Exchange 2010 SP1 Diagnostics

New with Exchange Server 2010 SP1 is the Get-ExchangeDiagnosticInfo PowerShell cmdlet. This is not documented anywhere online, so I thought I would start a trend! 

Get-ExchangeDiagnosticInfo reports information on the status of Exchange Server as seen by individual processes. The information is returned as a blob of XML data, and in my next blog I will show how to process it into a more readable form.

At the time of writing Get-ExchangeDiagnosticInfo only reports information on the Mailbox Server role and the Hub Transport role, and only for the sending and receiving of email, so it's currently an exclusive cmdlet for Exchange Transport.

From an Exchange Management Shell window, start with just Get-ExchangeDiagnosticInfo to get a list of processes on the machine that can be queried. The Result value reports something like this:

<Diagnostics>
  <ProcessLocator>
    <count>2</count>
    <Process>
      <guid></guid>
      <id>2356</id>
      <name>MSExchangeMailSubmission</name>
    </Process>
    <Process>
      <guid></guid>
      <id>3408</id>
      <name>EdgeTransport</name>
    </Process>
  </ProcessLocator>
</Diagnostics>

Of most interest from this output are the two processes that you can get diagnostics from. These two are the Mail Submission Service and the Edge Transport process. Mail Submission’s job is to notify any Hub Transport role in the same Active Directory site that it has email waiting to be collected, and the EdgeTransport.exe process does all the work of collecting, processing and delivering emails onward.
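If you save the Result XML to a file or string, pulling out the process names is straightforward in any language with an XML parser. As a minimal sketch, here it is in Python using the sample output from this post; the function name list_processes is hypothetical, and I am assuming the real output always has the same Process/id/name layout as the sample.

```python
import xml.etree.ElementTree as ET

# The sample Result value shown in this post.
RESULT = """<Diagnostics>
  <ProcessLocator>
    <count>2</count>
    <Process><guid></guid><id>2356</id><name>MSExchangeMailSubmission</name></Process>
    <Process><guid></guid><id>3408</id><name>EdgeTransport</name></Process>
  </ProcessLocator>
</Diagnostics>"""


def list_processes(result_xml):
    """Return (name, process id) pairs for each Process element."""
    root = ET.fromstring(result_xml)
    return [(p.findtext("name"), int(p.findtext("id")))
            for p in root.iter("Process")]


processes = list_processes(RESULT)
for name, pid in processes:
    print(f"{name} (process id {pid})")
```

The names returned are the values you pass to the Process parameter.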





To report further information on these processes use the Process parameter with Get-ExchangeDiagnosticInfo:

Get-ExchangeDiagnosticInfo -Process EdgeTransport

or

Get-ExchangeDiagnosticInfo -Process MSExchangeMailSubmission

For this level of reporting you get even more XML returned, but the interesting data to look at now is the Components group. EdgeTransport returns TransportConfiguration, ResourceManager, TransportDumpster, RmsClientManager, ShadowRedundancy, SmtpOut and StoreDriver. The MSExchangeMailSubmission process allows the querying of the MailSubmission and RedundancyManager components.

Both of the above cmdlets also return the "Supported arguments" data (part of the Help XML blob). These values can be used on the command line with the -Argument parameter.

To query the diagnostics of an individual component use the following syntax:

Get-ExchangeDiagnosticInfo -Process EdgeTransport -Component SmtpOut -Argument help

The above, for example, shows the state of the process and that verbose is a supported argument. Other components have other arguments. For example, try verbose for SmtpOut:

Get-ExchangeDiagnosticInfo -Process EdgeTransport -Component SmtpOut -Argument verbose

This shows you which hub transport servers and smarthosts are operational and reachable, along with the Configuration data, which shows how often Exchange will check them to see if they are back online again.

Some components return a large amount of data. One example is the ResourceManager component, which returns the historical data for backpressure events on the server. Backpressure is when the server runs low on, or out of, resources and so throttles or rejects connections: first anonymous connections, and then, if resource utilisation does not recover, all connections. Once resource utilisation improves, connections are allowed again.

Hub Transport Load Balancing

In Exchange 2010 (not SP1) and Exchange 2007 there was no memory of unavailable transport servers and so the round robin method of load balancing across the hubs in the target delivery site or smarthosts used by connectors sourced to your current server was just that – round robin.

Though if a server was unavailable the next server in the list was selected and connected to and the first server in the list was moved to the end of the list of servers to use. This resulted in an uneven distribution of load when servers were offline. Imagine the scenario where you have three hub transports in the London Active Directory site (HL1, HL2 and HL3) which were installed in that order. A Hub Transport server in another AD Site will deliver up to 20 messages per connection and will make the connections in a round robin fashion. Therefore if HL1 is offline the connection will automatically be made to HL2. Upon completing the connection the first server in the list will be moved to the end of the list – in this example HL1 will move to the back of the list.

The next connection to the London site will use the list HL2, HL3, HL1 for delivery, and as HL2 is running will connect to HL2, deliver its email and move HL2 to the back of the list. The third connection will go to HL3. The fourth connection will attempt to reach HL1 and fail, so will deliver to HL2 and move HL1 to the back of the list.

The result of this is that HL2 receives about 66% of connections to HL3’s 33%, rather than the 50/50 distribution you would expect once one server is down. When all servers in the site are operational each receives 1/3 of connections, giving even load balancing.
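To check the 66/33 split, here is a small Python simulation of the selection behaviour exactly as described above: deliver to the first reachable server in the list, then move the server at the head of the list to the back. It is a model of my description, not of Exchange internals; the function name simulate_pre_sp1 is hypothetical and it assumes at least one server is online.

```python
from collections import Counter, deque


def simulate_pre_sp1(servers, offline, connections):
    """Count deliveries per server with no memory of failed servers."""
    order = deque(servers)
    delivered = Counter()
    for _ in range(connections):
        # Walk the list from the head until a reachable server is found.
        target = next(s for s in order if s not in offline)
        delivered[target] += 1
        # The head of the list moves to the back, even if it was offline.
        order.rotate(-1)
    return delivered


one_down = simulate_pre_sp1(["HL1", "HL2", "HL3"], {"HL1"}, 300)
all_up = simulate_pre_sp1(["HL1", "HL2", "HL3"], set(), 300)
print(dict(one_down))
print(dict(all_up))
```

With HL1 offline HL2 takes 200 of the 300 connections and HL3 takes 100; with all three online each takes 100.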

Exchange 2010 SP1 records downed servers in a separate list, which it attempts to connect to on a separate schedule (unrelated to email delivery). Taking the above example again, with HL1 offline and an Exchange 2010 SP1 source server, the source will fail to connect to HL1, deliver to HL2, move HL2 to the bottom of the list and remove HL1 from the list of available servers. HL2 and HL3 will therefore get 50% of connections each, with no overloading of the next hub in install order.

The source Exchange 2010 SP1 server will maintain this list of unavailable servers and will attempt to connect to each unavailable server regularly. It does this once a minute for four minutes (the QueueGlitchRetryCount and QueueGlitchRetryInterval values), then changes to the TransientFailureRetryCount and TransientFailureRetryInterval values, which is six times, once every five minutes. After about 35 minutes of going through the Glitch and Transient retry intervals, Exchange will only attempt to connect once every 10 minutes (the OutboundConnectionFailureRetryInterval value) or 15 minutes if on an Edge Transport server.
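The SP1 behaviour can be modelled the same way. This Python sketch follows the description above: an unreachable server is taken out of the rotation and parked on a separate list, and the remaining servers are used in turn. The function name simulate_sp1 is hypothetical, and the model ignores the retry process that later brings a server back.

```python
from collections import Counter, deque


def simulate_sp1(servers, offline, connections):
    """Count deliveries per server when failed servers leave the rotation."""
    available = deque(servers)
    unavailable = []
    delivered = Counter()
    for _ in range(connections):
        # Park any unreachable server found at the head of the list.
        while available and available[0] in offline:
            unavailable.append(available.popleft())
        if not available:
            break  # nothing reachable, the mail would stay queued
        target = available[0]
        delivered[target] += 1
        # The server just used moves to the bottom of the list.
        available.rotate(-1)
    return delivered, unavailable


sp1_counts, sp1_unavailable = simulate_sp1(["HL1", "HL2", "HL3"], {"HL1"}, 300)
print(dict(sp1_counts), sp1_unavailable)
```

HL2 and HL3 each take 150 of the 300 connections, and HL1 ends up on the unavailable list.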

Once the server is online again it is added back into the round-robin load-balancing list for connections to remote sites or smarthost endpoints. This does mean though that if a server is offline for more than 35 minutes it will be up to 10 minutes before Exchange 2010 SP1 attempts to connect to it for transport and email delivery.
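The retry timings are easier to see laid out as a schedule. This Python sketch lists the minutes after the first failure at which a retry would happen, using the hub transport counts and intervals quoted above. The function name retry_offsets is hypothetical, and I am assuming each phase starts straight after the previous one; on that reading the last transient retry lands at 34 minutes, which is the roughly 35 minutes mentioned above.

```python
def retry_offsets(glitch_count=4, glitch_interval=1,
                  transient_count=6, transient_interval=5,
                  steady_interval=10, horizon=60):
    """Return retry times in minutes after the first failed connection."""
    offsets = []
    minute = 0
    for _ in range(glitch_count):
        minute += glitch_interval        # QueueGlitchRetryInterval
        offsets.append(minute)
    for _ in range(transient_count):
        minute += transient_interval     # TransientFailureRetryInterval
        offsets.append(minute)
    while minute + steady_interval <= horizon:
        minute += steady_interval        # OutboundConnectionFailureRetryInterval
        offsets.append(minute)
    return offsets


offsets = retry_offsets()
print(offsets)
```

A server that comes back at, say, minute 45 would not be tried again until minute 54, which is the wait of up to 10 minutes described above.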

To see which servers are on your unavailable list, run Get-ExchangeDiagnosticInfo -Process EdgeTransport -Component SmtpOut -Argument verbose. The Get-ExchangeDiagnosticInfo cmdlet is covered further in my next blog today.

Monday, September 05, 2011

How to Clear Password Policy on workstation after removing it from domain

I needed to set up a few machines for a client in an internet cafe type scenario, but the client provided me with computers that had been added to the domain. The domain had a password requirement which meant I could not configure the default login on the cafe machines to have no password.
To reset the domain policy without adding the computer back into the domain and temporarily changing the policy there, you can reset the security settings on an XP computer to their install-time defaults using the following command:
 secedit /configure /cfg %windir%\repair\secsetup.inf /db secsetup.sdb /verbose
This resets lots of settings back to the default installation configuration, but meant that I did not need to reinstall the operating system.
For full details on the limitations of the above command see http://support.microsoft.com/default.aspx?scid=kb;en-us;313222