Post | John Louros

Leveraging Multi-Subnet failover

2015-12-04 SQL Server 02:02 John Louros Leveraging Multi-Subnet failover Basic understanding of how MultiSubnetFailover can be used to reduce the down time of a database failure.

MSDN has a ton of information about this topic, from database setup to SqlClient configuration, however if you are fairly new to this topic you might get overwhelmed with all the available information. My goal for this blog post is to simply the idea behind this concept so anybody can understand the basics. Just keep in mind that the Network setup can be far more complex, but the fundamentals will be the same.

Starting with the definition, MultiSubnetFailover is a .Net framework feature for [System.Data].SqlClient (SQL Server data provider) that enhances how SqlClient interacts with an AlwaysOn Availability Group, providing faster detection of and connection to the (currently) active server.

In order to keep things simple, the example I'm going to use consists of 1 Web Application server, 1 DNS server, and 2 databases with data replication setup between then (DataSync on the diagram). In this case, it's assumed the Web Application can pick any of the available databases, given that data replication will ensure data inserted, updated or deleted to one database, is properly reflected to the other. There's way more to database replication than this, for now just keep in mind that doesn't matter what database the web application picks. Diagram representation below: Multi-subnet-failover test environment diagram On the DNS server, a new hostname will be defined 'msfqamockdb' that will point to '10.200.200.89' (msfm-db1) and '10.200.200.239' (msfm-db2). Doing 'nslookup' on 'msfqamock' should return the following: Multi-subnet-failover environment, nslookup If both servers are available, no problem, the web application can pick whatever it's preferred. If I'm not wrong, it will pick the first result of the list of IPs assigned to the given hostname. For the sake of argument, let's say '10.200.200.89' database is picked. While the database is up and running, everything will be OK, but imagine that someone unexpectedly pulls the plug on that database. By default (with MultiSubnetFailover disabled) the SqlClient will get the list of IP address and serially tries to re-connect to each of them. Each connection attempt timeout is 21 seconds (by default) and it will be performed serially in order. With MultiSubnetFailover the connection attempts will be performed in parallel (as close to parallel as possible) without waiting for TCP ACK (acknowledgment), the first server to respond will be picked to establish the connection, resulting in much faster reconnect times.

MultiSubnetFailover has a set of limitations that you must be aware:

Hostnames with more than 64 IP addresses is not supported
Only supported using TCP protocol
Doesn't support SQL named instances
Connecting to a mirrored SQL Server instance won't work

Additionally, as a best practice, try to keep your SQL operations as atomic as possible and avoid leaving SQL connections open for a long time, mostly because MultiSubnetFailover will only do his magic during [System.Data.SqlClient].SqlConnection.Open(). Also, do not except 100% availability. Even performing the simplest query, if the database goes down while a connection is open, you will get an SQL exception.

To enable this feature, simply append 'MultisubnetFailover=true' to a connection string (usually found in either 'App.config' or 'Web.config'). As mentioned before, connecting to a mirrored SQL Server instance won't work, so make sure the connection string does not have 'Failover Partner' set. If you are using .Net framework 4.6.1, you don't need to do anything, since SqlClient transparently detects whether your application is connecting to an AlwaysOn Availability Group. You can read more about it here

If you need proof that this feature properly works, feel free to download my test application from my GitHub. It contains a .Net 4.5 and .Net 4.6.1 versions, but the use exactly the same code. The goal was to test how MultiSubnetFailover behaves in both framework versions.

To know more about this features, here's a couple of useful references: