[QUESTION] Expected behavior of parallel Strategy for MultiSubnetFailover option #1552

Open
ml-rex opened this issue Jul 11, 2023 · 5 comments
Labels
Q&A For non-issues. General Q&A

@ml-rex
ml-rex commented Jul 11, 2023

Question

I have an application that connects to a SQL Server Multi-Subnet Cluster with two subnets (primary and DR subnets).
With this setup, the DR subnet is offline while the primary is the active one. We have also set up the Availability Group Listener with a DNS record that round-robins between the two subnet IP addresses. The application uses TypeORM with the mssql driver, which uses tedious.

https://learn.microsoft.com/en-us/sql/sql-server/failover-clusters/windows/sql-server-multi-subnet-clustering-sql-server?view=sql-server-ver16

As suggested in the link, we added the multiSubnetFailover: true option to the connection config, and we expected tedious to create connections only to the database nodes in the active primary subnet. However, we sometimes receive the error message: "ConnectionError: Connection lost - read ECONNRESET".

After some investigation, we saw a pattern: tedious was connecting to the offline DR IP whenever this error occurred. This is not what I expected, since a connection to the offline IP is supposed to fail the pool validation check and should never be added to the connection pool.

Stepping through the tedious source code with a debugger, I can confirm that the ParallelConnectionStrategy is used when the multiSubnetFailover option is provided. Apparently the TCP connection to the offline IP is established successfully, but the connection emits an error later on. I added some console logging to visualize what happened:

ParallelConnectionStrategy Addresses: [{"address":"DR_IP", "family": 4}, {"address": "PRIMARY_IP", "family": 4}]
Connecting {"address": "DR_IP", "family": 4}
Connecting {"address": "PRIMARY_IP", "family": 4}
onConnect: "DR_IP"
Sending Pre Login: "DR_IP"
onError: "DR_IP"
socketError this.state: {"name": "SentPrelogin", "events": {}} socketError error: {"errno":-104, "code": "ECONNRESET", "syscall": "read" }
2023-06-10 13:23:53 TypeormDatabaseLogger info: warn
"MSSQL pool raised an error. ConnectionError: Connection lost - read ECONNRESET"
2023-06-10 13:23:53 TypeormDatabaseLogger error: QueryFailedError: ConnectionError: Connection lost - read ECONNRESET
"trace": {
  "Query": "select 1"
}

My questions are:

  1. Am I missing some config options?
  2. When multiSubnetFailover is set to true, does tedious check and reject an offline subnet IP?
  3. If not, do all applications using tedious need to implement this check themselves?

Versions:
Typeorm: 0.3.12
Mssql: 7.3.0
Tedious: ^11.4.0

Config

{
    name: 'MY_DB_CONN',
    type: 'mssql',
    host: config.hostUrl,
    port: config.port,
    username: config.username,
    password: config.password,
    database: config.database,
    keepConnectionAlive: false,
    requestTimeout: 3 * 15000,
    pool: {
      min: 1,
      max: 10,
    },
    options: {
      encrypt: true,
      trustServerCertificate: true,
      multiSubnetFailover: true,
      useUTC: true,
    },
}

Relevant Issues and Pull Requests

@ml-rex ml-rex added the Q&A For non-issues. General Q&A label Jul 11, 2023
@Malcolm-Stewart
Malcolm-Stewart commented Jul 11, 2023

I was using a different driver, but the symptom sounds similar.

I have seen this issue when a smart device, such as F5 or similar, answers the SYN packet going to the secondary subnet. In this case, the driver will get fooled as to which subnet connection has the active node and will attempt to connect to the inactive node. However, once the PreLogin packet is emitted, the device tries to contact the back-end database and fails.

The case I had was intermittent and the F5 device was configured to detect SYN attacks and once the number of SYN packets in the inactive subnet reached a certain threshold, it would start answering them, and then, later, it would stop answering them for a while.

I was able to replicate it with TELNET and using the inactive IP address. For a few minutes, it would die a normal death and then for another few minutes, it would open up as if it was connected to the back-end. You can see the response packets in a network trace.
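
The TELNET check described above can also be scripted. Below is a minimal sketch using Node's built-in net module; probeTcp is a hypothetical helper name, and the 2000 ms default timeout is an arbitrary assumption. A completed handshake only shows that something answered the SYN, not that SQL Server is behind the address.

```javascript
const net = require('net');

// Hypothetical helper: try to open a TCP connection and report whether the
// handshake completed. A completed handshake only proves that *something*
// answered the SYN; it does not prove that SQL Server is behind that address.
function probeTcp(host, port, timeoutMs = 2000) {
  return new Promise((resolve) => {
    const socket = net.connect({ host, port });
    let settled = false;

    // Resolve exactly once and always release the socket.
    const finish = (result) => {
      if (settled) return;
      settled = true;
      socket.destroy();
      resolve(result);
    };

    socket.setTimeout(timeoutMs);
    socket.once('connect', () => finish({ host, port, open: true }));
    socket.once('timeout', () => finish({ host, port, open: false, reason: 'timeout' }));
    socket.on('error', (err) => finish({ host, port, open: false, reason: err.code }));
  });
}

// Usage (placeholder address, 1433 assumed as the SQL Server port):
// probeTcp('DR_IP', 1433).then(console.log);
```

Running the probe repeatedly against the inactive IP over several minutes would show whether the address alternates between refusing and accepting connections.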

@MichaelSun90
Contributor

Hi @ml-rex, thanks for raising this and for the detailed explanation. I will do my best to answer your questions:

Am I missing some config options?
I do not think there is any additional config related to this.

When multiSubnetFailover is set to true, does tedious check and reject an offline subnet IP?
Currently, when multiSubnetFailover is set to true, tedious opens connections in parallel to all the addresses returned by dns.lookup. From the behavior you described, it seems that dns.lookup returns every address associated with the host, regardless of its status. I tried but failed to find anything concrete that explains whether an IP's status affects the addresses this function returns. On the tedious side, the logic tries to connect to all the returned addresses; the connection to the offline IP then fails, hence the socket error.

If not, do all applications using tedious need to implement this check themselves?
Unfortunately, the current tedious logic can only fail a connection after trying to establish it, and rejects it if there is a socket error. We can definitely investigate whether it is possible to filter out IP addresses by their status and simplify this process.

Hi @arthurschreiber, am I correct that dns.lookup returns all the addresses regardless of their online/offline status? Are you aware of any way to look up the addresses but filter out the offline IPs?
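
For reference, here is a minimal sketch of the lookup in question, using Node's built-in dns module with the all: true option. resolveAll is a hypothetical helper name and 'localhost' stands in for the listener host name. As far as I know, the lookup only resolves names and makes no connection attempt, so the result by itself cannot tell online addresses from offline ones.

```javascript
const dns = require('dns');

// Hypothetical helper: resolve every address record for a host name.
// With `all: true`, dns.promises.lookup returns an array of
// { address, family } objects instead of a single address.
// No connection is attempted here, so an address appearing in the list
// says nothing about whether a server answers on it.
async function resolveAll(hostname) {
  return dns.promises.lookup(hostname, { all: true });
}

// Usage: 'localhost' is a stand-in for the availability group listener name.
resolveAll('localhost').then((addresses) => {
  for (const { address, family } of addresses) {
    console.log(`IPv${family} ${address}`);
  }
});
```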

@ml-rex
Author

ml-rex commented Jul 12, 2023

Thanks for sharing.
From what you mentioned, it seems we have to handle the inactive-node connection check on top of the tedious package.

Can you also share which driver you were using when you experienced the issue?
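
One possible shape for handling this on top of tedious is a retry wrapper around the failing operation. This is only a sketch under assumptions: withConnectionRetry and isConnectionReset are hypothetical names, the retry count and delay are arbitrary, and matching on "ECONNRESET" in the message is based solely on the error text in the log above.

```javascript
// Hypothetical check: treat an error as a connection reset when its code or
// message mentions ECONNRESET, as in "Connection lost - read ECONNRESET".
function isConnectionReset(err) {
  return Boolean(err) && (err.code === 'ECONNRESET' || /ECONNRESET/.test(String(err.message)));
}

// Hypothetical wrapper: run an async operation and retry it only when it
// fails with a connection reset. Any other error is rethrown immediately.
async function withConnectionRetry(operation, { retries = 3, delayMs = 200 } = {}) {
  for (let attempt = 0; ; attempt += 1) {
    try {
      return await operation();
    } catch (err) {
      // Give up on unrelated errors or once the retry budget is spent.
      if (!isConnectionReset(err) || attempt >= retries) throw err;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}

// Usage (runQuery is a placeholder for the application's own query call):
// const rows = await withConnectionRetry(() => runQuery('select 1'));
```

A wrapper like this only hides the symptom; it does not stop a connection attempt from landing on the inactive address in the first place.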

@ml-rex
Author

ml-rex commented Jul 12, 2023

Thanks, Michael, for the answer.
It sounds like what I observed is the expected behavior of the tedious design.

To supplement:
In my case, I do see that the node-mssql driver performs another connection validation before pushing a connection into the pool.
https://github.com/tediousjs/node-mssql/blob/7248e58ff223b2369cb1570005d54e9196c904bf/lib/base/connection-pool.js#L379
https://github.com/tediousjs/node-mssql/blob/7248e58ff223b2369cb1570005d54e9196c904bf/lib/tedious/connection-pool.js#L104
However, I'm still unsure why an unhealthy connection was picked to handle a request.

@Malcolm-Stewart

Hi @ml-rex, the driver does not matter. In my case it was the .NET SqlClient driver.

The way multi-subnet works is that the DNS request returns multiple IP addresses in a random order. The primary server maps the IP address for its subnet to the MAC address of its NIC. The secondary releases its IP address, so it is not connected to anything. If the driver connected to the secondary IP address first, e.g. when not using MultiSubnetFailover, it would normally take 21 seconds to get an error from the network. MultiSubnetFailover (MSF) overcomes this by connecting to all IP addresses in parallel, on the assumption that the primary will respond within a few ms while the secondary will not respond and will error out later. Once a response arrives, the driver cancels the other connection attempts and uses the first connection.

This works really well. But in the case I experienced, a network device thwarted those assumptions. It's generally better to identify and remove the device doing this than to try to predict which IP address should be connected to. Your code could potentially be subject to the same "spoofing" from the device.
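
To make the race concrete, here is a minimal sketch of the parallel strategy described above, written against Node's built-in net module. It is not the tedious or SqlClient implementation; connectInParallel is a hypothetical name and the 5000 ms timeout is an arbitrary assumption. The first address to complete the TCP handshake wins, which is exactly why a device that answers SYN packets for the inactive address can win the race.

```javascript
const net = require('net');

// Hypothetical sketch: open a TCP connection to every address at once and
// keep the first one whose handshake completes. All other attempts are
// destroyed. Nothing here checks what is actually listening on the winner.
function connectInParallel(addresses, port, timeoutMs = 5000) {
  return new Promise((resolve, reject) => {
    if (addresses.length === 0) {
      reject(new Error('No addresses to connect to'));
      return;
    }

    const sockets = [];
    let failures = 0;
    let settled = false;

    // Give up if no address completes a handshake in time.
    const timer = setTimeout(() => {
      if (settled) return;
      settled = true;
      sockets.forEach((s) => s.destroy());
      reject(new Error(`No address accepted a connection within ${timeoutMs} ms`));
    }, timeoutMs);

    for (const address of addresses) {
      const socket = net.connect({ host: address, port });
      sockets.push(socket);

      // Reject only once every attempt has failed. This listener stays
      // attached to the winner, so the caller should add its own handler.
      socket.on('error', (err) => {
        failures += 1;
        if (settled || failures < addresses.length) return;
        settled = true;
        clearTimeout(timer);
        reject(err);
      });

      // First completed handshake wins; cancel the remaining attempts.
      socket.once('connect', () => {
        if (settled) {
          socket.destroy();
          return;
        }
        settled = true;
        clearTimeout(timer);
        sockets.filter((s) => s !== socket).forEach((s) => s.destroy());
        resolve(socket);
      });
    }
  });
}

// Usage (placeholder addresses, 1433 assumed as the SQL Server port):
// connectInParallel(['DR_IP', 'PRIMARY_IP'], 1433).then((socket) => {
//   console.log('connected to', socket.remoteAddress);
// });
```

In the log from the original post, the DR address completed its handshake first, so a strategy of this shape would have picked it and only failed later, when the PreLogin packet was sent.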
