Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mirrored mode keep-alive not working #10741

Open
1 of 2 tasks
dorssel opened this issue Nov 12, 2023 · 4 comments
Open
1 of 2 tasks

mirrored mode keep-alive not working #10741

dorssel opened this issue Nov 12, 2023 · 4 comments
Labels

Comments

@dorssel
Copy link

dorssel commented Nov 12, 2023

Windows Version

Microsoft Windows [Version 10.0.22631.2506]

WSL Version

2.0.9.0

Are you using WSL 1 or WSL 2?

  • WSL 2
  • WSL 1

Kernel Version

5.15.133.1-1

Distro Version

Ubuntu 22.04.3 LTS

Other Software

usbipd-win 3.2.0

Repro Steps

Attach a USB device with usbipd-win, and let the device be idle (no actual USB traffic).
After attaching a USB device that is otherwise idle, the connection gets closed after 20 seconds. This is exactly the total keep-alive timeout (including retries) that should keep the connection alive.

NOTE: This will happen for any socket that enables keep-alive on the Windows (host) side. Only for mirrored mode with connections to 127.0.0.1 (WSL to host over loopback0).

Expected Behavior

The keep-alive should work normally and the connection should not be dropped.

Actual Behavior

The connection gets closed after 20 seconds. This is exactly the total keep-alive timeout (including retries) that should keep the connection alive.

NOTE: This will happen for any socket that enables keep-alive on the Windows (host) side. It is not USBIP related per se, but that is the software that displays this behavior. Only for mirrored mode with connections to 127.0.0.1 (WSL to host over loopback0).

Diagnostic Logs

For a working setup (not mirror mode), a tcpdump -i eth0 "port 3240" shows correct behavior:

23:13:57.275975 IP 172.28.239.34.58078 > Seattle.3240: Flags [.], ack 1, win 501, options [nop,nop,TS val 392565456 ecr 37383378,nop,nop,sack 1 {0:1}], length 0
23:14:07.288404 IP Seattle.3240 > 172.28.239.34.58078: Flags [.], seq 0:1, ack 1, win 8193, length 1
23:14:07.288455 IP 172.28.239.34.58078 > Seattle.3240: Flags [.], ack 1, win 501, options [nop,nop,TS val 392575469 ecr 37383378,nop,nop,sack 1 {0:1}], length 0
23:14:17.291212 IP Seattle.3240 > 172.28.239.34.58078: Flags [.], seq 0:1, ack 1, win 8193, length 1
23:14:17.291248 IP 172.28.239.34.58078 > Seattle.3240: Flags [.], ack 1, win 501, options [nop,nop,TS val 392585471 ecr 37383378,nop,nop,sack 1 {0:1}], length 0
23:14:27.307626 IP Seattle.3240 > 172.28.239.34.58078: Flags [.], seq 0:1, ack 1, win 8193, length 1
23:14:27.307665 IP 172.28.239.34.58078 > Seattle.3240: Flags [.], ack 1, win 501, options [nop,nop,TS val 392595488 ecr 37383378,nop,nop,sack 1 {0:1}], length 0

Analysis: every 10 seconds a new keep-alive is sent, which is properly replied to (ACK).

For the bug (mirror mode, connection to 127.0.0.1), a tcpdump -i loopback0 "port 3240" shows:

13:26:29.344767 IP localhost.3240 > localhost.36440: Flags [.], seq 843210788:843210789, ack 1192587811, win 8193, length 1
13:26:29.344844 IP localhost.36440 > localhost.3240: Flags [R], seq 1192587811, win 0, length 0
13:26:29.859170 IP localhost.3240 > localhost.36440: Flags [.], seq 0:1, ack 1, win 8193, length 1
13:26:29.859245 IP localhost.36440 > localhost.3240: Flags [R], seq 1192587811, win 0, length 0
13:26:30.374486 IP localhost.3240 > localhost.36440: Flags [.], seq 0:1, ack 1, win 8193, length 1
13:26:30.374602 IP localhost.36440 > localhost.3240: Flags [R], seq 1192587811, win 0, length 0
13:26:30.889009 IP localhost.3240 > localhost.36440: Flags [.], seq 0:1, ack 1, win 8193, length 1
13:26:30.889095 IP localhost.36440 > localhost.3240: Flags [R], seq 1192587811, win 0, length 0
13:26:31.400310 IP localhost.3240 > localhost.36440: Flags [.], seq 0:1, ack 1, win 8193, length 1
13:26:31.400385 IP localhost.36440 > localhost.3240: Flags [R], seq 1192587811, win 0, length 0
13:26:31.902468 IP localhost.3240 > localhost.36440: Flags [.], seq 0:1, ack 1, win 8193, length 1
13:26:31.902509 IP localhost.36440 > localhost.3240: Flags [R], seq 1192587811, win 0, length 0
13:26:32.404331 IP localhost.3240 > localhost.36440: Flags [.], seq 0:1, ack 1, win 8193, length 1
13:26:32.404426 IP localhost.36440 > localhost.3240: Flags [R], seq 1192587811, win 0, length 0
13:26:32.919185 IP localhost.3240 > localhost.36440: Flags [.], seq 0:1, ack 1, win 8193, length 1
13:26:32.919258 IP localhost.36440 > localhost.3240: Flags [R], seq 1192587811, win 0, length 0
13:26:33.431602 IP localhost.3240 > localhost.36440: Flags [.], seq 0:1, ack 1, win 8193, length 1
13:26:33.431706 IP localhost.36440 > localhost.3240: Flags [R], seq 1192587811, win 0, length 0
13:26:33.932492 IP localhost.3240 > localhost.36440: Flags [.], seq 0:1, ack 1, win 8193, length 1
13:26:33.932570 IP localhost.36440 > localhost.3240: Flags [R], seq 1192587811, win 0, length 0
13:26:34.433984 IP localhost.3240 > localhost.60710: Flags [R.], seq 2471, ack 569, win 0, length 0

Analysis:

  • The very first keep-alive request is already weird. It has length 1 it is not a "nop" (no-operation), which would normally be the case for keep-alive.
  • The reply is actually a "connection reset" (flag: R)
  • still, the host socket then retries the same weird request (length 1) every 500 ms, for 10 times. This is indeed what the host software configured for keep-alives (10 seconds delay, 10 retries, 500 ms apart).
  • Every time a connection reset is replied by WSL.
  • Only after the final retry, the host resets the connection, since it never got an ACK for the keep-alive it thought it was sending.
  • This appears to be a bug on the Windows side, where mirrored sockets send the wrong packet for keep-alives. WSL notices that the packet is wrong and resets the connection. Yet, the host does not actually ACK the reset until after it is done sending its corrupted keep-alives.

This may actually be 2 bugs: a) the keep-alive packet is wrong and b) once the socket is doing keep-alives (or so it thinks), it actually no longer honors connection resets until all keep-alives are sent.

Note that this is completely outside the control of any software; this is all kernel. The bug exists in the mirror socket on the Windows side.

@chanpreetdhanjal
Copy link

Could you please follow the steps below and attach the diagnostic logs? https://github.com/microsoft/WSL/blob/master/CONTRIBUTING.md#collect-wsl-logs-for-networking-issues

@dorssel
Copy link
Author

dorssel commented Nov 26, 2023

@keith-horton
Copy link
Member

Hi @dorssel .

In your repro, I'm seeing the following TCP connections to a loopback listener on the host listening on 127.0.0.1:3240.

Can you please show what connections are problematic in-to/out-of the Linux container?

the first connection is just a quick connect/closesocket:
(3-way sync, 4-way fin). the local port for this connection is 60766.

1 16:57:41.300047 loopback0 Out IP 127.0.0.1.60766 > 127.0.0.1.3240: Flags [S], seq 802777829, win 64240, options [mss 1460,sackOK,TS val 2727195490 ecr 0,nop,wscale 7], length 0
2 16:57:41.300353 loopback0 In IP 127.0.0.1.3240 > 127.0.0.1.60766: Flags [S.], seq 808561084, ack 802777830, win 65535, options [mss 1460,nop,wscale 8,sackOK,TS val 7511479 ecr 2727195490], length 0
3 16:57:41.300385 loopback0 Out IP 127.0.0.1.60766 > 127.0.0.1.3240: Flags [.], ack 1, win 502, options [nop,nop,TS val 2727195490 ecr 7511479], length 0
4 16:57:41.300426 loopback0 Out IP 127.0.0.1.60766 > 127.0.0.1.3240: Flags [F.], seq 1, ack 1, win 502, options [nop,nop,TS val 2727195490 ecr 7511479], length 0
5 16:57:41.300655 loopback0 In IP 127.0.0.1.3240 > 127.0.0.1.60766: Flags [.], ack 2, win 8195, options [nop,nop,TS val 7511479 ecr 2727195490], length 0
6 16:57:41.301064 loopback0 In IP 127.0.0.1.3240 > 127.0.0.1.60766: Flags [F.], seq 1, ack 2, win 8195, options [nop,nop,TS val 7511480 ecr 2727195490], length 0
7 16:57:41.301087 loopback0 Out IP 127.0.0.1.60766 > 127.0.0.1.3240: Flags [.], ack 2, win 502, options [nop,nop,TS val 2727195491 ecr 7511480], length 0

The second connection shows a successful TCP connection established, data was pushed and ACKd, then the connection was terminated (this Linux client RST/terminated this connection). The local port for this connection is 60768:

8 16:57:41.495899 loopback0 Out IP 127.0.0.1.60768 > 127.0.0.1.3240: Flags [S], seq 124944460, win 64240, options [mss 1460,sackOK,TS val 2727195685 ecr 0,nop,wscale 7], length 0
9 16:57:41.496200 loopback0 In IP 127.0.0.1.3240 > 127.0.0.1.60768: Flags [S.], seq 2213418618, ack 124944461, win 65535, options [mss 1460,nop,wscale 8,sackOK,TS val 7511675 ecr 2727195685], length 0
10 16:57:41.496230 loopback0 Out IP 127.0.0.1.60768 > 127.0.0.1.3240: Flags [.], ack 1, win 502, options [nop,nop,TS val 2727195686 ecr 7511675], length 0
11 16:57:41.496241 loopback0 Out IP 127.0.0.1.60768 > 127.0.0.1.3240: Flags [P.], seq 1:9, ack 1, win 502, options [nop,nop,TS val 2727195686 ecr 7511675], length 8
12 16:57:41.496245 loopback0 Out IP 127.0.0.1.60768 > 127.0.0.1.3240: Flags [P.], seq 9:41, ack 1, win 502, options [nop,nop,TS val 2727195686 ecr 7511675], length 32
13 16:57:41.496470 loopback0 In IP 127.0.0.1.3240 > 127.0.0.1.60768: Flags [.], ack 9, win 8195, options [nop,nop,TS val 7511675 ecr 2727195686], length 0
14 16:57:41.496529 loopback0 In IP 127.0.0.1.3240 > 127.0.0.1.60768: Flags [.], ack 41, win 8195, options [nop,nop,TS val 7511675 ecr 2727195686], length 0
<<<<<snip for brevity: lots of data transfer that's successfully ack'd >>>>>
55 16:57:43.405402 loopback0 In IP 127.0.0.1.3240 > 127.0.0.1.60768: Flags [P.], seq 2393:2471, ack 569, win 8193, options [nop,nop,TS val 7513584 ecr 2727197588], length 78
56 16:57:43.454568 loopback0 Out IP 127.0.0.1.60768 > 127.0.0.1.3240: Flags [.], ack 2471, win 501, options [nop,nop,TS val 2727197644 ecr 7513584], length 0
77 16:57:58.577118 loopback0 In IP 127.0.0.1.3240 > 127.0.0.1.60768: Flags [R.], seq 2471, ack 569, win 0, length 0

The third connection shows a stale connection where the Linux client is not correctly processing the RST from the target, and it keeps trying to ack.

57 16:57:53.465887 loopback0 In IP 127.0.0.1.3240 > 127.0.0.1.43959: Flags [.], seq 2213421088:2213421089, ack 124945029, win 8193, length 1
58 16:57:53.465999 loopback0 Out IP 127.0.0.1.43959 > 127.0.0.1.3240: Flags [R], seq 124945029, win 0, length 0
59 16:57:53.977016 loopback0 In IP 127.0.0.1.3240 > 127.0.0.1.43959: Flags [.], seq 0:1, ack 1, win 8193, length 1
60 16:57:53.977095 loopback0 Out IP 127.0.0.1.43959 > 127.0.0.1.3240: Flags [R], seq 124945029, win 0, length 0
61 16:57:54.489141 loopback0 In IP 127.0.0.1.3240 > 127.0.0.1.43959: Flags [.], seq 0:1, ack 1, win 8193, length 1
62 16:57:54.489172 loopback0 Out IP 127.0.0.1.43959 > 127.0.0.1.3240: Flags [R], seq 124945029, win 0, length 0
63 16:57:55.000189 loopback0 In IP 127.0.0.1.3240 > 127.0.0.1.43959: Flags [.], seq 0:1, ack 1, win 8193, length 1
64 16:57:55.000231 loopback0 Out IP 127.0.0.1.43959 > 127.0.0.1.3240: Flags [R], seq 124945029, win 0, length 0
65 16:57:55.511250 loopback0 In IP 127.0.0.1.3240 > 127.0.0.1.43959: Flags [.], seq 0:1, ack 1, win 8193, length 1
66 16:57:55.511317 loopback0 Out IP 127.0.0.1.43959 > 127.0.0.1.3240: Flags [R], seq 124945029, win 0, length 0
67 16:57:56.025376 loopback0 In IP 127.0.0.1.3240 > 127.0.0.1.43959: Flags [.], seq 0:1, ack 1, win 8193, length 1
68 16:57:56.025462 loopback0 Out IP 127.0.0.1.43959 > 127.0.0.1.3240: Flags [R], seq 124945029, win 0, length 0
69 16:57:56.537479 loopback0 In IP 127.0.0.1.3240 > 127.0.0.1.43959: Flags [.], seq 0:1, ack 1, win 8193, length 1
70 16:57:56.537560 loopback0 Out IP 127.0.0.1.43959 > 127.0.0.1.3240: Flags [R], seq 124945029, win 0, length 0
71 16:57:57.038081 loopback0 In IP 127.0.0.1.3240 > 127.0.0.1.43959: Flags [.], seq 0:1, ack 1, win 8193, length 1
72 16:57:57.038106 loopback0 Out IP 127.0.0.1.43959 > 127.0.0.1.3240: Flags [R], seq 124945029, win 0, length 0
73 16:57:57.552446 loopback0 In IP 127.0.0.1.3240 > 127.0.0.1.43959: Flags [.], seq 0:1, ack 1, win 8193, length 1
74 16:57:57.552536 loopback0 Out IP 127.0.0.1.43959 > 127.0.0.1.3240: Flags [R], seq 124945029, win 0, length 0
75 16:57:58.062808 loopback0 In IP 127.0.0.1.3240 > 127.0.0.1.43959: Flags [.], seq 0:1, ack 1, win 8193, length 1
76 16:57:58.062888 loopback0 Out IP 127.0.0.1.43959 > 127.0.0.1.3240: Flags [R], seq 124945029, win 0, length 0

@dorssel
Copy link
Author

dorssel commented Feb 14, 2024

The second connection is problematic. Maybe I can explain how you can reproduce this with minimal effort:

  1. on the Windows host:
    • create a TCP socket() (any address, any port, say 0.0.0.0:9000)
    • configure keep-alive (say: 5 seconds delay, 10 retries, 1 second interval)
    • listen()
    • accept()
    • read()
  2. on the WSL mirrored-networking side:
    • connect() to TCP 127.0.0.1:9000
    • do nothing (sleep() forever)

Expected result:

  1. The connection is created (SYN-ACKed) by both sides, after which it is totally idle
  2. After 5 seconds, keep-alives should be sent by the Windows host every 5 seconds (and ACK-ed by WSL)
    (note: this is totally transparent for the 2 user mode programs, both Linux and Windows; this is all TCP stack kernel only)
  3. The connection stays alive forever
    (note: no actual data is ever sent, the read() will simply block forever)

Actual result:

  1. A broken packet (intended to be the first keep-alive) is sent after 5 seconds by the Windows host
  2. The Linux connection will error out upon reception of the first broken packet, it will send a TCP CONNECTION RESET
    (note: the program is still sleeping of course, but on the kernel level, the socket is terminated)
  3. The Windows host keeps sending broken keep-alives, 1 second apart, for 10 times
    (note: since WSL already terminated the socket, each of these is replied to with a TCP CONNECTION RESET)
  4. The Windows read() ends with a ECONNRESET (only after sending all keep-alives, even though WSL already reset the connection 10 seconds before during (2)

Root cause: the actual keep-alive packet is malformed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants