feat: ping peers on routing table refresh #810
We have seen in the past that there are peers in the IPFS DHT that let you connect to them but then refuse to speak any protocol. This was mainly due to the resource manager killing the connection if limits were exceeded. Such peers are pushed to the edge of the DHT, meaning they already get pruned from lower buckets. However, they won't get pruned from higher ones, because we only try to connect to them and never speak any protocol on that connection. This change adds a ping message to the liveness check on routing table refreshes.
LGTM
Would be better with a test, but this is very annoying to test (annoying to write and annoying to maintain because of the cost of updating mocks). This most likely works, so I'm fine without a test.
IMO we don't need to periodically check whether nodes still answer to DHT queries as expected. Preventing unresponsive nodes from being added to the RT should be sufficient. See #811
Periodically interacting with peers beyond just connecting to them would detect if they have become overwhelmed with requests (e.g., reached their resource manager limits). This can happen over time so I don't think just checking once upon inserting them to our routing table is enough.
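Another remark: I'm using the ping package to probe the remote peer instead of the ping message from the /ipfs/kad/1.0.0 protocol. Just wanted to point out that this means we require all peers in the network to speak that other protocol. Opinions @guseggert?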
Actually, using the DHT ping endpoint lets us verify that the DHT protocol stream limits worked at some point: a peer might have very high ping protocol limits but very low DHT limits. I think the deprecation has something to do with rust-libp2p not implementing it because it is unused, or something like that? (cc @mxinden) If using the DHT ping endpoint is actually fine, we should revert my request to use the ping protocol (this will allow us to test the correct per-protocol stream limits).
Why not use a DHT findpeers request directly?
Why do a more expensive request when a simpler one does the job? The Kademlia ping and a findpeer will exercise mostly the same codepaths through the stack (the only differences should be in the Kademlia handler, where you would switch on the message type).
This reverts commit 16823c3.
From our discussion a few minutes ago:
|
Yes, please don't use the deprecated Kademlia Ping. See specification:
https://github.com/libp2p/specs/tree/master/kad-dht
I don't recall the exact reasoning. This has been way before my time. Though this is my intuition, yes. #31 might help a bit.
I am not aware of any way the deprecation is related to rust-libp2p.
dht.go
Outdated
@@ -365,10 +365,15 @@ func makeRtRefreshManager(dht *IpfsDHT, cfg dhtcfg.Config, maxLastSuccessfulOutb
		return err
	}

	pingFnc := func(ctx context.Context, p peer.ID) error {
		return dht.protoMessenger.Ping(ctx, p)
I'm using the ping package to probe the remote peer instead of the ping message from the /ipfs/kad/1.0.0 protocol
Not deeply familiar with the codebase. Just double checking, is this really not using the Kademlia Ping mechanism?
After yesterday's discussion, we changed it back to the Kademlia PING message. So this is indeed the Kademlia Ping.
Nice, thanks for the clarification. Great to have the information from somewhere authoritative, although it's still unclear what the reasoning there was. Now, we have three options:
OR
OR
I'd vote for |
First off, do we have agreement that this is a temporary hack? I.e. that this is to work around existing nodes with a misconfigured resource manager? And that the long-term solution is to somehow upgrade these nodes?
Do I understand correctly that they would be pruned from the routing table whenever we send an RPC (e.g.
If indeed this is a temporary fix only, I am reluctant to change the Kademlia specification for it.
For me, this is not a temporary hack. It's another safeguard against misconfigured nodes. We're not preventing any attacks here. However, I totally agree that the priority is fixing the root cause. We have already put things in motion to do that. Provably, the network has significantly picked up on our proposed changes: https://github.com/protocol/network-measurements/blob/master/reports/2023/calendar-week-5/ipfs/README.md#agents, and we expect things to improve in the near future even more. As a follow-up step, I'm also in favour of doing a similar check upon insertion of a peer to the routing table as proposed by @guillaumemichel in #811. With both of these changes, we could have notably mitigated the current performance hit to DHT lookup latencies.
That's correct. For our current resource manager challenges, we would not only speed up this process but actually just begin to prune them at all. Right now, these unresponsive nodes we observe stay in routing tables basically forever.
No, we want to prevent this kind of problem from happening again in the future (node misconfiguration, resource manager, implementation bugs or any other reason that we cannot predict).
I wouldn't say that it is a process speed up. If the peerids close to you are unresponsive to DHT queries, but responsive to ping, in the current state they are never pruned. A node (almost) never sends DHT queries to remote nodes close to itself, as they probably store the same Provider Records as the node itself, and the probability that the content you try to access is provided by a remote node very close to you (in XOR distance) is very small.
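For readers less familiar with XOR distance, the sketch below shows how closeness can be measured as the common prefix length (CPL) of two 256-bit keys. It assumes keys are SHA-256 hashes of peer IDs and that closer peers share a longer prefix and so land in higher buckets; the helper names are hypothetical and this is not the go-libp2p-kbucket implementation.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"math/bits"
)

// xorDistance returns the bytewise XOR of two 256-bit keys.
func xorDistance(a, b [32]byte) [32]byte {
	var d [32]byte
	for i := range a {
		d[i] = a[i] ^ b[i]
	}
	return d
}

// commonPrefixLen counts the leading bits shared by a and b.
// A larger value means a smaller XOR distance, i.e. a closer peer.
func commonPrefixLen(a, b [32]byte) int {
	for i := range a {
		if x := a[i] ^ b[i]; x != 0 {
			return i*8 + bits.LeadingZeros8(x)
		}
	}
	return 256 // identical keys
}

func main() {
	// Hypothetical peer identifiers, hashed into the key space.
	self := sha256.Sum256([]byte("self-peer"))
	other := sha256.Sum256([]byte("other-peer"))

	fmt.Printf("distance: %x\n", xorDistance(self, other))
	fmt.Println("common prefix length:", commonPrefixLen(self, other))
}
```

For uniformly random keys, each additional shared prefix bit halves the number of candidate peers, which is why the peers closest to a node are few and, as argued above, rarely the target of its own DHT queries.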
+1
Good point. I did not consider this.
For what my opinion is worth, this sounds reasonable to me. Long term I would still wish for this to no longer be needed, i.e. I would wish for the majority of nodes to properly answer Kademlia requests when they advertise support for the Kademlia protocol. Though that might just be wishful thinking. Thanks for expanding here @dennis-tra and @guillaumemichel.
Seems that everyone is fine with the current patch. I'll merge by the end of the day unless someone complains.
@dennis-tra: are you getting this bubbled up into Kubo?
I was surprised not to see a corresponding update to https://github.com/libp2p/go-libp2p-kad-dht/commits/master/version.json but I see @Jorropo did this. Now we just make sure this bubbles up.
@BigLep version v0.21.0 has been created for go-libp2p v0.25