Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: ping peers on routing table refresh #810

Merged
merged 5 commits into from
Feb 11, 2023

Conversation

dennis-tra
Copy link
Contributor

@dennis-tra dennis-tra commented Feb 3, 2023

We have seen in the past that there are peers in the IPFS DHT that let you connect to them but then refuse to speak any protocol. This was mainly due to the resource manager killing the connection if limits were exceeded. We have seen that such peers are already pushed to the edge of the DHT - meaning they get pruned from lower buckets. However, they won't get pruned from higher ones because we only try to connect to them and not speak anything on that connection.

This change adds a ping message to the liveness check on routing table refreshes.

We have seen in the past that there are peers in the IPFS DHT that let you connect to them but then refuse to speak any protocol. This was mainly due to resource manager killing the connection if limits were exceeded. We have seen that such peers are pushed to the edge of the DHT - meaning, they get already pruned from lower buckets. However, they won't get pruned from higher ones, because we only try to connect to them and not speak anything on that connection.

This change adds a ping message to the liveness check on routing table refreshes.
dht.go Outdated Show resolved Hide resolved
Copy link
Contributor

@Jorropo Jorropo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
Would be better with a test but this is very annoying to test (annoying to write and annoying to maintain because of the cost to update mocks), this is most likely work, so I'm fine without a test.

@guillaumemichel
Copy link
Contributor

IMO we don't need to periodically check whether nodes still answer to DHT queries as expected. Preventing unresponsive nodes from being added to the RT should be sufficient.

See #811

@dennis-tra
Copy link
Contributor Author

Periodically interacting with peers beyond just connecting to them would detect if they have become overwhelmed with requests (e.g., reached their resource manager limits). This can happen over time so I don't think just checking once upon inserting them to our routing table is enough.

@dennis-tra
Copy link
Contributor Author

Another remark:

I'm using the ping package to probe the remote peer instead of the ping message from the /ipfs/kad/1.0.0 protocol because @Jorropo, you said we wanted to get rid of the ping message from the /ipfs/kad/1.0.0 protocol. What was the reasoning here again? Is it just because we have a dedicated ping package/protocol for that?

Just wanted to point out that this means we require all peers in the network to speak that other ping protocol which makes the /ipfs/kad/1.0.0 protocol dependent on it. Not sure if I like such inter-protocol dependencies.

Opinions @guseggert ?

@Jorropo
Copy link
Contributor

Jorropo commented Feb 6, 2023

Actually, using the ping endpoint allows to ensure the dht protocol stream limits worked at some point, you might have very high ping protocol limits but very low DHT limits, I think the deprecation has something to do with rust-libp2p that implement it because it is unused or something ? (cc @mxinden )


If using the dht ping endpoint is actually fine, we should revert my request to use the ping protocol (this will allows to test the correct per protocol stream limits).

@guillaumemichel
Copy link
Contributor

Why not using directly a DHT findpeers request?

@Jorropo
Copy link
Contributor

Jorropo commented Feb 6, 2023

Why not using directly a DHT findpeers request?

Why do a more expensive request when a simpler one does the job ? The kadamelia ping and a findpeer will excercise mostly the same codepaths through the stack (only differences should be in the kadamelia handler, you would switch on the message type).

Copy link
Contributor

@Jorropo Jorropo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
Would be better with a test but this is very annoying to test (annoying to write and annoying to maintain because of the cost to update mocks), this is most likely work, so I'm fine without a test.

@dennis-tra
Copy link
Contributor Author

From our discussion a few minutes ago:

  • it actually doesn't really matter to do ping / find_node
  • ping saves minimal CPU cycles
  • As long as we don't do anything with the find node response we could just use ping
  • if we plan to add further peer verification we could easily change this to a find_node request

@mxinden
Copy link
Member

mxinden commented Feb 8, 2023

I'm using the ping package to probe the remote peer instead of the ping message from the /ipfs/kad/1.0.0 protocol

Yes, please don't use the deprecated Kademlia Ping. See specification:

  • PING: Deprecated message type replaced by the dedicated [ping
    protocol][ping]. Implementations may still handle incoming PING requests for
    backwards compatibility. Implementations must not actively send PING
    requests.

https://github.com/libp2p/specs/tree/master/kad-dht

Is it just because we have a dedicated ping package/protocol for that?

I don't recall the exact reasoning. This has been way before my time. Though this is my intuition, yes. #31 might hep a bit.

think the deprecation has something to do with rust-libp2p that implement it because it is unused or something ? (cc @mxinden )

I am not aware of any way the deprecation is related to rust-libp2p.

dht.go Outdated
@@ -365,10 +365,15 @@ func makeRtRefreshManager(dht *IpfsDHT, cfg dhtcfg.Config, maxLastSuccessfulOutb
return err
}

pingFnc := func(ctx context.Context, p peer.ID) error {
return dht.protoMessenger.Ping(ctx, p)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm using the ping package to probe the remote peer instead of the ping message from the /ipfs/kad/1.0.0 protocol

Not deeply familiar with the codebase. Just double checking, is this really not using the Kademlia Ping mechanism?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After yesterdays discussion, we changed it back to the Kademlia PING message. So this is indeed the Kademlia Ping.

@dennis-tra
Copy link
Contributor Author

Yes, please don't use the deprecated Kademlia Ping. See specification:

Nice, thanks for the clarification. Great to have the information from somewhere authoritative, although it's still unclear what the reasoning there was. Now, we have three options:

  1. use the ping package
    • wouldn't exercise DHT protocol resource manager limits - however, that's partially exactly what I would want to test here)
    • puts a dependency from the DHT protocol on the ping package. Every peer speaking the DHT protocol, now must speak the ping protocol as well.

OR

  1. Just probe with a FIND_NODES instead of a PING message. Yesterday, we said it doesn't really matter what we use. We chose PING to just save some CPU cycles.

OR

  1. Change the spec :D

I'd vote for 2..

@mxinden
Copy link
Member

mxinden commented Feb 10, 2023

First off, do we have agreement that this is a temporary hack? I.e. that this is to work around existing nodes with miss-configured resource manager? And that the long term solution is to somehow upgrade these nodes?

However, they won't get pruned from higher ones because we only try to connect to them and not speak anything on that connection.

Do I understand correctly that they would be pruned from the routing table whenever we send an RPC (e.g. FINDNODE) to them AND they don't respond? If so, the periodic test RPC would just speed up this pruning process, correct?

Change the spec :D

If indeed this is a temporary fix only, I am reluctant to change the Kademlia specification for it.

@dennis-tra
Copy link
Contributor Author

First off, do we have agreement that this is a temporary hack?

For me, this is not a temporary hack. It's another safeguard against misconfigured nodes. We're not preventing any attacks here.

However, I totally agree that the priority is fixing the root cause. We have already put things in motion to do that. Provably, the network has significantly picked up on our proposed changes: https://github.com/protocol/network-measurements/blob/master/reports/2023/calendar-week-5/ipfs/README.md#agents, and we expect things to improve in the near future even more.

As a follow-up step, I'm also in favour of doing a similar check upon insertion of a peer to the routing table as proposed by @guillaumemichel in #811. With both of these changes, we could have notably mitigated the current performance hit to DHT lookup latencies.

Do I understand correctly that they would be pruned from the routing table whenever we send an RPC (e.g. FINDNODE) to them AND they don't respond? If so, the periodic test RPC would just speed up this pruning process, correct?

That's correct. For our current resource manager challenges, we would not only speed up this process but actually just begin to prune them at all. Right now, these unresponsive nodes we observe stay in routing tables basically forever.

@guillaumemichel
Copy link
Contributor

First off, do we have agreement that this is a temporary hack? I.e. that this is to work around existing nodes with miss-configured resource manager? And that the long term solution is to somehow upgrade these nodes?

No, we want to prevent this kind of problem from happening again in the future (node misconfiguration, resource manager, implementation bugs or any other reason that we cannot predict).

If so, the periodic test RPC would just speed up this pruning process, correct?

I wouldn't say that it is a process speed up. If the peerids close to you are unresponsive to DHT queries, but responsive to ping, in the current state they are never pruned. A node (almost) never sends DHT queries to remote nodes close to itself, as they probably store the same Provider Records as the node itself, and the probability that the content you try to access is provided by a remote node very close to you (in XOR distance) is very small.

Right now, these unresponsive nodes we observe stay in routing tables basically forever.

+1

@mxinden
Copy link
Member

mxinden commented Feb 10, 2023

A node (almost) never sends DHT queries to remote nodes close to itself, as they probably store the same Provider Records as the node itself, and the probability that the content you try to access is provided by a remote node very close to you (in XOR distance) is very small.

Good point. I did not consider this.

Just probe with a FIND_NODES instead of a PING message. Yesterday, we said it doesn't really matter what we use. We chose PING to just save some CPU cycles.

For what my opinion is worth, this sounds reasonable to me. Long term I would still wish for this to no longer be needed, i.e. I would wish for the majority of nodes to properly answer Kademlia requests when they advertise support for the Kademlia protocol. Though that might just be wishful thinking.

Thanks for expanding here @dennis-tra and @guillaumemichel.

@Jorropo
Copy link
Contributor

Jorropo commented Feb 10, 2023

Seems that everyone is fine with the current patch, I'll merge by the end of the day unless someone complain.

@Jorropo Jorropo merged commit 81e7325 into libp2p:master Feb 11, 2023
@BigLep
Copy link

BigLep commented Feb 13, 2023

@dennis-tra : are you getting this bubbled up into Kubo?

@BigLep
Copy link

BigLep commented Feb 13, 2023

I was surprirsed not see a corresponding update to https://github.com/libp2p/go-libp2p-kad-dht/commits/master/version.json but I see @Jorropo did this. Now we just make sure this bubbles up.

@dennis-tra dennis-tra deleted the ping-on-refresh branch February 13, 2023 19:06
@Jorropo
Copy link
Contributor

Jorropo commented Feb 14, 2023

@BigLep version v0.21.0 has been created for go-libp2p v0.25
Capture d’écran du 2023-02-14 09-14-11
I'll bubble

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants