Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: adjust ConnMgr to 32/96 #9483

Merged
merged 1 commit into from
Dec 8, 2022
Merged

Conversation

lidel
Copy link
Member

@lidel lidel commented Dec 8, 2022

This is follow-up for #9467

Rationale

#9467 (comment) from @BigLep :

I sanity checked the number of Low=50, High=150

By default if a Kubo node has 8GB of total memory, a maxMemory for libp2p will be set to 2GB. That will cause System.ConnsInbound to be 128 I believe (looks like 64 per GB of memory here: https://github.com/libp2p/go-libp2p/blob/master/p2p/host/resource-manager/limit_defaults.go#L345 since Kubo is using those same libp2p defaults: https://github.com/ipfs/kubo/blob/master/core/node/libp2p/rcmgr_defaults.go#L94). As a result, in this particular setup, our high water mark may not get hit (unless there are sufficient outbound connections). I know there isn't a perfect number here and we have no stats on the systems that folks are using. I would be in favor of dropping the ConnMgr.High to 120 instead.

@lidel lidel requested a review from BigLep December 8, 2022 15:22
@lidel
Copy link
Member Author

lidel commented Dec 8, 2022

@BigLep thank you for looking into this.

The number of connections at a time will usually be slightly higher than HighWater (highWater triggers GC, but in the meantime new peers connected to us because we are dhtserver, and next gracePeriod is 1-20s away).

For example, setting HighWater to 96 eventually typically hovers between 150-170.

I've set it to 120 as you suggested but suggest going with bit lower values (32/96) if we really want to move average connection count down – lmk if I should update this PR to 32/96, or keep it 50/120.

@Jorropo
Copy link
Contributor

Jorropo commented Dec 8, 2022

Looks like we should figure out scaling this based on ressource manager or system memory aswell.

@BigLep
Copy link
Contributor

BigLep commented Dec 8, 2022

Interesting. I don't know enough here and I'm not sure the full ramification of the tradeoff. For example, we see really high success rates of finding content over bitswap (https://www.notion.so/pl-strflt/Bitswap-measurements-draft-report-c64694c1a9ae4389a86e34e7bf2baf6e#e0b2e3258b304ae0b85bfcfba8c8db41 ) but that is also with a lot of connected peers.

We have a strong datapoint of what has been working for Brave for a year plus now which his what makes me comfortable to drop it from the previous really high values.

Maybe just go with 120 now and we get probelab engagement more for future?

@Jorropo
Copy link
Contributor

Jorropo commented Dec 8, 2022

@BigLep imo we should not look at the bitswap flood success rate and try to chase that.

This comes at the cost of flooding the network with thousands of requests per second which create lots of load, we will stop flooding the network in Q1 or Q2 of 2023 when the work of move the bytes is released mainstream.
(To be clear I'm not saying we we will stop flooding completely, but either we will find an efficient way to do that and or we will add metrics to only flood / optimistically request nodes that most likely have content (like pinning services).)

@BigLep
Copy link
Contributor

BigLep commented Dec 8, 2022

@Jorropo : ack understood. I'm not saying we should keep flooding. The area of uncertainty for me is around how much unexpected benefit are we getting from flooding and the potential impact of reducing it without having something else filling in like new transports from Move the Bytes.

I guess the strongest datapoint we have right now is that Brave/Desktop have lower limits and they work. As a result, I'm comfortable reducing the numbers to 32/96 as you suggest Lidel and we can increase if we got this wrong.

@lidel
Copy link
Member Author

lidel commented Dec 8, 2022

Ack, let's try 32/96 in RC1, we can adjust in RC2 if needed.

@lidel lidel changed the title fix: adjust DefaultConnMgrHighWater to 120 fix: adjust ConnMgr to 32/96 Dec 8, 2022
@lidel lidel force-pushed the feat/even-lower-connection-pool branch from 953005f to fde94af Compare December 8, 2022 20:36
@lidel lidel changed the title fix: adjust ConnMgr to 32/96 feat: adjust ConnMgr to 32/96 Dec 8, 2022
@lidel lidel enabled auto-merge (squash) December 8, 2022 20:38
@lidel lidel merged commit 1f63640 into master Dec 8, 2022
@lidel lidel deleted the feat/even-lower-connection-pool branch December 8, 2022 20:47
@guillaumemichel
Copy link

I don't have much visibility on what System.ConnsInbound corresponds to. According to the Draft Bitswap Measurements report, on average a Bitswap successful request solicits 820 peers. However, I am not sure whether these connections are considered as Inbound or Outbound.

We also observed that most of the content was served by a small number of peers, and most of them are nft.storage and other large providers. Hence, these large providers may have different Connection Manager settings adapted to serve content to many peers.

Top 1 peers serve 10.78% of the successful bitswap requests
Top 3 peers serve 26.07% of the successful bitswap requests
Top 5 peers serve 38.56% of the successful bitswap requests
Top 10 peers serve 59.07% of the successful bitswap requests
Top 20 peers serve 76.64% of the successful bitswap requests
Top 50 peers serve 85.65% of the successful bitswap requests
Top 355 peers serve 98.64% of the successful bitswap requests
Top 711 peers serve 100.0% of the successful bitswap requests

@lidel
Copy link
Member Author

lidel commented Jan 4, 2023

@guillaumemichel System.ConnsInbound is the limit set by go-libp2p's resource manager (Kubo's config/docs: Swarm.ResourceMgr).

The direction of active connection can be inspected via ipfs swarm peers -v. My understanding is that only informs who initiated the bidirectional connection. The existing connection can be reused later, so it is possible the "incoming" connection is used later for outgoing bitswap queries.

My gut feeling is this is unlikely, unless some peering agreement was set up between peers.
That is to say, in most cases, ConnsInbound "limits other peers from using my resources for dht/bitswap", and not my own ability to use other peers (iirc outgoing connections have no limit)

@guillaumemichel
Copy link

I think that reducing the System.ConnsInbound is the right thing to do concerning go-bitswap. It will reduce spam 8x from nodes that have the reduced System.ConnsInbound.

This will likely result in a drop of the Bitswap discovery success rate, but it isn't a big deal as content routers (DHT, IPNI) are fast. However reducing Bitswap ProviderSearchDelay from 1s to 0 seems critical as described in #8807. The Effectiveness of Bitswap Discovery Process report discusses improvements suggestions for the Bitswap discovery process.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Archived in project
Development

Successfully merging this pull request may close these issues.

4 participants