Initialize values for yield normalization during startup #99991
Conversation
Force-pushed from 55e0205 to dfe6a14.
Thanks!
What is this trying to fix exactly? Do we have any numbers for the problem this change is addressing?

@Maoni0 mentioned that during GC initialization some joins involved when changing the heap count end up spinning too much, as the normalization is not done early enough. This is an attempt to address that by doing a couple of short measurements early.

Correct, with DATAS enabled NativeAOT benchmarks are showing a startup regression. This should hopefully help with #94076. But it would be good to validate that.

How does the time spent in extra spinning compare to the time spent on these eager measurements?

In its current state, this change does not change native AOT.

We were planning on porting this and the newer normalization implementation to NativeAOT as well. NativeAOT currently uses the old normalization scheme.

/azp run runtime

Azure Pipelines successfully started running 1 pipeline(s).
Force-pushed from 11f7c24 to cf053ab.
s_normalizationState = NormalizationState::Failed;
return;
LARGE_INTEGER li;
if (!QueryPerformanceFrequency(&li) || li.QuadPart < 1000 * 1000)
This is somewhat pre-existing to the PR here, but I think there are some latent bugs in the general measurement logic...

Notably, Linux always reports 1 billion here for the frequency (1 nanosecond ticks) even if the underlying hardware timer is actually much slower.

Additionally, there are places below where we try to measure the length of time between two QPC calls, and while QPC is guaranteed to be monotonic (never decreasing) there is no guarantee that two separate calls will not return the same value (which can occur if they happen faster than the underlying timer resolution). A simple program such as the following shows this on most modern hardware:
using System;
using System.Runtime.InteropServices;

internal unsafe class Program
{
    private static void Main(string[] args)
    {
        long start = 0;
        _ = QueryPerformanceCounter(&start);

        long end = 0;
        _ = QueryPerformanceCounter(&end);

        Console.WriteLine(end - start); // likely prints 0
    }

    [DllImport("kernel32", ExactSpelling = true)]
    [SuppressGCTransition]
    static extern int QueryPerformanceCounter(long* lpPerformanceCount);
}
Additionally, on some hardware the underlying mechanism used by QPC could have a resolution as low as 10-20 cycles, or it could have a latency of nearly 100-300 ns (detailed in part here: https://learn.microsoft.com/en-us/windows/win32/sysinfo/acquiring-high-resolution-time-stamps). Meanwhile, the pause instruction (x86/x64) can execute in somewhere around 10-20 cycles, 40 cycles, or up to 150 cycles (depending on the hardware in question, with newer and higher core count machines often explicitly having longer delay times). All of these cycle times can be much less than the 100-300 ns latency that a QPC call might have, which can lead to skewed measurements.

We've also seen that yield on Arm64 is known to be a hint which can often be ignored or has a delay so low that many other ecosystems have switched to using the isb instruction instead to achieve a similar effect to the x86/x64 pause instruction.

It might be good to take a deeper look at what this code is actually doing and how accurate it actually is compared to mechanisms which are known to more accurately measure the actual instruction latency in a platform-specific manner.
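For illustration, here is a minimal sketch (not the runtime's code) of the kind of platform-specific measurement alluded to above, using the x86/x64 time-stamp counter instead of QPC so the timer read is far cheaper than the pause being measured. Converting the result to nanoseconds would still require knowing the TSC frequency, which is an additional platform-specific step this sketch leaves out:

```cpp
// Sketch only: measure the average cost of pause in TSC reference cycles.
// Assumes an x86/x64 target and a compiler providing __rdtsc/_mm_pause.
#include <cstdint>
#include <cstdio>
#if defined(_MSC_VER)
#include <intrin.h>
#else
#include <x86intrin.h>
#endif

int main()
{
    const uint64_t iterations = 1000000;

    uint64_t start = __rdtsc();
    for (uint64_t i = 0; i < iterations; i++)
    {
        _mm_pause();
    }
    uint64_t elapsedCycles = __rdtsc() - start;

    // The loop overhead is negligible relative to a million pauses, so this is a
    // reasonable approximation of cycles-per-pause on the core the thread ran on.
    printf("~%.1f reference cycles per pause\n", (double)elapsedCycles / (double)iterations);
    return 0;
}
```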
Notably, Linux always reports 1 billion here for the frequency (1 nanosecond ticks) even if the underlying hardware timer is actually much slower.

Stopwatch.Frequency also seems to report similarly. The tick count reports ticks in nanoseconds. I don't think this affects the measurements; it's just that the reported frequency doesn't indicate the actual underlying frequency of the clock. The actual frequency of the clock also does not necessarily reflect the total overhead involved in getting the ticks.
Additionally, there are places below where we try to measure the length of time between two QPC calls, and while QPC is guaranteed to be monotonic (never decreasing) there is no guarantee that two separate calls will not return the same value (which can occur if they happen faster than the underlying timer resolution).

I don't see any assumption being made in this code that the difference wouldn't be zero.
Additionally, on some hardware the underlying mechanism used by QPC could have a resolution as low as 10-20 cycles, or it could have a latency of nearly 100-300 ns (detailed in part here: https://learn.microsoft.com/en-us/windows/win32/sysinfo/acquiring-high-resolution-time-stamps). Meanwhile, the pause instruction (x86/x64) can execute in somewhere around 10-20 cycles, 40 cycles, or up to 150 cycles (depending on the hardware in question, with newer and higher core count machines often explicitly having longer delay times). All of these cycle times can be much less than the 100-300 ns latency that a QPC call might have, which can lead to skewed measurements.

Yes, that's why the QPC overhead is measured and the measurement duration is adjusted (with some reasonable limits). Previously the measurement was taken over a 10 ms interval, but that was too much, and taking multiple shorter measurements as currently done showed similar measurement results even on systems with higher QPC overhead.
We've also seen that yield on Arm64 is known to be a hint which can often be ignored or has a delay so low that many other ecosystems have switched to using the isb instruction instead to achieve a similar effect to the x86/x64 pause instruction.

The normalization should be able to figure that out and compensate by using more iterations. isb may be more useful when there is no normalization, to induce a more similar delay. isb appears to have other side effects (involving an instruction barrier) that may be unnecessary for spin-waiting.
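As a concrete illustration of the two spin-wait hints being compared here, the sketch below (assuming an AArch64 target and a GCC/Clang-style compiler; it is not what the runtime currently emits) shows the architectural yield hint next to an isb-based alternative:

```cpp
// Sketch only: the two AArch64 spin-wait hints discussed above.
inline void SpinHintYield()
{
    // Architectural hint; many cores treat it as close to a no-op.
    __asm__ __volatile__("yield");
}

inline void SpinHintIsb()
{
    // Instruction synchronization barrier: it drains the pipeline, which is why it tends to
    // take a more consistent, non-trivial number of cycles than yield on many Arm64 cores,
    // at the cost of the barrier semantics noted above.
    __asm__ __volatile__("isb sy" ::: "memory");
}
```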
This is somewhat pre-existing to the PR here, but I think there are some latent bugs in the general measurement logic here...

Let me know if you see any specific issues; I can take a look.
Notably, Linux always reports 1 billion here for the frequency (1 nanosecond ticks) even if the underlying hardware timer is actually much slower.

For instance, suppose the actual clock resolution is 1 us or higher (quite high, though perhaps possible). Each measurement would then typically be shorter than that duration, with some being shorter and some longer. The chosen measurement that is used for normalization favors a lower ns-per-yield, which would come from the longer measurements and would be a bit closer to the actual delay.
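A minimal sketch of that selection step, assuming a hypothetical MeasureNsPerYieldOnce that performs one short, possibly noisy measurement (this only illustrates the idea of favoring the lowest observed ns-per-yield; it is not the runtime's code):

```cpp
#include <algorithm>

double MeasureNsPerYieldOnce(); // hypothetical: one short measurement, possibly inflated by
                                // coarse clock resolution or a context switch

// Keep the lowest ns-per-yield observed across several short measurements, so inflated
// samples do not dominate the normalization result.
double EstimateNsPerYield(int measurementCount)
{
    double best = MeasureNsPerYieldOnce();
    for (int i = 1; i < measurementCount; i++)
    {
        best = std::min(best, MeasureNsPerYieldOnce());
    }
    return best;
}
```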
Stopwatch.Frequency also seems to report similarly. The tick count reports ticks in nanoseconds. I don't think this affects the measurements; it's just that the reported frequency doesn't indicate the actual underlying frequency of the clock. The actual frequency of the clock also does not necessarily reflect the total overhead involved in getting the ticks.

That's because Stopwatch.Frequency just calls into QPF, which is implemented in the PAL as a call to the underlying CLOCK_MONOTONIC (or similar), which always reports 1 ns ticks.
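Roughly speaking, the shape of that mapping is something like the sketch below (a simplified illustration, not the PAL's exact code):

```cpp
#include <cstdint>
#include <time.h>

constexpr uint64_t NsPerSecond = 1000000000;

// The counter is the monotonic clock expressed in nanoseconds...
uint64_t QpcLikeCounter()
{
    timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * NsPerSecond + (uint64_t)ts.tv_nsec;
}

// ...so the reported "frequency" is a constant 1 billion ticks per second, regardless of
// the resolution of the hardware timer backing the clock.
uint64_t QpcLikeFrequency()
{
    return NsPerSecond;
}
```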
The general consideration here is that we have logic such as this: https://github.com/dotnet/runtime/blob/cf053aba5c42ad94fe696238ff40f28ec82c7362/src/coreclr/vm/yieldprocessornormalized.cpp#L148C13-L153C14

The first check (did QPF fail) cannot be triggered on Windows, and the latter check (li.QuadPart < 1_000_000) cannot be triggered on Linux. Given the intent of the check, the general logic here is "flawed" and isn't actually accounting for whether normalization can be properly done. This is particularly relevant as we're much more likely to encounter a low clock resolution device on Linux, where embedded and low-end/legacy device support is more prevalent.

I don't see any assumption being made in this code that the difference wouldn't be zero.

I think the fact that we aren't explicitly checking for and handling that is the problem.

This call is trying to detect the overhead and it's registering it as 0:
runtime/src/coreclr/vm/yieldprocessornormalized.cpp, lines 48 to 56 in cf053ab:

    QueryPerformanceCounter(&li);
    UINT64 startTicks = li.QuadPart;
    QueryPerformanceCounter(&li);
    UINT64 elapsedTicks = li.QuadPart - startTicks;
    if (elapsedTicks >= s_performanceCounterTicksPerS * measureDurationUs * (1000 / 4) / NsPerS) // elapsed >= 1/4 of the measure duration
    {
        measureDurationUs *= 4;
    }
    return measureDurationUs;
So we end up with a MeasureDuration of 1us without actually confirming, for example, that 1us is a reasonable measurement duration.
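One possible way to handle the zero-delta case explicitly (a sketch only, not what this PR does; QueryPerformanceCounterValue is a hypothetical wrapper over the QPC read used here) would be to derive the measurement duration from the observed counter resolution rather than from an assumed non-zero overhead:

```cpp
#include <cstdint>

uint64_t QueryPerformanceCounterValue(); // hypothetical: returns the current QPC value

uint32_t ChooseMeasureDurationUs(uint64_t ticksPerSecond, uint32_t minDurationUs, uint32_t maxDurationUs)
{
    // Estimate the effective counter resolution by spinning until the counter advances,
    // instead of assuming two back-to-back reads will differ.
    uint64_t start = QueryPerformanceCounterValue();
    uint64_t next;
    do
    {
        next = QueryPerformanceCounterValue();
    } while (next == start); // assumes the monotonic counter eventually advances

    uint64_t resolutionUs = (next - start) * 1000000 / ticksPerSecond;

    // Make the measurement window span many resolution steps so a single coarse tick cannot
    // dominate the result, while clamping to keep the startup cost bounded.
    uint64_t durationUs = resolutionUs * 64;
    if (durationUs < minDurationUs) durationUs = minDurationUs;
    if (durationUs > maxDurationUs) durationUs = maxDurationUs;
    return (uint32_t)durationUs;
}
```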
This value is passed into MeasureNsPerYield: https://github.com/dotnet/runtime/blob/cf053aba5c42ad94fe696238ff40f28ec82c7362/src/coreclr/vm/yieldprocessornormalized.cpp#L59C46-L59C63. yieldCount will on many machines simply be 28 and measureDurationTicks will be 10.

On a machine like the Ryzen 7950X, which documents pause to be 64 cycles and therefore to take ~14ns to execute (base clock of 4.5GHz, can boost up to 5.7GHz), this is generally sufficient and gets roughly the right times. However, I've seen it compute anywhere up to 30ns per pause and down to 10ns, such as if the core it happens to measure is in a low-power state or has just finished a boost state. This variance can cause quite a bit of noise in some benchmarks or even just general app execution.
We then have some logic just below that which tries to account for inaccurate measures:

runtime/src/coreclr/vm/yieldprocessornormalized.cpp, lines 103 to 110 in cf053ab:

    // Limit the minimum to a reasonable value considering that on some systems a yield may be implemented as a no-op
    const double MinNsPerYield = 0.1;

    // Measured values higher than this don't affect values calculated for normalization, and it's very unlikely for a yield to
    // really take this long. Limit the maximum to keep the recorded values reasonable.
    const double MaxNsPerYield = YieldProcessorNormalization::TargetMaxNsPerSpinIteration / 1.5 + 1;

    return Max(MinNsPerYield, Min((double)elapsedTicks * NsPerS / ((double)yieldCount * ticksPerS), MaxNsPerYield));
However, the numbers are a bit odd: there is no such thing as a 10GHz processor, so 0.1 as a MinNsPerYield is a completely unrealistic value. Beyond that, a more typical processor is around 2.5GHz (often more for desktop or workstation, sometimes less for mobile), so a reasonable yield time (based on actual known data) would likely be a minimum of around 10 cycles, which at roughly 2.5-5GHz works out to around 2-4ns per yield as a reasonable minimum.
The normalization should be able to figure that out and compensate by using more iterations. isb may be more useful when there is no normalization, to induce a more similar delay. isb appears to have other side effects (involving an instruction barrier) that may be unnecessary for spin-waiting.

The main consideration is that if the underlying yield/pause instruction is truly a no-op, then we likely should be falling back to an alternative sequence that better matches the actual intent of the spin loop. If we're literally just executing what is functionally an empty while() loop, that's not great for the system.

There are a lot of cases, particularly for Arm64 systems, where we're seeing tons of context switches compared to an equivalent x64 machine. We're also spending far more time executing instructions in places like the underlying Semaphore spin loop used by the thread pool.

So not only are we spinning without letting the processor acknowledge the potential for a proper lower power state, but we're also doing a lot of potentially expensive work on a core whose time could be spent elsewhere.
For example, on my Volterra it's measuring around 0.5-5ns per yield, with some of those measurements being done on BIG cores and some on LITTLE cores, which means that separate measurements of yield could be problematic. If you just spin doing MeasureNsPerYield forever, tracking the min and max seen over time and the processor ID it's running on (a rough sketch of such a harness follows the numbers below), you get, for example:
min: 0.4478280340349306 5
max: 0.4478280340349306 6
min: 0.42662116040955633 6
max: 0.5071461502996772 5
min: 0.41736227045075125 6
min: 0.3985651654045436 5
max: 2.345415778251599 6
max: 2.487562189054726 5
min: 0.3838771593090211 6
min: 0.37091988130563797 5
max: 36.47746243739566 6
max: 89.28571428571429 5
max: 173.2193732193732 6
max: 174.28571428571428 0
max: 182.33333333333334 6
The really high numbers (89 and 173-182) are caused by OS context switches. Something similar happens on the 7950X:
min: 12.76595744680851 8
max: 12.76595744680851 8
max: 13.924050632911392 8
min: 12.658227848101266 8
min: 11.702127659574469 8
max: 16.949152542372882 8
max: 24.390243902439025 30
max: 26.829268292682926 26
max: 81.69014084507042 8
max: 142.85714285714286 8
max: 182.33333333333334 0
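For reference, here is a rough sketch of the kind of harness described above (MeasureNsPerYield stands in for the runtime's measurement routine; GetCurrentProcessorNumber is the Win32 API, with sched_getcpu playing the same role on Linux; this is an illustration, not the exact tool used to produce the numbers):

```cpp
#include <cstdio>
#include <windows.h>

double MeasureNsPerYield(); // hypothetical: one measurement, as in yieldprocessornormalized.cpp

int main()
{
    double minNs = 1.0e9;
    double maxNs = 0.0;

    for (;;)
    {
        double ns = MeasureNsPerYield();
        DWORD cpu = GetCurrentProcessorNumber();

        // Print only when a new extreme is observed, along with the processor it ran on.
        if (ns < minNs)
        {
            minNs = ns;
            printf("min: %.16g %lu\n", ns, (unsigned long)cpu);
        }
        if (ns > maxNs)
        {
            maxNs = ns;
            printf("max: %.16g %lu\n", ns, (unsigned long)cpu);
        }
    }
}
```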
I think these unaccounted-for nuances are causing us issues, especially for Arm64 on Linux, and are part of why the retired instruction counts, the number of context switches (which are currently more expensive for Arm64), and other metrics are so much higher.
I would imagine some of them would also cause issues for things like Intel Alder Lake CPUs, where you have Performance and Efficiency cores with very different cycle times and base/max clock speeds.
For example, the same code running on WSL or an actual direct boot to Linux (Ubuntu 22.04) reports the following for Ryzen 7950X:
min: 8.333333333333334 23
max: 8.333333333333334 23
min: 7.046979865771812 23
min: 6.709677419354839 23
min: 6.580645161290323 23
min: 6.538461538461538 23
max: 9.0625 23
max: 10.412371134020619 23
max: 40.796460176991154 23
max: 54.01639344262295 23
max: 102.53571428571429 23
max: 181.78571428571428 23
max: 182.14285714285714 23
max: 182.33333333333334 23
min: 6.5 23
min: 6.4743589743589745 23
min: 6.3354037267080745 23
min: 6.31875 23
min: 6.3125 23
min: 6.273291925465839 22
And the following for Volterra:
min: 0.7791195948578107 2
max: 0.7791195948578107 2
min: 0.5340114431023522 2
min: 0.43294614572333684 2
min: 0.41355658664514827 2
min: 0.3994154895275207 2
min: 0.3990119703591108 2
min: 0.3896736483195324 2
min: 0.38767019667170954 2
min: 0.3806164129223914 2
min: 0.3795242062390077 2
min: 0.37821482602118 2
max: 15.130674002751032 2
max: 32.18707015130674 2
min: 0.3668715032559846 2
max: 37.61467889908257 2
max: 53.211009174311926 2
max: 55.04587155963303 2
max: 166.9724770642202 2
max: 182.33333333333334 2
So the same code on the same hardware reports significantly different results on Linux, and on Arm64 it isn't measuring something that really does what we want, since yield takes so little time.
@tannergooding, can you file an issue? We can continue the discussion there.
Logged #100242
Force-pushed from cf053ab to acd40a5.
@jkotas @Maoni0 I collected some numbers using this test case \clrmain\public\writable\users\maonis\YieldProcessor. This is the branch with the baseline and this is the one with the changes. The GC did ~30% less

How do these numbers from the GCPerf simulator translate to a user-observable startup perf number? The spin count measurement should happen fairly quickly in the background. I do not think that you should ever see 100ms-200ms spent spinning during startup in a real world app. If we are actually spending that much time spinning in real world apps during startup, I think we have a different problem. I would like us to compare the solution proposed here with a much simpler solution that simply starts with a low spin count multiplier and lets the background measurement increase it.

I also suggested this as an alternative to @Maoni0; I think it would indeed solve this particular issue. From my measurements it did not take very long for the background measurement to update the GC with that info, in that particular test case, and in the meantime better defaults could suffice. Granted, our defaults in CoreCLR are not great for Arm (the GC has separate defaults). A thought is that the default could just be to not spin-wait, but that goes into territory that would need startup time measurements on various kinds of systems.

Something like: 87bbad1
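For context, the shape of that simpler alternative might look something like the sketch below (an illustration only, not the linked commit; s_yieldsPerNormalizedYield and OnMeasurementCompleted are hypothetical stand-ins for the runtime's normalization state and its background update path):

```cpp
// Start with a deliberately conservative multiplier so early spin waits stay short, and let
// the existing lazy background measurement raise it once real numbers are available.
static double s_yieldsPerNormalizedYield = 1.0; // conservative default before any measurement

void OnMeasurementCompleted(double measuredNsPerYield, double targetNsPerNormalizedYield)
{
    // Scale spin waits toward the intended normalized duration once measured data exists.
    s_yieldsPerNormalizedYield = targetNsPerNormalizedYield / measuredNsPerYield;
}
```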
This was discovered in the asp.net benchmarks; it took 3 GCs for the measurement to come in. I can also easily repro this on my machines, with a few GCs going by before the measurement comes in. I see this has taken way more time than I had anticipated. At this point I'd say I'd just do the "start with a low count" thing in the GC (which was what I did as a workaround just to see the effect anyway when this was discovered).
Wanted to clarify some things:
In any case, the main intent of this change is to free ourselves from having to come up with the correct defaults for every type of system out there, with a small price to pay for that. I think it's a worthy attempt, but I'd like to hear feedback.

Unless we can measure what we are getting out of the proposed change in real world apps, I would argue that no or very limited spin-waiting during startup is an even smaller price to pay and much simpler as well.

Maybe this issue can be closed for now. I've filed a followup issue (#101506) to check the defaults on ARM processors. If there are other cases where this sort of early measurement can help, perhaps they could be tackled as needed.
For yield normalization, we currently start by doing 8 iterations of ~1 microsecond measurements of the system's yield. These iterations are scheduled lazily, so the very first spin waits happen before the first measurements. Because of this, GC initialization and the first GCs can't use the adjusted yield, which led to spin waits that were slower than they should have been in some scenarios.

This is an attempt to solve that problem by doing 2 of those 8 iterations during startup (before GC initialization) while keeping the remaining 6 happening lazily, so that the first spin waits (including the ones done by the GC) can use an adjusted yield.
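In rough outline, the split looks like the sketch below (a sketch under stated assumptions, not the actual patch; MeasureNsPerYield, ScheduleBackgroundMeasurements, and SetNsPerYield are hypothetical stand-ins for the runtime's routines):

```cpp
#include <algorithm>

double MeasureNsPerYield();                          // hypothetical: one ~1 microsecond measurement
void ScheduleBackgroundMeasurements(int remaining);  // hypothetical: runs the rest lazily, as before
void SetNsPerYield(double nsPerYield);               // hypothetical: applies the normalization value

void InitializeYieldNormalizationEagerly()
{
    const int TotalIterations = 8;
    const int EagerIterations = 2;

    // Run a couple of short measurements up front, before GC initialization, so the first
    // spin waits (including the GC's) already use an adjusted yield.
    double best = MeasureNsPerYield();
    for (int i = 1; i < EagerIterations; i++)
    {
        best = std::min(best, MeasureNsPerYield());
    }
    SetNsPerYield(best);

    // The remaining iterations still happen lazily and refine the value over time.
    ScheduleBackgroundMeasurements(TotalIterations - EagerIterations);
}
```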